Cloud OCR: Difference between revisions

Latest revision as of 17:27, 7 March 2024

SimpleIndex includes Amazon Textract cloud-based OCR for advanced print and handwriting recognition, forms data extraction, and invoice processing.

Textract Features

Highest accuracy of any available OCR engine
Recognition of both print and cursive handwriting
Automatic extraction of form field labels and values without templates
Automatic extraction of standard fields from Invoices and Receipts
Capture of line item data from Invoices
Conversion of documents to JSON with the coordinates of all text

Additional Textract features can be added by request. These include lending document analysis, signature verification, table extraction, and queries (similar to ChatGPT). See the Customization page for details, or Contact Us to request a quote.

Limitations of Textract

While Textract enables a number of great new features, it does have some limitations.

Only asynchronous processing is available
No offline processing - must be connected to the Internet
AWS usage fees will be incurred for each page processed

Connect to Your AWS Account

Using Textract requires an AWS account. AWS offers a three-month free account that can process 100 pages per month. After the trial, you will incur charges for any documents processed using the Textract OCR option.

Once you've created an AWS account, follow Steps 1 and 2 in the Textract Getting Started Guide to create the credentials for linking SimpleIndex to your Textract account. (Step 3, updating the shared credentials, can be performed in SimpleIndex, as shown next.)

In summary, the setup process is:

Create an IAM user for Textract
Obtain the Access Key and Secret Access Key for the Textract user account
On the OCR Options tab of the Job Settings Wizard, select AWSForms, AWSText, or AWSInvoice as the OCR Engine
Click the AWS Creds button to enter your Region, User Access Key and Secret Access Key

To manually create the AWS credentials file under your user profile, follow these steps:

Create the folder c:\Users\xxx\.aws (replacing xxx with your Windows user name)
Create a file called config (no file extension) with Notepad and enter your region info
Be sure to use the abbreviated version of the region name (e.g. us-east-1) and not the full name
Create a file called credentials (no file extension) with Notepad and enter your Access and Secret keys
Copy the .aws folder and the files it contains to the profile directory of any other accounts that will use it, including service accounts

Example config file:

[default]
region = us-east-1

Example credentials file:

[default]
aws_access_key_id = YOUR-IAM-USER-ACCESS-KEY
aws_secret_access_key = YOUR-IAM-USER-SECRET-KEY
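As a quick sanity check on the two files above, the following Python sketch reads them with the standard library and reports anything that is missing. The helper name check_aws_profile is hypothetical and is not part of SimpleIndex or the AWS tools. Only the file names, the [default] section, and the three setting names are taken from the examples above.

```python
import configparser
from pathlib import Path


def check_aws_profile(aws_dir, profile="default"):
    """Return a list of problems found in the config and credentials files.

    Hypothetical helper for illustration only; not part of SimpleIndex.
    """
    aws_dir = Path(aws_dir)
    # Settings each file is expected to contain, per the examples above.
    required = {
        "config": ["region"],
        "credentials": ["aws_access_key_id", "aws_secret_access_key"],
    }
    problems = []
    for filename, keys in required.items():
        path = aws_dir / filename
        if not path.is_file():
            problems.append(f"missing file: {filename}")
            continue
        parser = configparser.ConfigParser()
        # utf-8-sig tolerates the byte order mark some editors add.
        parser.read(path, encoding="utf-8-sig")
        if not parser.has_section(profile):
            problems.append(f"{filename}: no [{profile}] section")
            continue
        for key in keys:
            if not parser.get(profile, key, fallback="").strip():
                problems.append(f"{filename}: {key} is not set")
    return problems
```

Calling check_aws_profile(Path.home() / ".aws") checks the folder under the current user profile. An empty list means both files contain the expected settings, but it does not confirm that AWS accepts the keys.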

AWSText Engine

In the OCR Options screen, set the OCR Engine to AWSText to enable basic full-text extraction.

This uses the lowest-cost text extraction option, which is priced well below the AWSForms and AWSInvoice options (see Pricing for the per-page rates).

Document text will be output to plain text files, with formatting designed to replicate the original document structure.

AWSForms Engine

Use the AWSForms option to extract key/value pairs for any detected form fields on your document.

Textract will automatically recognize any labeled field and extract both the text of the label and the corresponding value for each.

In the converted text, key/value pairs will be output as:

Label1Text~Field 1 value
Label2Text~Field 2 value

To capture the value to an index field, create an OCR field with Template matching using the following value:

%AWS%|Label1Text

If several different label texts can correspond to the same index field, you can enter multiple templates separated by a pipe "|" character. For example:

%AWS%|PO|Purchase Order:|PO Num
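To illustrate how a template with several alternative labels relates to the Label~Value output, here is a minimal Python sketch. It is not SimpleIndex's implementation: the function names are hypothetical, and it assumes exact, case-sensitive label matching with the alternatives tried from left to right, which the text above does not specify. It also assumes that only the key/value portion of the text file is passed in.

```python
def parse_key_values(text):
    """Collect Label~Value lines into a dict, skipping lines without a '~'."""
    pairs = {}
    for line in text.splitlines():
        label, sep, value = line.partition("~")
        if sep:
            # setdefault keeps the first occurrence of a repeated label.
            pairs.setdefault(label.strip(), value.strip())
    return pairs


def first_match(pairs, template):
    """Return the value of the first label in a %AWS% template that is present.

    Assumption: labels are compared exactly and in the order listed.
    """
    prefix, *labels = template.split("|")
    if prefix != "%AWS%":
        raise ValueError("template must start with %AWS%")
    for label in labels:
        if label in pairs:
            return pairs[label]
    return None
```

For a document whose output contains the line "PO Num~12345", first_match(parse_key_values(text), "%AWS%|PO|Purchase Order:|PO Num") returns "12345", because "PO Num" is the first listed label that appears in the output.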

AWSInvoice Engine

The AWSInvoice engine is a specifically trained machine learning model that is designed to extract key information and line items from invoices.

It works similarly to the AWSForms engine by extracting key/value pairs, but it standardizes the names of common invoice fields so that you do not have to identify them by different label variations as you would with AWSForms.

Example output from an invoice is:

VENDOR_NAME~DOCUMENT SERVICES
TOTAL~$372.00
RECEIVER_ADDRESS~BILL TO: YOUR CUSTOMER 123 5TH AVENUE NEW YORK NY 10012
INVOICE_RECEIPT_DATE~07/31/2021
INVOICE_RECEIPT_ID~210743
PAYMENT_TERMS~30 DAYS
SUBTOTAL~$372.00
TAX~$0.00
LINE1EXPENSE_ROW~DOCUMENT CONVERSION 31.00 $12.00DOC $372.00 $372.00
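Because the field names are standardized, output like the example above is straightforward to post-process. The Python sketch below separates summary fields from line items and converts currency strings for a simple consistency check. The function names are hypothetical, and the LINE<n>EXPENSE_ROW pattern is an assumption generalized from the single LINE1EXPENSE_ROW key shown above.

```python
import re
from decimal import Decimal

# Assumed key pattern for line items, inferred from LINE1EXPENSE_ROW.
LINE_ITEM = re.compile(r"^LINE(\d+)EXPENSE_ROW$")


def parse_invoice(text):
    """Split invoice output into summary fields and numbered line items."""
    fields, line_items = {}, {}
    for line in text.splitlines():
        key, sep, value = line.partition("~")
        if not sep:
            continue  # not a key/value line
        match = LINE_ITEM.match(key)
        if match:
            line_items[int(match.group(1))] = value
        else:
            fields[key] = value
    return fields, line_items


def to_amount(value):
    """Convert a currency string such as '$372.00' to a Decimal."""
    return Decimal(value.replace("$", "").replace(",", ""))
```

With the example output, to_amount(fields["SUBTOTAL"]) + to_amount(fields["TAX"]) equals to_amount(fields["TOTAL"]), which is one way to flag invoices whose extracted amounts do not add up.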

JSON Data

The JSON data for each document is appended to the text file following the key/value pair list. This can be used to obtain additional data for any text, such as the confidence values or pixel coordinates. The JSON can also be deserialized to an AnalyzeDocumentResponse object in the AWS SDK so you can interact with it programmatically.
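As a rough illustration of reading confidence values and coordinates without the AWS SDK, the Python sketch below walks a Blocks-style Textract response. It assumes the JSON has already been separated from the key/value list and passed in as a string, since the exact layout of the text file is not described here. The function name is hypothetical. Textract reports bounding boxes as fractions of the page width and height, so the sketch multiplies them by the image dimensions to obtain pixels.

```python
import json


def lines_with_geometry(response_json, page_width, page_height, min_confidence=0.0):
    """Yield (text, confidence, pixel_box) for each LINE block.

    pixel_box is (left, top, width, height) in pixels. Assumes a response
    with a top-level "Blocks" list; expense responses are structured
    differently and are not handled here.
    """
    response = json.loads(response_json)
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        confidence = block.get("Confidence", 0.0)
        if confidence < min_confidence:
            continue
        box = block["Geometry"]["BoundingBox"]
        # BoundingBox values are ratios of the page size, not pixels.
        pixel_box = (
            round(box["Left"] * page_width),
            round(box["Top"] * page_height),
            round(box["Width"] * page_width),
            round(box["Height"] * page_height),
        )
        yield block["Text"], confidence, pixel_box
```

Passing min_confidence=90, for example, keeps only lines that Textract scored at 90 or higher, which can help route low-confidence pages to manual review.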

Pricing

SimpleIndex with Amazon Textract has a two-tier cost structure. First, the correct version of SimpleIndex needs to be purchased, which can be found on SimpleIndex.com. Second, a per-image cost is paid directly to Amazon. SimpleIndex must be linked to an Amazon AWS account through the SimpleIndex Job Configuration interface. Once the AWS account and SimpleIndex are linked, the images processed with the Amazon Textract Cloud OCR engine in SimpleIndex are counted automatically on the AWS account, and Amazon charges this account directly for the total number of images processed.

Amazon AWS Pricing

Base Pricing

AWSText (Detect Document Text API) = $0.0015 per page / $1.50 per 1,000 pages
AWSForms (Analyze Document API - Forms) = $0.05 per page / $50.00 per 1,000 pages
AWSInvoice (Analyze Expense API) = $0.01 per page / $10.00 per 1,000 pages
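To estimate the AWS portion of the cost for a batch, the per-1,000-page rates listed above can be applied directly. The Python sketch below does only that arithmetic. The function name is hypothetical, and the estimate ignores the free trial, any volume discounts, and later price changes, so treat the result as a rough figure and check the Amazon AWS Pricing page for current rates.

```python
from decimal import Decimal

# USD per 1,000 pages, copied from the base pricing list above.
RATE_PER_1000 = {
    "AWSText": Decimal("1.50"),
    "AWSForms": Decimal("50.00"),
    "AWSInvoice": Decimal("10.00"),
}


def estimate_cost(engine, pages):
    """Return the estimated AWS charge in USD for the given page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    cost = RATE_PER_1000[engine] * pages / 1000
    return cost.quantize(Decimal("0.01"))  # round to whole cents
```

For example, estimate_cost("AWSForms", 2500) works out to 50.00 × 2.5 = 125.00 USD, while the same 2,500 pages through AWSText come to 3.75 USD.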

Amazon Textract integration into SimpleIndex Video

The video was recorded with a previous version of SimpleIndex. Refer to the wiki documentation for the latest updates.

@@ Line 1: / Line 1: @@
-SimpleIndex 10.1 adds Amazon Textract cloud-based OCR to the available OCR Engines.
+SimpleIndex includes Amazon Textract cloud-based [[OCR]] for advanced print and handwriting recognition, forms data extraction, and invoice processing.
 == Textract Features ==
 * Highest accuracy of any available OCR engine
-* Recognition of both print and [[cursive]] [[handwriting]]
+* Recognition of both print and cursive handwriting
 * Automatic extraction of form field labels and values without templates
-* Automatic extraction of standard fields from [[Invoices]] and Receipts
+* Automatic extraction of standard fields from Invoices and Receipts
-* Capture of line item data from [[Invoices]]
+* Capture of line item data from Invoices
 * Convert documents to [[JSON]] with coordinates and location of all text
+Additional Textract features can be added by request. These include lending document analysis, signature verification, table extraction, and queries (similar to [[ChatGPT]]). See the [[Customization]] page for details, or [https://www.simpleindex.com/contact-us/ Contact Us] to request a quote.
 == Limitations of Textract ==
@@ Line 14: / Line 16: @@
 While Textract enables a number of great new features, it does have some limitations.
-* Only single page TIFF images can be processed with Textract
-* Other file types must be converted to single page TIFF prior to processing
-* [[Searchable PDF]] output is not supported
 * Only asynchronous processing is available
 * No offline processing - must be connected to the Internet
@@ Line 23: / Line 22: @@
 == Connect to Your AWS Account ==
-Using Textract requires an AWS account, which will incur charges for any documents processed using the Textract OCR option.
+Using Textract requires an AWS account. AWS offers a three month [https://aws.amazon.com/textract/pricing/?refid=ft_textract#Free_Tier free account] that can process 100 pages per month. After the trial, you will incur [https://aws.amazon.com/textract/pricing/?refid=ft_textract#Pricing_examples_outside_the_free_tier charges] for any documents processed using the Textract OCR option.
-Follow the directions on the [https://docs.aws.amazon.com/textract/latest/dg/getting-started.html Textract Getting Started Guide] to connect SimpleIndex to your Textract account.
+Once you've created an AWS account, follow Steps 1 and 2 in the [https://docs.aws.amazon.com/sdkref/latest/guide/access-iam-users.html Textract Getting Started Guide] to create the credentials for linking SimpleIndex to your Textract account (step 3, updating the shared credentials can be performed in SimpleIndex, as shown next.)
 In summary the setup process is:
@@ Line 31: / Line 30: @@
 # Create an IAM user for Textract
 # Obtain the Access Key and Secret Access Key for the Textract user account
+# On the [[OCR Options]] tab of the [[Job Settings Wizard]], select AWSForms, AWSText, or AWSInvoice as the [[OCR]] Engine
+# Click the ''AWS Creds'' button to enter your Region, User Access Key and Secret Access Key
+To manually create the AWS credentials file under your user profile, follow these steps:
 # Create the folder c:\Users\xxx\.aws (replacing xxx with your Windows user name)
 # Create a file called config (no file extension) with notepad and enter your region info
@@ Line 67: / Line 71: @@
 Label2Text~Field 2 value
-To capture the value to an [[index field]], create an OCR field with [[Template]] matching using the following [[Regular Expression]]:
+To capture the value to an [[index field]], create an OCR field with [[Template]] matching using the following value:
-(?<=Label1Text~).*
+%AWS%|Label1Text
 If you have multiple possible label text corresponding to the same [[index field]], you can enter multiple templates separated by a pipe "|" character. For example:
-(?<=PO~).*|(?<=Purchase Order:~).*|(?<=PO Num~).*
+%AWS%|PO|Purchase Order:|PO Num
 == AWSInvoice Engine ==
@@ Line 96: / Line 100: @@
 The JSON data for each document is appended to the text file following key/value pair list. This can be used to obtain additional data for any text, such as the confidence values or pixel coordinates. It can also be used to deserialize the JSON to an AnalyzeDocumentResponse object in the AWS SDK so you can interact with it programmatically.
+== Pricing ==
+SimpleIndex with Amazon Textract has a dual tiered license structure.  First, the correct version of SimpleIndex needs to be purchased, which can be found on [https://www.simpleindex.com/shop/ SimpleIndex.com].  Second, a per image cost needs to be paid directly to Amazon.  A [[link to an Amazon AWS account needs to be made to SimpleIndex through the SimpleIndex Job Configuration interface]].  Once the Amazon AWS Account and SimpleIndex are linked, processing files using the Amazon Textract Cloud OCR Engine in SimpleIndex the images that are processed will be kept count automatically on the Amazon AWS account.  Amazon will directly charge this account for the total number of images processed.
+<b>[https://aws.amazon.com/textract/pricing/ Amazon AWS Pricing]</b>
+''Base Pricing''
+<b>AWSText</b> (Detect Document Text API) = $0.0015 per page / $1.50 per 1,000 pages<br>
+<b>AWSForms</b> (Analyze Document API - Forms) = $0.05 per page / $50.00 per 1,000 pages<br>
+<b>AWSInovice</b> (Analyze Expense API) = $0.10 per page / $10.00 per 1,000 pages
+== Amazon Textract integration into SimpleIndex Video ==
+Video was recorded in a previous version of SimpleIndex. Refer to the wiki documentation for latest updates.
+<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><youtube>j8vzil3sZ-c</youtube></div>
+<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><youtube>https://youtu.be/2-j4niG3eKA?si=nC0or8QsL7srbSJD</youtube></div>