Cloud OCR

SimpleIndex 10.1 adds Amazon Textract cloud-based OCR to the available OCR Engines.

Textract Features

  • Highest accuracy of any available OCR engine
  • Recognition of both print and cursive handwriting
  • Automatic extraction of form field labels and values without templates
  • Automatic extraction of standard fields from Invoices and Receipts
  • Capture of line item data from Invoices
  • Conversion of documents to JSON with the coordinates of all text

Limitations of Textract

While Textract enables a number of great new features, it does have some limitations.

  • Only single-page TIFF images can be processed with Textract
  • Other file types must be converted to single-page TIFF prior to processing
  • Searchable PDF output is not supported
  • Only asynchronous processing is available
  • No offline processing - must be connected to the Internet
  • AWS usage fees will be incurred for each page processed

Connect to Your AWS Account

Using Textract requires an AWS account, which will incur charges for any documents processed using the Textract OCR option.

Follow the directions in the Textract Getting Started Guide to connect SimpleIndex to your Textract account.

In summary, the setup process is:

  1. Create an IAM user for Textract
  2. Obtain the Access Key and Secret Access Key for the Textract user account
  3. Create the folder C:\Users\xxx\.aws (replacing xxx with your Windows user name)
  4. In that folder, create a file called config (no file extension) with Notepad and enter your region info. Be sure to use the abbreviated version of the region name (e.g. us-east-1) and not the full name
  5. In the same folder, create a file called credentials (no file extension) with Notepad and enter your Access and Secret keys
  6. Copy the .aws folder and its config and credentials files to the profile directory for any other accounts that will use it, including service accounts

Example config file:

[default]
region = us-east-1

Example credentials file:

[default]
aws_access_key_id = YOUR-IAM-USER-ACCESS-KEY
aws_secret_access_key = YOUR-IAM-USER-SECRET-KEY
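
If you need to set up several profiles, the two files can also be generated by a script instead of Notepad. The following is a minimal Python sketch, not part of SimpleIndex; the function name write_aws_profile and its parameters are hypothetical, and it assumes the profile directory is passed in explicitly (for example C:\Users\xxx).

```python
import configparser
from pathlib import Path


def write_aws_profile(home_dir, region, access_key, secret_key):
    """Create <home_dir>/.aws containing `config` and `credentials` files.

    Hypothetical helper for illustration; the file layout follows the
    examples shown above.
    """
    aws_dir = Path(home_dir) / ".aws"
    aws_dir.mkdir(parents=True, exist_ok=True)

    # config: the short region code (e.g. us-east-1), not the full name.
    config = configparser.ConfigParser(interpolation=None)
    config["default"] = {"region": region}
    with open(aws_dir / "config", "w") as handle:
        config.write(handle)

    # credentials: the IAM user's access key and secret access key.
    credentials = configparser.ConfigParser(interpolation=None)
    credentials["default"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    with open(aws_dir / "credentials", "w") as handle:
        credentials.write(handle)

    return aws_dir
```

Calling write_aws_profile once per Windows account (including service accounts) would take the place of copying the .aws folder by hand.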

AWSText Engine

In the OCR Options screen, set the OCR Engine to AWSText to enable basic full-text extraction.

This is the lowest-cost text extraction option, typically 1/5 to 1/4 of the cost of the AWSForms or AWSInvoice options.

Document text will be output to plain text files, with formatting designed to replicate the original document structure.

AWSForms Engine

Use the AWSForms option to extract key/value pairs for any detected form fields on your document.

Textract will automatically recognize any labeled field and extract both the text of the label and the corresponding value for each.

In the converted text, key/value pairs will be output as:

Label1Text~Field 1 value
Label2Text~Field 2 value

To capture the value to an index field, create an OCR field with Template matching using the following Regular Expression:

(?<=Label1Text~).*

If several different labels can correspond to the same index field, you can enter multiple templates separated by a pipe "|" character. For example:

(?<=PO~).*|(?<=Purchase Order:~).*|(?<=PO Num~).*
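
To check a template before entering it in SimpleIndex, you can try it against sample text. The sketch below uses Python's re module purely for illustration; SimpleIndex's own regular expression engine may differ in details, and the sample text and the helper name capture_field are made up for this example.

```python
import re

# Made-up sample of converted text in the Label~Value form described above.
converted_text = "PO Num~48213\nShip To~123 Main St"

# The multi-label template from the text, unchanged.
template = r"(?<=PO~).*|(?<=Purchase Order:~).*|(?<=PO Num~).*"


def capture_field(pattern, text):
    """Return the first match of the template, or None if no label is found."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Only the "PO Num~" alternative applies here; ".*" stops at the end of the line.
po_number = capture_field(template, converted_text)
```

A lookbehind only checks the characters immediately before the value, so a longer label that ends in the same text (for example "Customer PO~") would also satisfy (?<=PO~). Keep this in mind when choosing short labels.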

AWSInvoice Engine

The AWSInvoice engine is a specially trained machine learning model designed to extract key information and line items from invoices.

It works similarly to the AWSForms engine, extracting key/value pairs, but it standardizes the names of common invoice fields so you do not have to identify them by different label variations as you would with AWSForms.

Example output from an invoice is:

VENDOR_NAME~DOCUMENT SERVICES
TOTAL~$372.00
RECEIVER_ADDRESS~BILL TO: YOUR CUSTOMER 123 5TH AVENUE NEW YORK NY 10012
INVOICE_RECEIPT_DATE~07/31/2021
INVOICE_RECEIPT_ID~210743
PAYMENT_TERMS~30 DAYS
SUBTOTAL~$372.00
TAX~$0.00
LINE1EXPENSE_ROW~DOCUMENT CONVERSION 31.00 $12.00DOC $372.00 $372.00
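
If you post-process this output outside SimpleIndex, the key/value lines are easy to load into a dictionary. The following Python sketch is one possible approach; the function name is hypothetical. It assumes that line items always follow the LINE<n>EXPENSE_ROW naming seen in the single example above, and that only the key/value portion of the file is passed in.

```python
import re


def parse_invoice_output(text):
    """Split KEY~value lines into header fields and numbered line items."""
    fields = {}
    line_items = {}
    for line in text.splitlines():
        # Split on the first "~" only, so values may contain "~" themselves.
        key, separator, value = line.partition("~")
        if not separator:
            continue  # not a key/value line
        # Assumed naming pattern, generalized from LINE1EXPENSE_ROW.
        item = re.fullmatch(r"LINE(\d+)EXPENSE_ROW", key)
        if item:
            line_items[int(item.group(1))] = value
        else:
            fields[key] = value
    return fields, line_items
```

With the example output above, fields["TOTAL"] would hold "$372.00" and line_items[1] would hold the text of the first expense row.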

JSON Data

The JSON data for each document is appended to the text file, following the key/value pair list. It can be used to obtain additional data for any text, such as its confidence value or coordinates. The JSON can also be deserialized into an AnalyzeDocumentResponse object in the AWS SDK so you can work with it programmatically.
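
As a rough illustration of reading this data without the AWS SDK, the Python sketch below separates the JSON from the text and lists each line of text with its confidence and position. It rests on two assumptions that are not confirmed here: that the JSON starts at the first line beginning with "{", and that it uses the standard Textract field names (Blocks, BlockType, Confidence, Geometry, BoundingBox). Textract reports bounding boxes as fractions of the page width and height, so the sketch scales them to pixels using page dimensions you supply. Both function names are hypothetical.

```python
import json


def split_text_and_json(file_text):
    """Return (text part, parsed JSON), or (file_text, None) if no JSON is found.

    Assumption: the JSON begins at the first line that starts with "{".
    """
    offset = 0
    for line in file_text.splitlines(keepends=True):
        if line.startswith("{"):
            data, _ = json.JSONDecoder().raw_decode(file_text, offset)
            return file_text[:offset], data
        offset += len(line)
    return file_text, None


def line_boxes(response, page_width, page_height):
    """Yield (text, confidence, left_px, top_px) for each LINE block."""
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        yield (
            block["Text"],
            block["Confidence"],
            round(box["Left"] * page_width),   # fraction of width -> pixels
            round(box["Top"] * page_height),   # fraction of height -> pixels
        )
```

The page width and height in pixels are not part of the assumed JSON, so they would need to come from the source image.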