Cloud OCR


SimpleIndex includes Amazon Textract cloud-based OCR for advanced print and handwriting recognition, forms data extraction, and invoice processing.

Textract Features

  • Highest accuracy of any available OCR engine
  • Recognition of both print and cursive handwriting
  • Automatic extraction of form field labels and values without templates
  • Automatic extraction of standard fields from Invoices and Receipts
  • Capture of line item data from Invoices
  • Conversion of documents to JSON with the coordinates of all text

Additional Textract features can be added by request. These include lending document analysis, signature verification, table extraction, and queries (similar to ChatGPT). See the Customization page for details, or Contact Us to request a quote.

Limitations of Textract

While Textract enables a number of great new features, it does have some limitations.

  • Only single page TIFF images can be processed with Textract
  • Other file types must be converted to single page TIFF prior to processing
  • Searchable PDF output is not supported
  • Only asynchronous processing is available
  • No offline processing - must be connected to the Internet
  • AWS usage fees will be incurred for each page processed

Connect to Your AWS Account

Using Textract requires an AWS account, which will incur charges for any documents processed using the Textract OCR option.

Follow the directions in the Textract Getting Started Guide to connect SimpleIndex to your Textract account.

In summary, the setup process is:

  1. Create an IAM user for Textract
  2. Obtain the Access Key and Secret Access Key for the Textract user account
  3. On the OCR Options tab of the Job Settings Wizard, select AWSForms, AWSText, or AWSInvoice as the OCR Engine
  4. Click the AWS Creds button to enter your User Access Key and Secret Access Key

To manually create the AWS credentials file under your user profile, follow these steps:

  1. Create the folder C:\Users\xxx\.aws (replacing xxx with your Windows user name)
  2. Create a file called config (no file extension) with Notepad and enter your region info
  3. Be sure to use the abbreviated version of the region name (e.g. us-east-1) and not the full name
  4. Create a file called credentials (no file extension) with Notepad and enter your Access Key and Secret Access Key
  5. Copy the .aws folder, including the config and credentials files, to the profile directory of any other account that will use it, including service accounts

Example config file:

[default]
region = us-east-1

Example credentials file:

[default]
aws_access_key_id = YOUR-IAM-USER-ACCESS-KEY
aws_secret_access_key = YOUR-IAM-USER-SECRET-KEY
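
If you prefer to script the manual steps above, the following Python sketch writes both files. It assumes the standard AWS shared-file layout, in which settings sit under a [default] profile header. The function name write_aws_profile and its parameters are hypothetical and are not part of SimpleIndex.

```python
import configparser
from pathlib import Path


def write_aws_profile(aws_dir, region, access_key, secret_key):
    """Write minimal `config` and `credentials` files into aws_dir.

    Hypothetical helper for illustration; not part of SimpleIndex.
    """
    aws_dir = Path(aws_dir)
    aws_dir.mkdir(parents=True, exist_ok=True)

    # interpolation=None keeps key values literal, whatever characters they contain
    config = configparser.ConfigParser(interpolation=None)
    config["default"] = {"region": region}
    with open(aws_dir / "config", "w", encoding="utf-8") as fh:
        config.write(fh)

    credentials = configparser.ConfigParser(interpolation=None)
    credentials["default"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    with open(aws_dir / "credentials", "w", encoding="utf-8") as fh:
        credentials.write(fh)

    return aws_dir
```

For example, write_aws_profile(Path.home() / ".aws", "us-east-1", "YOUR-IAM-USER-ACCESS-KEY", "YOUR-IAM-USER-SECRET-KEY") would create the files under the current user's profile. Existing files with the same names are overwritten, so back them up first if they hold other profiles.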

AWSText Engine

In the OCR Options screen, set the OCR Engine to AWSText to enable basic full-text extraction.

This is the lowest-cost text extraction option and is priced well below the AWSForms and AWSInvoice options (see Amazon AWS Pricing below).

Document text will be output to plain text files, with formatting designed to replicate the original document structure.
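
The pricing section maps AWSText to the Detect Document Text API. For readers curious what that call looks like outside SimpleIndex, here is a minimal Python sketch. It is not how SimpleIndex invokes Textract internally. The function name detect_lines is hypothetical, and the client argument is assumed to be a Textract client such as the one returned by boto3.client("textract").

```python
def detect_lines(client, image_bytes):
    """Return the text of each LINE block Textract reports for one image.

    `client` is assumed to expose detect_document_text, as a boto3
    Textract client does. Hypothetical helper for illustration only.
    """
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
```

Passing the client in as a parameter keeps the sketch free of third-party imports and makes it easy to try with a stub object before any billable pages are sent.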

AWSForms Engine

Use the AWSForms option to extract key/value pairs for any detected form fields on your document.

Textract will automatically recognize any labeled field and extract both the text of the label and the corresponding value for each.

In the converted text, key/value pairs will be output as:

Label1Text~Field 1 value
Label2Text~Field 2 value
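
If you post-process the converted text yourself, the label~value lines can be read into a dictionary. The Python sketch below is one possible approach. The function name parse_key_value_lines is hypothetical, and stopping at the first line that begins with "{" is an assumption about where the appended JSON data starts.

```python
def parse_key_value_lines(text, separator="~"):
    """Collect label/value pairs from lines shaped like `Label~Value`.

    Hypothetical helper; the stop condition below is an assumption
    about where the appended JSON begins.
    """
    pairs = {}
    for line in text.splitlines():
        if line.lstrip().startswith("{"):
            break  # assumed start of the JSON section
        label, found, value = line.partition(separator)
        if found and label.strip():
            # keep the first value when a label appears more than once
            pairs.setdefault(label.strip(), value.strip())
    return pairs
```

Applied to the two example lines above, this yields a dictionary with the keys Label1Text and Label2Text.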

To capture the value to an index field, create an OCR field with Template matching using the following value:


If several different labels can correspond to the same index field, enter multiple templates separated by a pipe ("|") character. For example:

%AWS%|PO|Purchase Order:|PO Num
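
To illustrate the idea of trying several label alternatives in order, here is a short Python sketch. It is not SimpleIndex's actual matching logic. It assumes that %AWS% is only a marker and that the remaining pipe-separated parts are labels compared exactly, in the order given. The function name first_matching_value is hypothetical.

```python
def first_matching_value(template, pairs):
    """Return the value for the first label in `template` found in `pairs`.

    Assumption: "%AWS%" is a marker, and every other pipe-separated
    part is a label to compare exactly (case-sensitive).
    """
    labels = [part.strip() for part in template.split("|")]
    for label in labels:
        if not label or label == "%AWS%":
            continue  # skip the marker and any empty parts
        if label in pairs:
            return pairs[label]
    return None
```

With pairs such as {"Purchase Order:": "12345"}, the example template would return "12345" because "PO" is absent and "Purchase Order:" is the next alternative tried.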

AWSInvoice Engine

The AWSInvoice engine is a specially trained machine learning model designed to extract key information and line items from invoices.

It works similarly to AWSForms by extracting key/value pairs, but it standardizes the names of common invoice fields, so you do not have to identify them by different label variations as you would with AWSForms.

Example output from an invoice is:


JSON Data

The JSON data for each document is appended to the text file, following the key/value pair list. It can be used to obtain additional data for any text, such as confidence values or pixel coordinates. The JSON can also be deserialized to an AnalyzeDocumentResponse object in the AWS SDK so you can interact with it programmatically.
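
As a rough sketch of working with the appended JSON outside the AWS SDK, the Python below locates the JSON in the converted text and reads confidence values and coordinates from it. It assumes the JSON uses the Textract API field names (Blocks, BlockType, Text, Confidence, Geometry, BoundingBox) and that the first parseable "{" marks its start. The function names are hypothetical.

```python
import json


def extract_trailing_json(file_text):
    """Return the first JSON object found in file_text, or None.

    Assumption: the appended Textract data is the first text in the
    file that parses as a JSON object.
    """
    decoder = json.JSONDecoder()
    pos = file_text.find("{")
    while pos != -1:
        try:
            obj, _end = decoder.raw_decode(file_text, pos)
            return obj
        except json.JSONDecodeError:
            pos = file_text.find("{", pos + 1)
    return None


def low_confidence_lines(response, threshold=90.0):
    """List (text, confidence) for LINE blocks scoring below threshold."""
    flagged = []
    for block in response.get("Blocks", []):
        confidence = block.get("Confidence")
        if block.get("BlockType") == "LINE" and confidence is not None:
            if confidence < threshold:
                flagged.append((block.get("Text", ""), confidence))
    return flagged


def bounding_box_pixels(block, page_width, page_height):
    """Convert a block's ratio-based bounding box to pixel values.

    Textract reports Left, Top, Width and Height as fractions of the
    page size, so they are scaled by the image dimensions here.
    Returns (left, top, width, height).
    """
    box = block["Geometry"]["BoundingBox"]
    return (
        round(box["Left"] * page_width),
        round(box["Top"] * page_height),
        round(box["Width"] * page_width),
        round(box["Height"] * page_height),
    )
```

The page width and height must come from the source image, because the bounding box values are stored as fractions of the page rather than as pixels.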


SimpleIndex with Amazon Textract has a two-tier license structure. First, the correct version of SimpleIndex must be purchased. Second, a per-image cost is paid directly to Amazon. Your AWS account must be linked to SimpleIndex through the SimpleIndex Job Configuration interface. Once the accounts are linked, every image processed with the Amazon Textract Cloud OCR engine in SimpleIndex is counted automatically on the AWS account, and Amazon charges that account directly for the total number of images processed.

Amazon AWS Pricing

Base Pricing

AWSText (Detect Document Text API) = $0.0015 per page / $1.50 per 1,000 pages
AWSForms (Analyze Document API - Forms) = $0.05 per page / $50.00 per 1,000 pages
AWSInvoice (Analyze Expense API) = $0.01 per page / $10.00 per 1,000 pages
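
To budget a batch, the per-1,000-page rates listed above can be turned into a quick estimate. The Python sketch below is illustrative only. It uses the base rates as listed here, which may change and which ignore any volume tiers. The names RATES_PER_1000 and estimate_cost are hypothetical.

```python
from decimal import Decimal

# Base rates in USD per 1,000 pages, copied from the list above.
# These are assumptions for estimation; check current AWS pricing.
RATES_PER_1000 = {
    "AWSText": Decimal("1.50"),
    "AWSForms": Decimal("50.00"),
    "AWSInvoice": Decimal("10.00"),
}


def estimate_cost(engine, pages):
    """Estimate the AWS usage fee in USD for `pages` pages."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    cost = RATES_PER_1000[engine] * pages / 1000
    return cost.quantize(Decimal("0.01"))  # round to whole cents
```

For instance, 2,500 pages works out to 1.50 × 2.5 = $3.75 with AWSText and 50.00 × 2.5 = $125.00 with AWSForms.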