Amazon Textract OCR and ICR

What is Amazon Textract?

Amazon Textract is a service that automatically detects and extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables.

Benefits

Extract data quickly and accurately
Amazon Textract makes it easy to quickly and accurately extract data from documents and forms. Amazon Textract automatically detects a document's layout and the key elements on the page, understands the data relationships in any embedded forms or tables, and extracts everything with its context intact. This means you can instantly use the extracted data in an application or store it in a database without a lot of complicated code in between.
No code or templates to maintain
With Amazon Textract's pre-trained machine learning models, you don't need to write code for data extraction. This is because the models have already been trained on tens of millions of documents from many industries—including invoices, receipts, contracts, tax documents, sales orders, enrollment forms, benefit applications, insurance claims, and policy documents. You no longer need to maintain code for every document or form you might receive, or worry about how page layouts change over time.
Easily implement human reviews
With the addition of Amazon Augmented AI, you can build in human reviews to manage nuanced or sensitive workflows that require human judgment to get high-confidence predictions, or to audit predictions on an ongoing basis.
Lower document processing costs
Amazon Textract's text extraction API enables you to process documents for $1.50 per 1,000 pages. Whether you process a few hundred documents a year or millions, Amazon Textract provides OCR and structured data extraction (forms and tables) at a very low cost, and you only pay for what you use. There are no upfront commitments or long-term contracts.

How Does Amazon Textract Work?

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Today, many companies extract data from scanned documents such as PDFs, images, tables, and forms by hand, or through simple OCR software that requires manual configuration (which often must be updated when the form changes). To overcome these manual and expensive processes, Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort. You can quickly automate document processing and act on the information extracted, whether you're automating loan processing or extracting information from invoices and receipts. Textract can extract the data in minutes instead of hours or days. Additionally, you can add human reviews with Amazon Augmented AI to provide oversight of your models and check sensitive data.
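As a rough illustration of what the service returns, here is a minimal Python sketch that calls the Detect Document Text API directly through the AWS SDK for Python (boto3) and keeps only the detected lines of text. It is not part of SimpleIndex. The function names, the default region, and the confidence cut-off are assumptions made for this example, and it presumes that boto3 is installed and AWS credentials are already configured.

```python
def extract_lines(response: dict, min_confidence: float = 0.0) -> list:
    """Return the text of every LINE block at or above min_confidence.

    `response` is the dictionary returned by Textract text detection;
    its "Blocks" list mixes PAGE, LINE, and WORD entries.
    """
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue
        if block.get("Confidence", 0.0) >= min_confidence:
            lines.append(block.get("Text", ""))
    return lines


def detect_text(image_path: str, region: str = "us-east-1") -> list:
    """Send one image file to Textract and return its detected lines.

    The region default is only an example value.
    """
    import boto3  # imported lazily so extract_lines works without the AWS SDK

    client = boto3.client("textract", region_name=region)
    with open(image_path, "rb") as image_file:
        response = client.detect_document_text(
            Document={"Bytes": image_file.read()}
        )
    return extract_lines(response)
```

Calling detect_text("scan.tif") on a hypothetical file would incur the per-page AWS charge; extract_lines can be tried on a saved response at no cost.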

Limitations of Textract

While Textract enables a number of great new features, it does have some limitations.

Only single-page TIFF images can be processed with Textract
Other file types must be converted to single-page TIFF prior to processing
Searchable PDF output is not supported
Only asynchronous processing is available
No offline processing – must be connected to the Internet
AWS usage fees will be incurred for each page processed

Textract Integration with SimpleIndex

The Textract integration feature enables the Amazon AWS Textract OCR engine, which can read unconstrained print and cursive handwriting with surprisingly good accuracy.

It can be purchased separately or included with SimpleIndex Professional.

Textract is only available as an API, requiring custom programming to make it work. SimpleIndex turns it into a complete document and data capture application designed for easy batch processing on a workstation or server.

Extract text from typed or handwritten documents automatically, even on unconstrained handprint and cursive writing. Automatic extraction of form fields lets you identify key values without templates or training. Accounts payable invoice and receipt processing is also included.
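For readers curious how form fields come back from the underlying API, the sketch below pairs keys with values in a response from the Analyze Document API (Forms). In that response, each form key is a KEY_VALUE_SET block that points to its value block through a VALUE relationship and to its words through CHILD relationships. The function names are hypothetical, and this is only an illustration of the response structure, not how SimpleIndex implements it.

```python
def block_text(block: dict, blocks_by_id: dict) -> str:
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                # Checkboxes report SELECTED or NOT_SELECTED instead of text
                words.append(child.get("SelectionStatus", ""))
    return " ".join(words)


def extract_form_fields(response: dict) -> dict:
    """Map each form key to its value text."""
    blocks_by_id = {b["Id"]: b for b in response.get("Blocks", [])}
    fields = {}
    for block in blocks_by_id.values():
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        key = block_text(block, blocks_by_id)
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(
                        block_text(blocks_by_id[value_id], blocks_by_id)
                    )
        fields[key] = " ".join(part for part in value_parts if part)
    return fields
```

With boto3, a response of this shape comes from client.analyze_document(Document=..., FeatureTypes=["FORMS"]). If the same key text appears twice on a page, this sketch keeps only the last value.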

Captured data can be used to organize files into folders for cloud storage apps, save to a CSV, XML, or JSON file, export to a database, upload to a document management system, perform full-text searching, or even create bookmarks in PDF files.

Connect to Your AWS Account

Using Textract requires an AWS account, which will incur charges for any documents processed using the Textract OCR option.

Follow the directions in the Textract Getting Started Guide to connect SimpleIndex to your Textract account.

In summary, the setup process is:

Create an IAM user for Textract
Obtain the Access Key and Secret Access Key for the Textract user account
On the OCR Options tab of the Job Settings Wizard, select AWSForms, AWSText, or AWSInvoice as the OCR Engine
Click the AWS Creds button to enter your User Access Key and Secret Access Key

To manually create the AWS credentials file under your user profile, follow these steps:

Create the folder c:\Users\xxx\.aws (replacing xxx with your Windows user name)
Create a file called config (no file extension) with Notepad and enter your region info
Be sure to use the abbreviated version of the region name (e.g. us-east-1) and not the full name
Create a file called credentials (no file extension) with Notepad and enter your Access and Secret keys
Copy the .aws folder, including the config and credentials files, to the profile directory for any other accounts that will use it, including service accounts
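The manual steps above can also be scripted. The following Python sketch writes the two files in the standard AWS shared-configuration layout: a [default] section with a region entry in config, and a [default] section with aws_access_key_id and aws_secret_access_key entries in credentials. The function name and its parameters are made up for this example, and the key values shown in the usage note are placeholders.

```python
import configparser
from pathlib import Path


def write_aws_profile(home: Path, region: str,
                      access_key: str, secret_key: str) -> Path:
    """Create <home>/.aws with config and credentials files."""
    aws_dir = home / ".aws"
    aws_dir.mkdir(parents=True, exist_ok=True)

    # config: region in abbreviated form, e.g. us-east-1
    config = configparser.ConfigParser(interpolation=None)
    config["default"] = {"region": region}
    with open(aws_dir / "config", "w") as config_file:
        config.write(config_file)

    # credentials: the IAM user's access key pair
    credentials = configparser.ConfigParser(interpolation=None)
    credentials["default"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    with open(aws_dir / "credentials", "w") as credentials_file:
        credentials.write(credentials_file)

    return aws_dir
```

For the current Windows user this could be called as write_aws_profile(Path.home(), "us-east-1", "YOUR_ACCESS_KEY", "YOUR_SECRET_KEY"). Existing config and credentials files are overwritten, so back them up first if they hold other profiles.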

Pricing

SimpleIndex with Amazon Textract has a two-tier cost structure. First, purchase the appropriate version of SimpleIndex, which can be found on SimpleIndex.com. Second, pay a per-image cost directly to Amazon. To enable this, link your Amazon AWS account to SimpleIndex through the SimpleIndex Job Configuration interface. Once the accounts are linked, every image processed with the Amazon Textract Cloud OCR Engine in SimpleIndex is counted automatically on the AWS account, and Amazon charges that account directly for the total number of images processed.

Amazon AWS Pricing

These prices are set by Amazon for the US region and are subject to change.

AWSText (Detect Document Text API) = $0.0015 per page / $1.50 per 1,000 pages
AWSForms (Analyze Document API – Forms) = $0.05 per page / $50.00 per 1,000 pages
AWSInvoice (Analyze Expense API) = $0.01 per page / $10.00 per 1,000 pages
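To budget a job, the per-1,000-page prices above can be turned into a quick estimate. The short Python sketch below does only that arithmetic. The rates are copied from this list and will go stale when Amazon changes its pricing, the function name is made up for the example, and the result covers AWS usage fees only, not the SimpleIndex license.

```python
# US-region prices per 1,000 pages, copied from the list above (subject to change)
PRICE_PER_1000_PAGES = {
    "AWSText": 1.50,      # Detect Document Text API
    "AWSForms": 50.00,    # Analyze Document API, Forms
    "AWSInvoice": 10.00,  # Analyze Expense API
}


def estimate_cost(engine: str, pages: int) -> float:
    """Estimate the AWS charge in US dollars for a number of pages."""
    if pages < 0:
        raise ValueError("pages must not be negative")
    return round(PRICE_PER_1000_PAGES[engine] * pages / 1000, 4)
```

For example, estimate_cost("AWSForms", 250) gives 12.5, meaning 250 form pages cost about $12.50 at these rates.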