Cloud OCR

From Simple Wiki
Revision as of 09:58, 29 April 2022

SimpleIndex 10.1 adds Amazon Textract cloud-based OCR to the available OCR Engines.

Textract Features

  • Highest accuracy of any available OCR engine
  • Recognition of both print and cursive handwriting
  • Automatic extraction of form field labels and values without templates
  • Automatic extraction of standard fields from Invoices and Receipts
  • Capture of line item data from Invoices
  • Conversion of documents to JSON with the coordinates of all text
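
As an illustration of what JSON with coordinates can be used for, the sketch below reads text lines and their bounding boxes from a Textract-style response. It assumes the JSON follows Amazon Textract's own response layout (a Blocks list whose entries carry BlockType, Text, and a Geometry.BoundingBox expressed as fractions of the page size); whether SimpleIndex writes the response in exactly this shape is an assumption, and the function name and sample data are hypothetical.

```python
import json

def lines_with_pixel_boxes(response, page_width, page_height):
    """Return (text, left, top, width, height) in pixels for each LINE block."""
    results = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, and other block types
        box = block["Geometry"]["BoundingBox"]
        # Textract bounding boxes are ratios of the page width and height.
        results.append((
            block["Text"],
            round(box["Left"] * page_width),
            round(box["Top"] * page_height),
            round(box["Width"] * page_width),
            round(box["Height"] * page_height),
        ))
    return results

# Hypothetical, hand-written sample in the assumed layout.
sample = json.loads("""
{"Blocks": [
  {"BlockType": "PAGE"},
  {"BlockType": "LINE", "Text": "INVOICE",
   "Geometry": {"BoundingBox":
     {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}}}
]}
""")

boxes = lines_with_pixel_boxes(sample, page_width=1000, page_height=800)
print(boxes)
```

Multiplying the ratios by the image size in pixels gives positions that can be compared with zones drawn on the scanned page.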

Limitations of Textract

While Textract enables a number of new features, it does have some limitations:

  • Only single-page TIFF images can be processed with Textract
  • Other file types must be converted to single-page TIFF prior to processing
  • Searchable PDF output is not supported
  • Only asynchronous processing is available
  • No offline processing is available; an Internet connection is required
  • AWS usage fees are incurred for each page processed

Connect to Your AWS Account

Using Textract requires an AWS account, which will incur charges for any documents processed using the Textract OCR option.

Follow the directions in the Textract Getting Started Guide (https://docs.aws.amazon.com/textract/latest/dg/getting-started.html) to connect SimpleIndex to your Textract account.

In summary, the setup process is:

  1. Create an IAM user for Textract
  2. Obtain the Access Key and Secret Access Key for the Textract IAM user
  3. Create the folder C:\Users\xxx\.aws (replacing xxx with your Windows user name)
  4. Create a file called config (no file extension) with Notepad and enter your region info
  5. Create a file called credentials (no file extension) with Notepad and enter your Access Key and Secret Access Key
  6. Copy the .aws folder, including the config and credentials files, to the profile directory of any other accounts that will use Textract, including service accounts

Example config file:

[default]
region = us-east-1

Example credentials file:

[default]
aws_access_key_id = YOUR-IAM-USER-ACCESS-KEY
aws_secret_access_key = YOUR-IAM-USER-SECRET-KEY
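
For setups that need to be repeated across several Windows profiles, the two files can also be generated with a script instead of Notepad. The following is a minimal sketch using only the Python standard library; the function name write_aws_files is hypothetical, and the demo writes to a temporary folder so it cannot overwrite real credentials. For an actual setup the target would be the .aws folder in the user's profile, for example Path.home() / ".aws".

```python
import configparser
import tempfile
from pathlib import Path

def write_aws_files(aws_dir, region, access_key, secret_key):
    """Write 'config' and 'credentials' (no file extension) into aws_dir."""
    aws_dir = Path(aws_dir)
    aws_dir.mkdir(parents=True, exist_ok=True)

    # interpolation=None keeps any '%' characters in the keys literal.
    config = configparser.ConfigParser(interpolation=None)
    config["default"] = {"region": region}
    with open(aws_dir / "config", "w") as f:
        config.write(f)

    credentials = configparser.ConfigParser(interpolation=None)
    credentials["default"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    with open(aws_dir / "credentials", "w") as f:
        credentials.write(f)
    return aws_dir

# Demo with placeholder values, written to a throwaway folder.
demo_dir = write_aws_files(
    Path(tempfile.mkdtemp()) / ".aws",
    region="us-east-1",
    access_key="YOUR-IAM-USER-ACCESS-KEY",
    secret_key="YOUR-IAM-USER-SECRET-KEY",
)
print((demo_dir / "config").read_text())
```

Calling the function once per profile directory covers the copy step for service accounts, provided the script runs with permission to write to those profiles.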

AWSText Engine

AWSForms Engine

AWSInvoice Engine

VENDOR_NAME~AMERICAN WASTE MANAGEMENT SERVICES
TOTAL~$372.00
RECEIVER_ADDRESS~BILL TO: P-8973 AMERICAN REFINING GROUP 77 NORTH KENDALL AVE BRADFORD PA 16701
INVOICE_RECEIPT_DATE~07/31/2021
INVOICE_RECEIPT_ID~210743
PAYMENT_TERMS~30 DAYS
SUBTOTAL~$372.00
TAX~$0.00
LINE1EXPENSE_ROW~JULY2021 RENTAL 07/31/2021 7/1-/31/2021 31.00 $12.00DAY $372.00 $372.00
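
Output in this NAME~VALUE form can be split into individual fields with a short script. The sketch below assumes that every field name is an upper-case identifier immediately followed by a tilde, as in the sample above; that rule is inferred from the sample, not from a documented format, and the function name is hypothetical. It works whether the fields are separated by spaces or by line breaks.

```python
import re

# Field names look like VENDOR_NAME or LINE1EXPENSE_ROW and end with "~".
FIELD = re.compile(r"(?<![A-Za-z0-9_])([A-Z][A-Z0-9_]*)~")

def parse_invoice_fields(text):
    """Split NAME~VALUE text into a dict keyed by field name."""
    matches = list(FIELD.finditer(text))
    fields = {}
    for i, m in enumerate(matches):
        # A value runs from the end of its name to the start of the next name.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[m.group(1)] = " ".join(text[m.end():end].split())
    return fields

sample = (
    "VENDOR_NAME~AMERICAN WASTE MANAGEMENT SERVICES TOTAL~$372.00 "
    "INVOICE_RECEIPT_DATE~07/31/2021 INVOICE_RECEIPT_ID~210743"
)
fields = parse_invoice_fields(sample)
print(fields["VENDOR_NAME"])
print(fields["TOTAL"])
```

A value that itself contains an upper-case word followed by a tilde would be misread as a new field, so the approach only suits data where that cannot occur.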