OCR

From Simple Wiki

OCR is a key function of SimpleIndex, with a number of features and configuration options to consider.

OCR Features & Settings

OCR Overview

Traditional zone OCR solutions require you to specify a region on the page where index information is found. That region is recognized, and the result is inserted into an index field. The problem is that if the content shifts slightly because of variations in scanning, the result may include extra neighboring characters or cut off desired ones. This limits the usefulness of traditional zone OCR to documents where the index value appears in exactly the same place every time and has plenty of white space around it.

SimpleIndex’s OCR contains many advanced features to overcome the inherent limitations of zone OCR. This is done by providing template and dictionary matching for OCR fields. These features search the OCR results for a certain pattern or list of possible values and return only the matching data. This allows you to draw your OCR zones much larger than normal, ensuring that no matter how much the data shifts around it will always be contained within that region.
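The idea behind template matching can be illustrated with a short Python sketch. This is not SimpleIndex's implementation or configuration syntax; the function name `match_template`, the sample zone text, and the invoice-number pattern are all hypothetical, chosen only to show how a pattern can pull one value out of an oversized zone.

```python
import re

def match_template(ocr_text, pattern):
    """Return the first substring of ocr_text that matches pattern, or None.

    Hypothetical helper for illustration; not a SimpleIndex function.
    """
    found = re.search(pattern, ocr_text)
    return found.group(0) if found else None

# Assumed OCR output from a zone drawn much larger than the field itself,
# so neighboring text (date, page count) is picked up as well.
zone_text = "Invoice No: INV-204817  Date 03/14/2023  Page 1 of 2"

# Assumed template: the literal prefix "INV-" followed by exactly six digits.
INVOICE_PATTERN = r"\bINV-\d{6}\b"

invoice_number = match_template(zone_text, INVOICE_PATTERN)
print(invoice_number)
```

Because only the text that fits the pattern is returned, the surrounding characters captured by the large zone are discarded rather than ending up in the index field.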

It is even possible to search the entire page and find key information that is not printed in any fixed location. For example, a doctor’s office may receive lab reports from many different labs. Each report is formatted differently, but each contains the patient’s name somewhere on it. Using the dictionary matching feature with a patient name list, SimpleIndex can identify the correct patient for each lab report automatically.
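A minimal Python sketch of the dictionary matching idea follows. It assumes the full-page OCR text is already available as a string and that the list of possible values is a plain Python list; `normalize`, `match_dictionary`, and the sample data are hypothetical and do not reflect how SimpleIndex performs the match internally.

```python
import re

def normalize(text):
    """Lowercase the text and reduce it to space-separated words, so that
    line breaks and punctuation in the OCR output do not block a match."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def match_dictionary(page_text, values):
    """Return every dictionary value that appears as whole words in page_text.

    Hypothetical helper for illustration; not a SimpleIndex function.
    """
    page = f" {normalize(page_text)} "
    return [value for value in values if f" {normalize(value)} " in page]

# Assumed patient list and full-page OCR text from a lab report.
patients = ["Maria Lopez", "John Smith", "Ann Lee"]
page_text = "ACME LABS\nPatient:  John\nSmith   DOB 1970-01-01\nResult: normal"

matches = match_dictionary(page_text, patients)
print(matches)
```

Padding both strings with spaces restricts matches to whole words, so a page containing "Joann Leeds" would not be mistaken for the patient "Ann Lee". A production setup would likely also need to tolerate OCR misreads, which this sketch does not attempt.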

For data that has no predictable location or format, point and click OCR can be used to capture the information by clicking or drawing a box around the text on the image.

When implementing OCR for document automation, carefully consider the data you are trying to recognize. Is the text legible? Does it appear in a fixed location? Does it conform to a unique pattern that won’t be found anywhere else on the page? Is there a list available with all the possible values for this field? Answer these questions, and you will know which OCR approach is best for your application.

Licensing

The Tesseract OCR engine is included with all versions of SimpleIndex.

The FineReader OCR and ICR Handprint Recognition engine is included with the OCR add-on or Professional license.

Unattended Processing with OCR requires a Server license based on annual processing volume, in increments of 1 million pages per year.

Cloud OCR requires an add-on license and an Amazon AWS account. While the SimpleIndex Cloud OCR license has no page limit, standard AWS Textract processing charges will apply.

Creating OCR Configurations Training Video

This video takes a look under the hood of the Zone OCR sample job to see how it is configured. Learn to draw OCR zones and create basic templates.

Related Knowledge Base Articles