OCR: Difference between revisions

Latest revision as of 15:30, 5 April 2024

OCR is a key function of SimpleIndex, with a number of features and configuration options to consider.

OCR Features & Settings

Zone OCR to read data in a specific location
Handprint Recognition using ICR technology
Cloud OCR with Amazon AWS Textract
Template matching to match unique patterns
Dictionary matching to match a list of possible values
OCR Options: OCR job settings that apply to all fields
File Formats that can be output by OCR
Languages supported by OCR
FineReader versus Tesseract OCR engines
Searchable PDF with MRC compression
OCR to Field for point and click OCR during verification

OCR Overview

Zone OCR solutions traditionally require you to specify a region on the page where index information is found. This region is recognized and the result is inserted into an index field. The problem with traditional zone OCR is that if the region is moved slightly due to variations in scanning, the result could contain extra neighboring characters or cut off desired characters. This limits the usefulness of traditional zone OCR to documents where the index value is in the exact same place every time and has plenty of white space around it.

SimpleIndex’s OCR overcomes the inherent limitations of zone OCR by providing template and dictionary matching for OCR fields. These features search the OCR results for a certain pattern or a list of possible values and return only the matching data. This allows you to draw your OCR zones much larger than normal, so that however much the data shifts around, it stays within the zone.
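The idea behind template matching can be sketched in a few lines of Python. This is only an illustration of the concept, not SimpleIndex's implementation or template syntax: it uses a standard regular expression, and the invoice-number pattern, the function name `match_template`, and the sample text are all hypothetical.

```python
import re

# Hypothetical template: an invoice number such as "INV-123456".
# (Assumption for illustration; SimpleIndex templates use their own syntax.)
INVOICE_TEMPLATE = re.compile(r"\bINV-\d{6}\b")


def match_template(ocr_text, template=INVOICE_TEMPLATE):
    """Return the first substring of ocr_text that fits the template, or None."""
    found = template.search(ocr_text)
    return found.group(0) if found else None


# Simulated OCR output from an oversized zone that also captured neighboring text.
zone_text = "Acme Corp  Date: 04/05/2024\nInvoice No: INV-004217  Page 1 of 2"
invoice_number = match_template(zone_text)
print(invoice_number)
```

Because only the text that fits the pattern is returned, the extra characters picked up by the larger zone are discarded.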

It is even possible to search the entire page and find key information that is not printed in any fixed location. For example, a doctor’s office may receive lab reports from many different labs. Each report is formatted differently, but each contains the patient’s name somewhere on it. Using the dictionary matching feature with a patient name list, SimpleIndex can identify the correct patient for each lab report automatically.
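A minimal sketch of the dictionary matching idea follows, assuming full-page OCR text and a short patient list. It is not how SimpleIndex performs the match internally: the function names, the sample names, and the 0.85 similarity cutoff are hypothetical, and the fuzzy comparison uses Python's standard `difflib` module so that a small OCR error (here a zero read in place of the letter O) still finds the right entry.

```python
import difflib
import re


def normalize(text):
    """Uppercase and collapse runs of non-alphanumeric characters to single spaces."""
    return re.sub(r"[^A-Z0-9]+", " ", text.upper()).strip()


def match_dictionary(ocr_text, names, cutoff=0.85):
    """Return the entry of names that best matches some run of words in ocr_text.

    Returns None when no entry reaches the (hypothetical) similarity cutoff.
    """
    words = normalize(ocr_text).split()
    best_name, best_score = None, 0.0
    for name in names:
        key = normalize(name)
        size = len(key.split())
        # Slide a window of the same word count as the entry across the page text.
        for start in range(len(words) - size + 1):
            window = " ".join(words[start:start + size])
            score = difflib.SequenceMatcher(None, key, window).ratio()
            if score > best_score:
                best_name, best_score = name, score
    return best_name if best_score >= cutoff else None


# Hypothetical patient list and simulated OCR text containing a misread character.
patients = ["Maria Gonzalez", "John Smith", "Priya Patel"]
report_text = "LabCorp Results\nPatient: J0HN SMITH   DOB 01/02/1980\nGlucose 92 mg/dL"
print(match_dictionary(report_text, patients))
```

Returning the dictionary entry rather than the raw OCR text means the index value is always one of the known values, even when the recognized text contains an error.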

For data that has no predictable location or format, point and click OCR can be used to capture the information by clicking or drawing a box around the text on the image.

When implementing OCR for document automation, carefully consider the data you are trying to recognize. Is the text legible? Does it appear in a fixed location? Does it conform to a unique pattern that won’t be found anywhere else on the page? Is there a list available with all the possible values for this field? Answer these questions, and you will know which OCR approach is best for your application.

Licensing

The Tesseract OCR engine is included with all versions of SimpleIndex.

The FineReader OCR and ICR Handprint Recognition engine is included with the OCR add-on or Professional license.

Unattended Processing with OCR requires a Server license based on annual processing volume, in increments of 1 million pages per year.

Cloud OCR requires an add-on license and an Amazon AWS account. While the SimpleIndex Cloud OCR license has no page limit, standard AWS Textract processing charges will apply.

Creating OCR Configurations Training Video

This video takes a look under the hood of the Zone OCR sample job to see how it is configured. Learn to draw OCR zones and create basic templates.

@@ Line 4: / Line 4: @@
 * [[Zone OCR]] read data in a specific location
+* [[Handprint Recognition]] using [[ICR]] technology
+* [[Cloud OCR]] with Amazon AWS Textract
 * [[Template]] matching to match unique patterns
 * [[Dictionary]] matching to match a list of possible values
-* [[OCR Options]] configuring OCR job settings
+* [[OCR Options]] OCR job settings that apply to all fields
 * [[File_Formats#Full_Page_OCR_Formats|File Formats]] that can be output by OCR
 * [[Languages]] supported by OCR
 * [[FineReader]] versus [[Tesseract]] OCR engines
 * [[Searchable PDF]] with [[MRC]] compression
+* [[OCR to Field]] for [[point and click OCR]] during [[verification]]
 == OCR Overview ==
@@ Line 23: / Line 26: @@
 When implementing OCR for document automation, carefully consider the data you are trying to recognize. Is the text legible?  Does it appear in a fixed location?  Does it conform to a unique pattern that won’t be found anywhere else on the page?  Is there a list available with all the possible values for this field?  Answer these questions, and you will know which OCR approach is best for your application.
+== Licensing ==
+The [[Tesseract]] OCR engine is included with all versions of SimpleIndex.
+The [[FineReader]] OCR and [[ICR]] [[Handprint Recognition]] engine is included with the OCR add-on or Professional license.
+[[Unattended Processing]] with OCR requires a Server license based on annual processing volume, in increments of 1 Million pages per year.
+[[Cloud OCR]] requires an add-on license and an Amazon AWS account. While the SimpleIndex [[Cloud OCR]] license has no page limit, standard AWS Textract processing charges will apply.
+== Creating OCR Configurations Training Video ==
+Takes a look under the hood of the Zone OCR sample job to see how it is configured. Learn to draw OCR zones and create basic templates.
+<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"><youtube>edpKxcMipOI</youtube></div>
+== Related Knowledge Base Articles ==
+* [https://www.simpleindex.com/knowledge-base/how-can-i-improve-recognition-rates-for-my-ocr-fields/ How can I improve recognition rates for my OCR fields?]
+* [https://www.simpleindex.com/knowledge-base/can-simpleindex-create-searchable-pdf-imagetext-files-with-hidden-text/ Can SimpleIndex create searchable PDF Image+Text files with hidden text?]