OCR: Difference between revisions



== OCR Topics ==
* [[Zone OCR]] to read data in a specific location
* [[Template]] matching to match unique patterns
* [[Dictionary]] matching to match a list of possible values
* [[OCR Options]] for configuring OCR job settings
* [[File_Formats#Full_Page_OCR_Formats|File Formats]] that can be output by OCR
* [[Languages]] supported by OCR
* [[FineReader]] versus [[Tesseract]] OCR engines
* [[Searchable PDF]] with [[MRC]] compression

Revision as of 10:09, 14 January 2022

== OCR Overview ==

Zone OCR solutions traditionally require you to specify a region on the page where index information is found. This region is recognized and the result is inserted into an index field. The problem with traditional zone OCR is that if the data shifts slightly relative to the region due to variations in scanning, the result can contain extra neighboring characters or cut off desired characters. This limits the usefulness of traditional zone OCR to documents where the index value is in exactly the same place every time and has plenty of white space around it.
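The shift problem described above can be illustrated with a small simulation. This is only a sketch, not SimpleIndex code: it assumes the OCR engine has already returned words with pixel bounding boxes, and the `Word` class, the `words_in_zone` helper, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A recognized word and its bounding box in pixels (hypothetical structure)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def words_in_zone(words, zone):
    """Return the text of every word whose box overlaps the zone.

    zone is (left, top, right, bottom). Overlap is a simplification of what
    a real engine does when it recognizes the cropped region.
    """
    zl, zt, zr, zb = zone
    hits = [
        w for w in words
        if w.left < zr and w.right > zl and w.top < zb and w.bottom > zt
    ]
    hits.sort(key=lambda w: (w.top, w.left))  # reading order
    return " ".join(w.text for w in hits)


page = [
    Word("Invoice", 100, 50, 180, 70),
    Word("No:", 190, 50, 220, 70),
    Word("INV-10482", 230, 50, 330, 70),
    Word("Date:", 350, 50, 400, 70),
]

# A tight zone drawn around the invoice number on a well-aligned scan.
zone = (225, 45, 335, 75)
aligned = words_in_zone(page, zone)

# The same page scanned 30 pixels further to the right.
shifted_page = [
    Word(w.text, w.left + 30, w.top, w.right + 30, w.bottom) for w in page
]
shifted = words_in_zone(shifted_page, zone)

print(repr(aligned))
print(repr(shifted))
```

On the aligned page the zone returns only the invoice number. After the shift, the neighboring label falls inside the zone as well, which is the "extra neighboring characters" failure; a real engine working on the cropped pixels would also clip the trailing digits.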

SimpleIndex’s OCR includes advanced features that overcome the inherent limitations of zone OCR by providing template and dictionary matching for OCR fields. These features search the OCR results for a certain pattern or a list of possible values and return only the matching data. This allows you to draw your OCR zones much larger than normal, so that however much the data shifts, it always remains within the zone.
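The idea behind template matching can be sketched with a regular expression applied to the text of an oversized zone. This is an illustration only and does not show SimpleIndex's own template syntax; the `match_template` helper, the invoice-number pattern, and the sample text are assumptions made for the example.

```python
import re


def match_template(ocr_text, pattern):
    """Return the first substring of ocr_text that fits the pattern, or None."""
    match = re.search(pattern, ocr_text)
    return match.group(0) if match else None


# Hypothetical template: the letters INV, a hyphen, then exactly five digits.
INVOICE_PATTERN = r"\bINV-\d{5}\b"

# Text recognized from a zone drawn much larger than the field itself.
oversized_zone_text = "Invoice No: INV-10482 Date: 01/14/2022"

result = match_template(oversized_zone_text, INVOICE_PATTERN)
missing = match_template("Page 2 of 3", INVOICE_PATTERN)

print(result)
print(missing)
```

Because only the text that fits the pattern is returned, the surrounding labels and the date in the larger zone do not end up in the index field.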

It is even possible to search the entire page and find key information that is not printed in any fixed location. For example, a doctor’s office may receive lab reports from many different labs. Each report is formatted differently, but each contains the patient’s name somewhere on it. Using the dictionary matching feature with a patient name list, SimpleIndex can identify the correct patient for each lab report automatically.
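The lab report example can be sketched as a full-page search against a list of known values. This is a minimal illustration under stated assumptions, not the SimpleIndex implementation: the `match_dictionary` helper, the patient list, and the report text are hypothetical, and the comparison is an exact, case-insensitive whole-word match that does not correct OCR misreads.

```python
import re


def normalize(text):
    """Collapse runs of whitespace and lowercase the text for comparison."""
    return re.sub(r"\s+", " ", text).strip().lower()


def match_dictionary(ocr_text, values):
    """Return the dictionary values that appear as whole words in the page text."""
    page = normalize(ocr_text)
    found = []
    for value in values:
        needle = r"\b" + re.escape(normalize(value)) + r"\b"
        if re.search(needle, page):
            found.append(value)
    return found


# Hypothetical patient list and full-page OCR text from one lab report.
patients = ["John Smith", "Maria Lopez", "Wei Chen"]
report_text = (
    "ACME LABS\n"
    "Patient:  MARIA   LOPEZ\n"
    "Test: CBC\n"
    "Ordering physician: Dr. Chenoweth"
)

matches = match_dictionary(report_text, patients)
print(matches)
```

Matching on word boundaries keeps a short list entry from matching inside a longer, unrelated word elsewhere on the page. Tolerating misread characters would need some form of approximate matching, which this sketch leaves out.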

For data that has no predictable location or format, point-and-click OCR can be used to capture the information by clicking on the text or drawing a box around it on the image.

When implementing OCR for document automation, carefully consider the data you are trying to recognize. Is the text legible? Does it appear in a fixed location? Does it conform to a unique pattern that won’t be found anywhere else on the page? Is there a list available with all the possible values for this field? Answer these questions, and you will know which OCR approach is best for your application.
