Optical Character Recognition
During your foray into the world of document scanning, you’ve likely encountered the term “OCR” and may even know that it stands for “Optical Character Recognition”. But what exactly is OCR and how can you make the best use of this sophisticated and valuable tool?
We’re here to give you a run-down of what you need to know about Optical Character Recognition, answer any questions you might have, and recommend the best OCR software solution for your scanning project. Let’s begin!
What is OCR?
The primary purpose of Optical Character Recognition is to quickly and automatically recognize and convert images of machine-printed or typed text into actual electronic data that users can organize, search, and modify. In general, an OCR engine analyzes the pixel data of scanned images and searches for patterns resembling letters, numbers, and other symbols to create a digitized record of characters. While the exact mechanics of this process can be complicated, OCR engines ultimately enable users to easily and effectively perform a wide array of functions such as information entry, processing, categorization, retrieval, and analysis.
Applications of OCR
Optical Character Recognition employs robust technology to digitally convert, recognize, and manage scanned paper and machine-readable documents promptly and accurately. Such reliable OCR capabilities power vital systems, facilitate essential services, improve routine operations, and promote overall efficiency. Two significant methods of such Optical Character Recognition are:
Full Page OCR – Converts the entire page into one of the following formats:
- Plain Text – Basic text information on the page is retained in a consecutive order.
- Formatted Text – Text information is retained in consecutive paragraphs while saving font size and style. This can also preserve tables in a tabular format, such as spreadsheets.
- Exact Copy – All information on the page is retained, including graphics, and placed on the page in the manner that most closely recreates the original document.
- Searchable File – Text information is retained on a hidden layer behind the scanned image, allowing the file’s contents to be searched while retaining the appearance of the original.
Zone OCR – Recognizes document structure and identifies fields of text located in defined areas of the page. This zonal method is often applied for the purpose of indexing and document management. Detailed information can be distinguished and utilized to perform numerous functions, such as saving specific metadata to particular locations, archiving strings of text into organized formats like databases, automating the population of information and processes, and more.
Some documents are difficult or impossible to automate with OCR, for example documents with non-standard layouts, unconstrained handwriting, or very poor scan quality. In applications like invoice processing, fully automating the data entry can require expensive software and weeks of consulting. Even after all that expense, many users miss the interface and data validations that their accounting software entry screens provide.
In cases like this, SimpleIndex can help improve data entry efficiency while archiving your scanned originals at the same time. Here’s how it works:
- Scan a batch of documents for data entry
- Place the SimpleIndex window side-by-side with your data entry window
- Enter the data normally, reading from the scanned image in SimpleIndex
- Press the hotkey combo to transfer the data to SimpleIndex
- Save the image and repeat with the next one
In this configuration, SimpleIndex captures an image of the data entry window, then uses OCR to read the data and index the image. Since the data entry screen has a consistent layout and clear, readable fonts, it can be reliably recognized with OCR.
There are several advantages to this approach:
- Configuration and training take hours, not weeks
- Scanned images are indexed with no extra work
- All the advantages of digital docs: security, searching, sharing, etc.
- Use all the data validation features of your software
- No flipping through paper documents
- Operator keeps eyes on the screen and hands on the keyboard
- Data entry can be done remotely
- Data entry performance improves and files are archived at the same time
Many document scanning solutions use Zone OCR to obtain index data from the page.
SimpleIndex improves upon this time-tested but ultimately limited model with its Dynamic OCR feature.
Let’s look at the difference between the two methods:
Zone OCR is used to read document indexes or tags from text on the page. It is a great way to automate the data entry associated with scanning documents.
However, there are several limitations to zone OCR that must be overcome:
- Index information must be in the exact same place on every page
- Documents shift and skew during scanning, causing the zones to not line up
- If surrounding lines or text on the document are too close, they can encroach on the zone
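The first two limitations can be illustrated with a small sketch. It assumes an OCR engine has already returned words with pixel bounding boxes; the `Word` and `Zone` types, the coordinates, and the sample page are all hypothetical and are not taken from SimpleIndex. A fixed zone reads the account number correctly until the page shifts by a few pixels, at which point the zone comes back empty.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """One OCR word with its bounding box in pixels (hypothetical structure)."""
    text: str
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height


class Zone(NamedTuple):
    """A fixed rectangle on the page, in pixels."""
    left: int
    top: int
    right: int
    bottom: int


def read_zone(words, zone):
    """Join the words whose bounding boxes lie entirely inside the zone."""
    inside = [
        wd for wd in words
        if wd.x >= zone.left and wd.y >= zone.top
        and wd.x + wd.w <= zone.right and wd.y + wd.h <= zone.bottom
    ]
    inside.sort(key=lambda wd: (wd.y, wd.x))  # reading order: top to bottom, left to right
    return " ".join(wd.text for wd in inside)


def shift(words, dx, dy):
    """Simulate a page that moved during scanning."""
    return [wd._replace(x=wd.x + dx, y=wd.y + dy) for wd in words]


# Made-up sample page: two labeled values.
page = [
    Word("Account:", 100, 50, 80, 20),
    Word("1234567", 190, 50, 70, 20),
    Word("Date:", 100, 80, 50, 20),
    Word("2024-01-15", 190, 80, 100, 20),
]
account_zone = Zone(left=185, top=45, right=265, bottom=75)

print(repr(read_zone(page, account_zone)))                # '1234567'
print(repr(read_zone(shift(page, 0, 12), account_zone)))  # '' (value slid out of the zone)
```

Making the zone larger would tolerate the shift, but it would also start picking up neighboring text, which is the third limitation in the list.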
SimpleIndex overcomes these limitations by using Dynamic OCR technology to locate the desired text even when it moves around on the page. Our simplified version of Dynamic OCR works great for many types of documents at a fraction of the cost of other solutions.
- Index information can appear anywhere on any page
- Unwanted characters are automatically ignored
- Find unique patterns of letters and numbers using Template Matching (Social Security #, Date, etc.)
- Use Dictionary Matching to find a value from a list of possible values (Vendor Name, Document Type, etc.)
Dynamic OCR Examples
In the video we see how SimpleIndex approaches a typical Zone OCR example. With SimpleIndex you can use large zones that give a wide margin for error. Template and Dictionary matching are then used to extract the 7-digit Account Number, 6-digit Order Number and Company Name. SimpleIndex discards the surrounding text and keeps the correct value.
Another common example is finding a unique identifier, for example a social security number, that could appear anywhere on the page. Simply enter the template ###-##-#### and SimpleIndex will search the full OCR text until it finds a match. Since only one social security number is likely to appear on the page, a match on this pattern is almost certainly the required value.
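As a rough illustration of how a template like ###-##-#### can be searched for in full-page OCR text, here is a minimal Python sketch. It assumes only that `#` stands for a single digit and every other character is literal; the function names and the sample text are hypothetical, and SimpleIndex's actual template syntax and matching logic may differ.

```python
import re


def template_to_regex(template):
    """Translate a template into a regular expression.

    Assumption: '#' means one digit; any other character matches itself.
    """
    parts = [r"\d" if ch == "#" else re.escape(ch) for ch in template]
    # The lookarounds stop a match from starting or ending inside a longer run of digits.
    return re.compile(r"(?<!\d)" + "".join(parts) + r"(?!\d)")


def find_template(template, ocr_text):
    """Return the first substring of ocr_text that fits the template, or None."""
    match = template_to_regex(template).search(ocr_text)
    return match.group(0) if match else None


# Made-up OCR output; the phone number does not fit the pattern and is skipped.
ocr_text = "Employee: Jane Doe  Phone 555-0100\nSSN 123-45-6789  Dept 42"
print(find_template("###-##-####", ocr_text))  # 123-45-6789
```

Because the search runs over the whole text, the value is found regardless of where it sits on the page.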
With dictionary matching, you can give SimpleIndex a list of possible values and it will automatically search the zone or page for each possible value until it finds a match.
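Dictionary matching can be sketched in a similar way. The example below is an assumption-laden illustration, not SimpleIndex's implementation: it ignores case and extra whitespace, and it tries longer entries first so that a short entry does not shadow a longer one that contains it. The vendor list and OCR text are invented.

```python
import re


def find_dictionary_value(values, ocr_text):
    """Return the first dictionary entry found in the text, or None.

    Assumptions: matching ignores case and repeated whitespace, and
    longer entries are tried before shorter ones.
    """
    normalized = " ".join(ocr_text.split()).lower()
    for value in sorted(values, key=len, reverse=True):
        needle = " ".join(value.split()).lower()
        # Require non-word characters (or the text boundary) on both sides.
        if re.search(r"(?<!\w)" + re.escape(needle) + r"(?!\w)", normalized):
            return value
    return None


vendors = ["Acme", "Acme Supply Co", "Globex Corporation"]
text = "INVOICE\nFrom:  ACME   Supply Co\n123 Main St"
print(find_dictionary_value(vendors, text))  # Acme Supply Co
```

The returned value is the dictionary's own spelling, so the index stays consistent even when the OCR text varies in case or spacing.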
Many dynamic forms processing applications can be implemented using these simple algorithms. This makes SimpleIndex far more versatile than other zone OCR solutions that require the index value to be in the exact same location on every page. Yet SimpleIndex costs only a fraction of the price!
SimpleIndex’s dynamic forms processing can greatly speed up data entry by eliminating a good percentage of indexing work. For many organizations, this brings the labor cost of scanning within reach.
Dynamic OCR can also be applied to MS Office and PDF files, creating a fully automated process for intelligently indexing and reorganizing electronic documents.
Amazon AWS Textract Cloud OCR
With Textract you can capture data from almost any type of form, including handwritten ones! Textract identifies labeled text anywhere on the document and returns the label text along with the corresponding value. Map the labels to index fields in SimpleIndex and you are ready to capture that data no matter where it appears on the page.
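The label-to-field mapping step can be pictured with a short sketch. It does not call Textract; it assumes the label/value pairs have already been extracted and are available as simple tuples. The field names `InvoiceNumber` and `Vendor`, the labels, and the values are hypothetical.

```python
def normalize_label(label):
    """Lowercase a label, drop colons, and collapse whitespace."""
    return " ".join(label.lower().replace(":", " ").split())


def map_to_index_fields(pairs, label_map):
    """Map extracted (label, value) pairs onto index fields.

    pairs:     iterable of (label, value) tuples from form extraction
    label_map: dict of form label -> index field name (hypothetical names)
    Labels with no mapping are ignored; the first value found for a field wins.
    """
    lookup = {normalize_label(label): field for label, field in label_map.items()}
    index = {}
    for label, value in pairs:
        field = lookup.get(normalize_label(label))
        if field is not None and field not in index:
            index[field] = value.strip()
    return index


# Invented extraction result and mapping.
pairs = [("Invoice No:", " INV-0042 "), ("Vendor Name", "Acme Supply Co"), ("Notes", "n/a")]
label_map = {"Invoice No": "InvoiceNumber", "Vendor Name": "Vendor"}
print(map_to_index_fields(pairs, label_map))
# {'InvoiceNumber': 'INV-0042', 'Vendor': 'Acme Supply Co'}
```

Since the mapping keys on the label text rather than on page coordinates, the same configuration works wherever the label appears on the form.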
Textract uses machine learning with a huge model based on the billions of pages processed using Textract to provide the most accurate OCR and form field extraction solution available.
By default, Textract is only available as an API and requires custom coding to integrate it into your document workflows. SimpleIndex turns it into a fully featured batch document and data processing app that is ready to use out of the box.
Since there are no templates to configure or train, setup can be done in hours instead of the days, weeks, or months required by other enterprise data capture solutions.
Pay-as-you-go pricing makes SimpleIndex with Textract the most affordable way to batch process forms for projects with fewer than 50,000 pages per year, especially if you need to read handwriting or have forms with many layout variations.
Support for Regular Expressions
SimpleIndex OCR has a simple built-in template format, as well as support for Regular Expressions. Regular Expressions (RegEx for short) let you define complex search patterns to extract matching values from the text. This greatly enhances the functionality of the dynamic OCR in SimpleIndex, making it capable of finding variable-length fields with no distinct pattern.
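To show what a variable-length field looks like in practice, here is a small Python sketch using the standard `re` module. The pattern, the function name, and the sample text are illustrative assumptions; the RegEx dialect SimpleIndex accepts may differ in detail from Python's.

```python
import re

# Hypothetical pattern: the word "Invoice", a label variant (Number, No, No., #),
# an optional colon, then a value of letters, digits, and hyphens of any length.
INVOICE_RE = re.compile(
    r"Invoice\s*(?:Number\b|No\b\.?|#)\s*:?\s*([A-Z0-9][A-Z0-9-]*)",
    re.IGNORECASE,
)


def extract_invoice_number(ocr_text):
    """Return the invoice number that follows its label, or None if absent."""
    match = INVOICE_RE.search(ocr_text)
    return match.group(1) if match else None


print(extract_invoice_number("Invoice No: INV-2024-00317  Due on receipt"))  # INV-2024-00317
print(extract_invoice_number("INVOICE # 88 Total 10.00"))                    # 88
```

A fixed-width template could not cover both samples, because the two values differ in length and character mix; anchoring on the label is what makes the match reliable.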
Regular Expressions are commonly used in text parsing applications. The Perl programming language makes extensive use of RegEx, as do UNIX utilities like “grep”. Many programmers and IT personnel are already familiar with RegEx and can create complex expressions without specific training.
New OCR Features in Latest Version
- OCR language pack now includes all available Tesseract languages including Hindi, Tamil, Arabic, Chinese, Thai, Vietnamese, Japanese, Korean, Indonesian, Hebrew and many more.
- FineReader engine upgraded from version 9 to version 11, providing improved accuracy, MRC compression and multi-threaded processing for large documents.
- Amazon AWS Textract Cloud OCR option gives you advanced forms extraction, accounts payable invoice and receipt extraction, handprint recognition, and the most accurate OCR available.
How to Configure SimpleIndex OCR
Our Wiki help has extensive information on how to configure OCR for various document and data capture scenarios.
- Zone OCR to read data in a specific location
- Template matching to match unique patterns
- Dictionary matching to match a list of possible values
- OCR Options: job settings that apply to all fields
- File Formats that can be output by OCR
- Languages supported by OCR
- FineReader versus Tesseract OCR engines
- Searchable PDF with MRC compression
- OCR to Field for point and click OCR during verification
- Cloud OCR using Textract
Watch this Simple Software University training video to see how to configure and run an OCR job with SimpleIndex.
KB Articles for Optical Character Recognition (OCR)
- Language Pack for Standard/Tesseract OCR
- Languages Supported in SimpleSoftware OCR Engines
- What is Document Imaging?
- Change the Dictionary Separator Value
- Change the OCR Font or Type
- Regular Expression (RegEx) - Syntax or Type
- Autonumber Increment Value
- I'm using full page OCR. The information is all appearing in the txt file but it is losing format about half way through. Data to the right is ending up at the end of the txt doc. Can this be fixed?
- Is there a way to just use part of a bar code or OCR value? For example, extract "50" from the value "124450"
- If I have a form which is filled manually by hand, can SimpleIndex read the data from it?