Language Pack for Standard/Tesseract OCR
All versions of the SimpleIndex software include OCR with the Standard/Tesseract OCR engine. The SimpleIndex download installs only a limited set of languages. If the language you want to OCR isn’t one of them, you can download the language files you need. Once they are installed, you will be able to pick that language for the Standard/Tesseract OCR engine.
- Go to the Tesseract Language Download Site
- Select the language you want and download it, or download all the languages
- Copy the language files (unzip if downloading more than one language) to this folder: C:\Program Files (x86)\SimpleIndex\Tesseract\v3.04\tessdata
- Close and Reopen SimpleIndex and the downloaded languages will now be selectable
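The copy step above can also be scripted. The following is a minimal sketch, assuming the downloaded language files are Tesseract `.traineddata` files that have already been unzipped; the function name `install_language_files` is made up for this example, and writing to a folder under Program Files normally requires administrator rights.

```python
import shutil
from pathlib import Path


def install_language_files(download_dir, tessdata_dir):
    """Copy every *.traineddata file found under download_dir into tessdata_dir.

    Returns the sorted list of file names that were copied.
    """
    source = Path(download_dir)
    target = Path(tessdata_dir)
    if not target.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {target}")
    copied = []
    # rglob also finds files inside sub-folders created by unzipping.
    for path in sorted(source.rglob("*.traineddata")):
        shutil.copy2(path, target / path.name)
        copied.append(path.name)
    return sorted(copied)


# Example call (both paths are placeholders for your own locations):
# install_language_files(
#     r"C:\Users\me\Downloads\languages",
#     r"C:\Program Files (x86)\SimpleIndex\Tesseract\v3.04\tessdata",
# )
```

As with the manual steps, SimpleIndex still has to be closed and reopened before the new languages appear.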
Languages Supported in SimpleSoftware OCR Engines
SimpleSoftware OCR engines use two different systems for language support. The languages available to you depend on which version of SimpleIndex is installed; add-ons (SimpleIndex Server, SimpleCoversheet, and so on) do not add any language support.
All SimpleSoftware products support the Tesseract 3.02 OCR languages, and additional language libraries can be downloaded. To check which Tesseract language libraries are installed on your workstation, or to add more, look in this folder:
C:\Program Files (x86)\SimpleIndex\Tesseract\v3.02\tessdata
SimpleIndex Pro and SimpleIndex OCR use the FineReader engine, which has one of the largest libraries of supported OCR languages. You can check which languages FineReader supports on your workstation in this file:
C:\Program Files (x86)\SimpleIndex\OCRLanguages.txt
Abkhaz
Adyghe
Afrikaans
Agul
Albanian
Altaic
Armenian Eastern
Armenian Grabar
Armenian Western
Awar
Aymara
Azeri Cyrillic
Azeri Latin
Bashkir
Basque
Belarusian
Bemba
Blackfoot
Breton
Bugotu
Bulgarian
Buryat
Catalan
Chamorro
Chechen
Chukcha
Chuvash
Corsican
Crimean Tatar
Croatian
Crow
Czech
Danish
Dargwa
Dungan
Dutch Belgian
Dutch Standard
English
English Australian
English Belize
English Canadian
English Caribbean
English Ireland
English Jamaica
English Law
English Medical
English New Zealand
English Philippines
English South Africa
English Trinidad
English United Kingdom
English United States
English Zimbabwe
Eskimo Cyrillic
Eskimo Latin
Esperanto
Estonian
Even
Evenki
Faeroese
Fijian
Finnish
French
French Belgian
French Canadian
French Luxembourg
French Monaco
French Standard
French Swiss
Frisian
Friulian
Gaelic Scottish
Gagauz
Galician
Ganda
German
German Austrian
German Law
German Liechtenstein
German Luxembourg
German Medical
German New Spelling
German New Spelling Law
German New Spelling Medical
German Standard
German Swiss
Greek
Guarani
Hani
Hausa
Hawaiian
Hungarian
Icelandic
Ido
Indonesian
Ingush
Interlingua
Irish
Italian
Italian Standard
Italian Swiss
Kabardian
Kalmyk
Karachay Balkar
Karakalpak
Kasub
Kawa
Kazakh
Khakas
Khanty
Kikuyu
Kirgiz
Kongo
Koryak
Kpelle
Kumyk
Kurdish
Lak
Lappish
Latin
Latvian
Latvian Gothic
Lezgin
Lithuanian
Lithuanian Classic
Luba
Macedonian
Malagasy
Malay Brunei Darussalam
Malay Malaysian
Malinke
Maltese
Mansi
Maori
Mari
Maya
Miao
Minankabaw
Mohawk
Mongol
Mordvin
Nahuatl
Nenets
Nivkh
Nogay
Norwegian Bokmal
Norwegian Nynorsk
Null
Nyanja
Occidental
Ojibway
Old English
Old French
Old German
Old Italian
Old Spanish
Ossetic
Papiamento
Pidgin English
Polish
Portuguese Brazilian
Portuguese Standard
Provencal
Quechua
Rhaeto Romanic
Romanian
Romanian Moldavia
Romany
Ruanda
Rundi
Russian
Russian Moldavia
Russian Old Spelling
Samoan
Selkup
Serbian Cyrillic
Serbian Latin
Shona
Sioux
Slovak
Slovenian
Somali
Sorbian
Sotho
Spanish
Spanish Argentina
Spanish Bolivia
Spanish Chile
Spanish Colombia
Spanish Costa Rica
Spanish Dominican Republic
Spanish Ecuador
Spanish El Salvador
Spanish Guatemala
Spanish Honduras
Spanish Mexican
Spanish Modern Sort
Spanish Nicaragua
Spanish Panama
Spanish Paraguay
Spanish Peru
Spanish Puerto Rico
Spanish Traditional Sort
Spanish Uruguay
Spanish Venezuela
Sunda
Swahili
Swazi
Swedish
Swedish Finland
Tabassaran
Tagalog
Tahitian
Tajik
Tatar
Tinpo
Tongan
Tswana
Tun
Turkish
Turkmen
Tuvin
Udmurt
Uighur Cyrillic
Uighur Latin
Ukrainian
Uzbek Cyrillic
Uzbek Latin
Visayan
Welsh
Wolof
Xhosa
Yakut
Yiddish
Zapotec
Zulu
Change the OCR Font or Type
This setting changes the OCR recognition font or type from the default, which is “To Be Detected”. It can be used to look for a specific type of OCR font and is especially useful for recognizing things like Dotmatrix, OCR A and OCR B.
Instructions for setting OCR Font:
1. Right-click the .sic file, select Open With and choose a text editor (Notepad, WordPad, etc.)
2. Find <OCR_TEXT_TYPE>. If you can’t find <OCR_TEXT_TYPE>, add the following as the last line of the file:
<OCR_TEXT_TYPE>#</OCR_TEXT_TYPE>
3. Change the number between the tags: <OCR_TEXT_TYPE>#</OCR_TEXT_TYPE>
4. Use the number of the desired font:
- 0 Normal
- 1 Typewriter
- 2 Dotmatrix
- 3 Index
- 5 OCR A
- 6 OCR B
- 7 MICR E13B
- 8 MICR CMC7
- 9 Gothic
- 10 To Be Detected
5. Save and close the file
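The same edit can be made with a short script. This is only a sketch of the steps above, assuming the .sic file can be read and written as plain text; the function name `set_ocr_text_type` is made up for illustration. Back up the .sic file before changing it.

```python
import re

# Font codes exactly as listed in the steps above (the list has no entry for 4).
OCR_TEXT_TYPES = {
    0: "Normal", 1: "Typewriter", 2: "Dotmatrix", 3: "Index",
    5: "OCR A", 6: "OCR B", 7: "MICR E13B", 8: "MICR CMC7",
    9: "Gothic", 10: "To Be Detected",
}

TAG_PATTERN = re.compile(r"<OCR_TEXT_TYPE>.*?</OCR_TEXT_TYPE>", re.DOTALL)


def set_ocr_text_type(config_text, code):
    """Return config_text with the OCR_TEXT_TYPE tag set to the given code."""
    if code not in OCR_TEXT_TYPES:
        raise ValueError(f"unknown OCR text type: {code}")
    new_tag = f"<OCR_TEXT_TYPE>{code}</OCR_TEXT_TYPE>"
    if TAG_PATTERN.search(config_text):
        # Step 3: the tag exists, so only the number between the tags changes.
        return TAG_PATTERN.sub(new_tag, config_text, count=1)
    # Step 2: the tag is missing, so it is added as the last line of the file.
    separator = "" if config_text.endswith("\n") or not config_text else "\n"
    return config_text + separator + new_tag + "\n"


# Example use on a file (the file name is a placeholder):
# text = open("myjob.sic").read()
# open("myjob.sic", "w").write(set_ocr_text_type(text, 2))  # 2 = Dotmatrix
```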
I’m using full page OCR. The information is all appearing in the txt file but it is losing format about half way through. Data to the right is ending up at the end of the txt doc. Can this be fixed?
SimpleIndex version 7 solves this problem with the incorporation of the FineReader OCR engine. Full text in PDFs will now flow with the formatting of the PDF.
Legacy Versions: SimpleIndex can also be used with other OCR applications and servers to improve accuracy, formatting and performance. Use the OCR applications to convert the scanned images to text or searchable PDF, and SimpleIndex can extract index values from the text and automatically sort and organize the files.
How do you train the OCR engine for better accuracy?
Training has been removed with version 7 due to the addition of the ABBYY FineReader OCR engine.
How do you configure full text searching in Retrieval mode?
On the Database tab there is a dropdown in the lower portion of the panel for Full Text OCR Field. Put the name of the field that will store the full-text data there. This must be configured for both the Insert and Retrieval mode configurations. The database field must be long enough to store the entire text of your document.
Of course, the Insert Mode configuration must have “Enable Full Page OCR” checked to generate full text data from images. Text from MS Office documents, PDF files and existing OCR text files can be used without setting this option.
When designing your Retrieval Mode configuration, create a Text field to use for full text search queries. On the Database tab, set the corresponding “Database Field Name” to the full text database field.
When searching on your full text field, SimpleIndex finds the text you enter no matter where it appears in the document. It is able to match partial words. It does not perform boolean or natural language searches. The text entered must match the document text exactly.
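The search behavior described above amounts to plain substring matching. The sketch below illustrates that idea only; it is not SimpleIndex code. Whether the real search is case-sensitive is not stated here, so it is left as a parameter (`ignore_case`, an assumption of this example).

```python
def full_text_matches(document_text, query, ignore_case=False):
    """Return True if query appears anywhere in document_text as a substring."""
    if not query:
        return False
    if ignore_case:
        document_text, query = document_text.lower(), query.lower()
    # No boolean operators or word analysis: the query is matched literally.
    return query in document_text


text = "Invoice total due: 1,250.00"
print(full_text_matches(text, "total due"))          # phrase anywhere in the text
print(full_text_matches(text, "voice"))              # partial word
print(full_text_matches(text, "Invoice AND total"))  # boolean syntax is not interpreted
```

The first two calls find a match; the third does not, because the literal string "Invoice AND total" never occurs in the text.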
How can I improve recognition rates for my OCR fields?
There are several things you can do to improve accuracy for OCR.
- Scan at 300dpi, black & white for best results.
- Adjust the scan settings to remove background noise and improve the definition of characters.
- For Zone OCR, field recognition can often vary based on the surrounding white space and text in the zone. Try varying the size of the zone to achieve optimal results.
- For template matching, make sure all variations of the field format are included in the template list.
- For dictionary matching, add common variations and OCR mistakes to the “thesaurus” list.
- On the Zones & OCR tab (accessed from the Job Options) you can adjust the Max Errors setting to allow for more mistakes in the dictionary matching process.
- Use the Strip Spaces, Strip Characters, Replace Characters and Case Fixing options to standardize the field format prior to matching.
Please refer to the SimpleIndex Wiki for details on how to configure these options.
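To show how several of these options interact, here is a rough sketch of dictionary matching with a thesaurus and an error limit. It is an illustration under assumptions, not the algorithm SimpleIndex uses: errors are counted as edit distance, the cleanup step stands in for the Strip Spaces, Strip Characters and Case Fixing options, and all function and parameter names are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: the number of single-character edits between a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(value, strip_chars=""):
    """Remove whitespace and unwanted characters, then upper-case the value."""
    kept = (ch for ch in value if not ch.isspace() and ch not in strip_chars)
    return "".join(kept).upper()


def match_dictionary(ocr_value, dictionary, thesaurus=None, max_errors=1, strip_chars=".,"):
    """Return the best dictionary entry for ocr_value, or None if none is close enough."""
    candidate = normalize(ocr_value, strip_chars)
    # Thesaurus: known variations and OCR mistakes mapped to the correct entry.
    for variant, canonical in (thesaurus or {}).items():
        if normalize(variant, strip_chars) == candidate:
            return canonical
    best, best_errors = None, max_errors + 1
    for entry in dictionary:
        errors = edit_distance(candidate, normalize(entry, strip_chars))
        if errors < best_errors:
            best, best_errors = entry, errors
    return best


vendors = ["Acme Corp", "Globex Inc"]
print(match_dictionary("ACME C0RP", vendors))                  # one OCR error, still matched
print(match_dictionary("Acne Co1p.", vendors, max_errors=1))   # two errors, rejected
print(match_dictionary("Acne Co1p.", vendors, max_errors=2))   # accepted with a higher limit
```

Raising the error limit recovers more misread values but also raises the chance of matching the wrong entry, which is why adding known mistakes to the thesaurus is often the safer first step.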
Related Links
- SimpleIndex.com – Zone OCR
- SimpleIndex.com – Dynamic OCR
- SimpleOCR.com – OCR Guide
- SimpleIndex Wiki – OCR
- SimpleIndex Wiki – OCR Options
- SimpleIndex Wiki – Zone OCR
- SimpleIndex Wiki – Full Page OCR
- SimpleIndex Wiki – Zones & OCR Settings
- SimpleIndex Wiki – OCR to Field
- SimpleIndex Wiki – OCR Text View
- SimpleIndex Wiki – Template & Dictionary Matching OCR
- SimpleIndex Wiki – OMR and OCR Document Separation
Can OCR text be saved to Office, Text, HTML or other formats?
Yes. On the OCR step of the Job Settings Wizard you can select the text output format you need in the “Full-page OCR file type” drop-down. By default it is set to PDF, but it can be changed to Text (txt), Word (docx), Rich Text (rtf), Open Office (odt), Excel (xlsx), PowerPoint (pptx), ePub Zip (epub), FictionBook (fb2), HTML (htm), XML (xml) or Alto XML (alto.xml).
If the output file type is set to PDF, OCR text will be embedded as hidden text in the PDF file.
Can SimpleIndex create searchable PDF Image+Text files with hidden text?
Yes, it can. You can configure this setting in the Job Settings Wizard by going to the OCR step and checking “Enable full-page OCR”. There are many settings in the OCR step that you can use to customize the output and recognition of images.
SimpleIndex has two different OCR engines (Standard and Professional) that can be used to produce PDF Image + Text files or Searchable PDFs.
Indexing from Applications with Screen OCR
Some documents are difficult or impossible to automate with OCR, for example documents with non-standard layouts, unconstrained handwriting or very poor scan quality. In applications like invoice processing, fully automating the data entry can require expensive software and weeks of consulting. Even after all that expense, many users miss the interface and data validations that their accounting software entry screens provide.
In cases like this, SimpleIndex can help improve data entry efficiency while archiving your scanned originals at the same time. Here’s how it works:
- Scan a batch of documents for data entry
- Place the SimpleIndex window side-by-side with your data entry window
- Enter the data normally, reading from the scanned image in SimpleIndex
- Press the hotkey combo to transfer the data to SimpleIndex
- Save the image and repeat with the next one
In this configuration, SimpleIndex captures an image of the data entry window, then uses OCR to read the data and index the image. Since the data entry screen has a consistent layout and clear, readable fonts, it can be reliably recognized with OCR.
There are several advantages to this approach:
- Configuration and training take hours, not weeks
- Scanned images are indexed with no extra work
- All the advantages of digital docs: security, searching, sharing, etc.
- Use all the data validation features of your software
- No flipping through paper documents
- Operator keeps eyes on the screen and hands on the keyboard
- Data entry can be done remotely
- Data entry performance improves and files are archived at the same time
Full-Page OCR Indexing Demo
This sample job demonstrates how SimpleIndex converts scanned documents to searchable PDF files and extracts index data from the OCR text. It also demonstrates the multi-user workflow capabilities.
Step 1 uses a full-page OCR process on each image.
Field data is extracted from the full-page text using template and dictionary matching algorithms.
This is done in Pre-Index mode to allow unattended processing.
Data is saved to a database so it can be reviewed and corrected in Step 2.
Step 2 uses Database Update mode to find images with missing index values and allow the user to manually enter the correct data.
Step 3 uses a SimpleSearch configuration to search and view the indexed images, including full text searches.
Find Out More
- Download or get an Online Demo
- Dynamic OCR Features in SimpleIndex
- Full-Page OCR Wiki Pages
- OCR Features and Settings Wiki Pages
- OCR Software Guide on SimpleOCR
Zone OCR with Template Matching
This video shows the Zone OCR Invoice Processing sample job. Zone OCR is the traditional method for extracting index data from printed text appearing in fixed locations on every page.
The video also shows how Zone OCR is enhanced with SimpleIndex’s Template Matching and Dictionary Matching features, giving you much more margin for error than other solutions.
Zone OCR and Dynamic OCR

Many document scanning solutions use Zone OCR to obtain index data from the page.
SimpleIndex improves upon this time-tested but ultimately limited model with its Dynamic OCR feature.
Let’s look at the difference between the two methods:
Zone OCR
Zone OCR is used to read document indexes or tags from text on the page. It is a great way to automate the data entry associated with scanning documents.
However, there are several limitations to zone OCR that must be overcome:
- Index information must be in the exact same place on every page
- Documents shift and skew during scanning, causing the zones to not line up
- If surrounding lines or text on the document are too close, they can encroach on the zone
Dynamic OCR
SimpleIndex overcomes these limitations by using Dynamic OCR technology to locate the desired text even when it moves around on the page. Our simplified version of Dynamic OCR works great for many types of documents at a fraction of the cost of other solutions.
- Index information can appear anywhere on any page
- Unwanted characters are automatically ignored
- Find unique patterns of letters and numbers using Template Matching (Social Security #, Date, etc.)
- Use Dictionary Matching to find a value from a list of possible values (Vendor Name, Document Type, etc.)
Dynamic OCR Examples
In the video we see how SimpleIndex approaches a typical Zone OCR example. With SimpleIndex you can use large zones that give a wide margin for error. Template and Dictionary matching are then used to extract the 7-digit Account Number, 6-digit Order Number and Company Name. SimpleIndex discards the surrounding text and keeps the correct value.
Another common example is finding a unique identifier, for example a social security number, that could appear anywhere on the page. Simply enter the template ###-##-#### and SimpleIndex will search the full OCR text until it finds a match. Since only one social security number is likely to appear on the page, a match on this pattern is almost certainly the required value.
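One way to picture template matching is as a translation from the template into a regular expression. The sketch below assumes that `#` stands for a single digit (as the ###-##-#### example implies) and treats every other character literally; any other placeholders the real template syntax may support are not modeled, and the function names are made up.

```python
import re


def template_to_regex(template):
    """Build a regex from a template where '#' means one digit."""
    parts = [r"\d" if ch == "#" else re.escape(ch) for ch in template]
    # The lookarounds stop a match from starting or ending inside a longer run of digits.
    return re.compile(r"(?<!\d)" + "".join(parts) + r"(?!\d)")


def find_template(text, template):
    """Return the first substring of text that fits the template, or None."""
    match = template_to_regex(template).search(text)
    return match.group(0) if match else None


page = "Name: J. Doe  SSN 123-45-6789  Phone 555-0100"
print(find_template(page, "###-##-####"))

zone = "Acct 1234567 Order 654321 Acme Corp"
print(find_template(zone, "#######"))  # 7-digit account number
print(find_template(zone, "######"))   # 6-digit order number
```

Because the pattern carries the structure, the zone can be drawn generously and the surrounding text is simply ignored.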
With dictionary matching, you can give SimpleIndex a list of possible values and it will automatically search the zone or page for each possible value until it finds a match.
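In its simplest form, that search is a loop over the list of possible values. The following sketch makes two assumptions that are not stated above: matching ignores case, and runs of whitespace (including line breaks in the OCR text) count as a single space. The function name is illustrative.

```python
def find_dictionary_value(text, possible_values):
    """Return the first value from possible_values that occurs in text, or None."""
    haystack = " ".join(text.split()).lower()
    for value in possible_values:
        needle = " ".join(value.split()).lower()
        if needle and needle in haystack:
            return value
    return None


page_text = "INVOICE\nFrom: Globex   Inc\nTotal 100"
print(find_dictionary_value(page_text, ["Acme Corp", "Globex Inc"]))
```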
Many dynamic forms processing applications can be implemented using these simple algorithms. This makes SimpleIndex far more versatile than other zone OCR solutions that require the index value to be in the exact same location on every page. Yet SimpleIndex costs only a fraction of the price!
SimpleIndex’s dynamic forms processing can greatly speed up data entry by eliminating a good percentage of indexing work. For many this can put the labor cost of scanning within their reach.

Dynamic OCR can also be applied to MS Office and PDF files, creating a fully automated process for intelligently indexing and reorganizing electronic documents.

Amazon AWS Textract Cloud OCR
With Textract you can capture data from almost any type of form, including handwritten ones! Textract identifies labeled text anywhere on the document and returns the label text along with the corresponding value. Map the labels to index fields in SimpleIndex and you are ready to capture that data no matter where it appears on the page.
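The label-to-field mapping can be pictured as a small lookup. This sketch does not call Textract and does not reflect how SimpleIndex stores its mapping; it assumes the label/value pairs have already been extracted into a dictionary, and the field names, labels and function name are invented for the example.

```python
def map_labels_to_fields(label_values, field_map):
    """Map extracted label/value pairs onto index fields.

    label_values: dict of label text -> value found next to that label.
    field_map: dict of index field name -> list of label texts accepted for it.
    Fields with no matching label are returned as None.
    """
    def clean(label):
        # Ignore colons, extra spaces and case so "Invoice No:" equals "invoice no".
        return " ".join(label.replace(":", " ").split()).lower()

    cleaned = {clean(label): value for label, value in label_values.items()}
    result = {}
    for field, labels in field_map.items():
        for label in labels:
            if clean(label) in cleaned:
                result[field] = cleaned[clean(label)]
                break
        else:
            result[field] = None
    return result


extracted = {"Invoice No:": "INV-1001", "Date": "2024-01-05"}
fields = {
    "InvoiceNumber": ["Invoice Number", "Invoice No"],
    "InvoiceDate": ["Date"],
    "Vendor": ["Vendor Name"],
}
print(map_labels_to_fields(extracted, fields))
```

Listing several accepted labels per field is what lets one mapping cover forms with different layouts and wording.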
Textract uses machine learning with a huge model based on the billions of pages processed using Textract to provide the most accurate OCR and form field extraction solution available.
By default, Textract is only available as an API and requires custom coding to integrate it into your document workflows. SimpleIndex turns it into a fully featured batch document and data processing app that is ready to use out of the box.
Since there are no templates to configure or train, setup can be done in hours instead of the days, weeks or months required by other enterprise data capture solutions.
Pay-as-you-go pricing makes SimpleIndex with Textract the most affordable way to batch process forms for projects with fewer than 50,000 pages per year, especially if you need to read handwriting or have forms with many layout variations.
Wiki: How to configure AWS Textract OCR in SimpleIndex
Support for Regular Expressions

SimpleIndex OCR has a simple built-in template format, as well as support for Regular Expressions. Regular Expressions (RegEx for short) let you define complex search patterns to extract matching values from the text. This greatly enhances the functionality of the dynamic OCR in SimpleIndex, making it capable of finding variable-length fields with no distinct pattern.
Regular Expressions are commonly used in text parsing applications. The Perl programming language makes extensive use of RegEx, as do UNIX utilities like “grep”. Many programmers and IT personnel are already familiar with RegEx and can create complex expressions without specific training.
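As an illustration of a variable-length field, the sketch below uses Python's `re` module to pull an invoice number that follows a label. The pattern and the sample text are invented for this example, and the RegEx dialect accepted by SimpleIndex may differ in details from Python's.

```python
import re

# Label ("Invoice No", "Invoice Number" or "Invoice #"), optional colon,
# then a value of any length made of letters, digits and hyphens.
INVOICE_PATTERN = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)(?![A-Za-z])\s*:?\s*([A-Z0-9][A-Z0-9-]*)",
    re.IGNORECASE,
)


def extract_invoice_number(text):
    """Return the value following an invoice-number label, or None."""
    match = INVOICE_PATTERN.search(text)
    return match.group(1) if match else None


print(extract_invoice_number("Invoice No: INV-2024-17 Date 01/05/2024"))
print(extract_invoice_number("INVOICE # 88"))
```

The capture group returns only the value, so the label itself never ends up in the index field.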
Click here for a reference guide to Regular Expressions
New OCR Features in Latest Version
- OCR language pack now includes all available Tesseract languages including Hindi, Tamil, Arabic, Chinese, Thai, Vietnamese, Japanese, Korean, Indonesian, Hebrew and many more.
- FineReader engine upgraded from version 9 to version 11, providing improved accuracy, MRC compression and multi-threaded processing for large documents.
- Amazon AWS Textract Cloud OCR option gives you advanced forms extraction, accounts payable invoice and receipt extraction, handprint recognition, and the most accurate OCR available.
How to Configure SimpleIndex OCR
Our Wiki help has extensive information on how to configure OCR for various document and data capture scenarios.
- Zone OCR to read data in a specific location
- Template matching to match unique patterns
- Dictionary matching to match a list of possible values
- OCR Options: OCR job settings that apply to all fields
- File Formats that can be output by OCR
- Languages supported by OCR
- FineReader versus Tesseract OCR engines
- Searchable PDF with MRC compression
- OCR to Field for point and click OCR during verification
- Cloud OCR using Textract
Watch this Simple Software University training video to see how to configure and run an OCR job with SimpleIndex.
Compare Leading Solutions
SimpleIndex™
Kodak Capture Pro™
Kofax Express™
PaperVision™ Capture Desktop
Note: This video depicts PaperVision Capture Desktop, a now discontinued software that has since been replaced by the similarly functioning updated version of PaperFlow.
Office Gemini DiamondVision™
Testing Methods
The benchmark times were recorded using all available software shortcuts, and by performing data entry and user interactions as fast as possible. The same scanner and computer hardware was used for each test. Much care was taken to ensure that each application yielded the most accurate OCR results possible given the sample documents.
Unfortunately, none of our competitors could accurately capture the account number on all 10 pages. The extra time to correct these errors accounts for 15-30% of the difference in processing times. The difference in accuracy is due in large part to SimpleIndex’s pattern matching OCR feature, which the other programs lack.
Keep in mind these videos were recorded using the latest version available at the time the test was run. Results may vary with later versions.