SimpleIndex uses Optical Character Recognition (OCR) to read index data from scanned images, convert documents to searchable PDF or text files, extract fields dynamically, and provide other OCR features.
Please refer to the Wiki Documentation for the complete Global Settings Wizard reference.
All versions of the SimpleIndex software include OCR with the Standard/Tesseract OCR engine. The SimpleIndex download installs only a limited set of languages. If the language you would like to OCR with SimpleIndex is not one of them, you can download the language(s) you need. Once they are installed, you can pick the language you want to read with the Standard/Tesseract OCR engine.
- Go to the Tesseract Language Download Site
- Select the language you want and download it, or download all of the languages
- Copy the language files (unzip if downloading more than one language) to this folder: C:\Program Files (x86)\SimpleIndex\Tesseract\v3.04\tessdata
- Close and reopen SimpleIndex; the downloaded languages will now be selectable
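The copy step above can also be scripted. The following is a minimal sketch, not part of SimpleIndex: the function name `install_languages` is hypothetical, and it assumes the downloaded language files are already unzipped into one folder and that each language has a file ending in `.traineddata` (the usual Tesseract naming).

```python
import shutil
from pathlib import Path


def install_languages(download_dir, tessdata_dir):
    """Copy every file in download_dir into tessdata_dir.

    Returns the language codes (file names without .traineddata)
    that were copied, in alphabetical order.
    """
    source = Path(download_dir)
    target = Path(tessdata_dir)
    installed = []
    for item in sorted(source.iterdir()):
        if not item.is_file():
            continue  # skip sub-folders
        shutil.copy2(item, target / item.name)
        if item.suffix == ".traineddata":
            installed.append(item.stem)
    return installed
```

The second argument would be the tessdata folder named in the steps above. Writing to a folder under Program Files normally requires administrator rights.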
Please refer to the Wiki Documentation for the complete Languages reference.
SimpleSoftware OCR engines use two different systems for language support. The languages your OCR supports depend on the version of SimpleIndex you have installed; add-ons (SimpleIndex Server, SimpleCoversheet, and so on) do not add any language support.
All SimpleSoftware products support the Tesseract 3.02 OCR languages. Additional language libraries can be downloaded from the Tesseract Language Download Site. You can check which Tesseract language libraries are installed on your station, and add more, in this folder:
C:\Program Files (x86)\SimpleIndex\Tesseract\v3.02\tessdata
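To check which Tesseract languages are present, the folder above can simply be listed. This is a small sketch under the assumption that each language is stored as a `<code>.traineddata` file; the function name `installed_languages` is made up for illustration.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))
```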
SimpleIndex Pro and SimpleIndex OCR use the FineReader engine, which has one of the largest libraries of supported OCR languages. The OCR languages supported with FineReader on your station are listed here:
C:\Program Files (x86)\SimpleIndex\OCRLanguages.txt
English New Zealand
English South Africa
English United Kingdom
English United States
German New Spelling
German New Spelling Law
German New Spelling Medical
Malay Brunei Darussalam
Russian Old Spelling
Spanish Costa Rica
Spanish Dominican Republic
Spanish El Salvador
Spanish Modern Sort
Spanish Puerto Rico
Spanish Traditional Sort
Document Imaging was the more commonly used term in the early days of document scanning and OCR and refers to any system used to replicate documents used in business. It evolved from the microfilm days where it was referred to as Document Image Management.
Document Imaging allows for the scanning of paper documents, as well as the processing of files saved electronically. These files are then named and saved for later searching.
This setting changes the dictionary separator used for thesaurus matching from the default character, |, to any character(s) that you want. This is useful when the values in your list or dictionary might themselves include the pipe character (“|”, typed with Shift+Backslash).
This setting is also used as the delimiter when parsing multiple index field values from bar codes (e.g. field1|field2|field3).
Instructions for changing the dictionary separator value:
- Right click on the Job Configuration file that you would like to change and select Open With > Notepad
- Search the XML settings text open in Notepad for this term:
- Change the value in-between from “|” to any other single character that you want.
- For TAB separation use %TAB%
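To see why the separator matters, here is a rough illustration of how a delimiter splits one bar code value into several index field values. It is only a sketch of the idea, not SimpleIndex's own parser; the function name `split_index_values` is hypothetical, and `%TAB%` is treated as a stand-in for a tab character as described above.

```python
def split_index_values(raw_value, separator="|"):
    """Split a bar code value into index field values.

    The placeholder %TAB% stands for a tab character.
    """
    if separator == "%TAB%":
        separator = "\t"
    return raw_value.split(separator)
```

With the default separator, `split_index_values("field1|field2|field3")` gives three values. If a value can itself contain a pipe, choosing another separator such as `%TAB%` keeps it intact: `split_index_values("A|1\tB", "%TAB%")` gives `["A|1", "B"]`.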
Please refer to the Wiki Documentation for the complete OCR Options reference.
This setting changes the OCR recognition font or type from the default, which is “To Be Detected”. It can be used to look for a specific type of OCR font and is especially useful for recognizing fonts such as Dotmatrix, OCR A and OCR B.
Instructions for setting OCR Font:
1. Right click on the .sic file and open it with a text editor (Notepad, WordPad, etc.)
2. Find <OCR_TEXT_TYPE>. If you can’t find <OCR_TEXT_TYPE> then add the following as the last row in the text file:
3. Change the number in between: <OCR_TEXT_TYPE>#</OCR_TEXT_TYPE>
4. Number of desired font:
- 0 Normal
- 1 Typewriter
- 2 Dotmatrix
- 3 Index
- 5 OCR A
- 6 OCR B
- 7 MICR E13B
- 8 MICR CMC7
- 9 Gothic
- 10 To Be Detected
5. Save and close the file
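The same edit can be made with a short script instead of by hand. The sketch below works on the text of a configuration file and only uses the `<OCR_TEXT_TYPE>` tag and font numbers listed above; the function name `set_ocr_text_type` is hypothetical, and the script deliberately does not try to add the tag when it is missing.

```python
import re

# Font numbers as listed in the steps above.
OCR_TEXT_TYPES = {
    "Normal": 0, "Typewriter": 1, "Dotmatrix": 2, "Index": 3,
    "OCR A": 5, "OCR B": 6, "MICR E13B": 7, "MICR CMC7": 8,
    "Gothic": 9, "To Be Detected": 10,
}

_TAG = re.compile(r"<OCR_TEXT_TYPE>.*?</OCR_TEXT_TYPE>", re.DOTALL)


def set_ocr_text_type(config_text, font_name):
    """Return config_text with the OCR_TEXT_TYPE value set for font_name.

    Raises KeyError for an unknown font name and ValueError if the
    tag is not present in the text.
    """
    number = OCR_TEXT_TYPES[font_name]
    new_tag = f"<OCR_TEXT_TYPE>{number}</OCR_TEXT_TYPE>"
    updated, count = _TAG.subn(new_tag, config_text, count=1)
    if count == 0:
        raise ValueError("<OCR_TEXT_TYPE> not found; add it by hand first")
    return updated
```

To use it, read the .sic file into a string, pass it through `set_ocr_text_type(text, "OCR A")`, and write the result back. Keeping a backup copy of the original file is a sensible precaution.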
Please refer to the Wiki Documentation for the complete Autonumber reference.
To change the amount by which the Autonumber increments each time from 1 to any number that you want, do the following:
1. Right click on the configuration file and “Open With” any text editor, such as Notepad.
2. Search for the following:
3. Change the number in this entry to the amount that you want the Autonumber to Increment:
4. Save the configuration file.
I’m using full page OCR. The information is all appearing in the txt file but it is losing format about half way through. Data to the right is ending up at the end of the txt doc. Can this be fixed?
Please refer to the Wiki Documentation for the complete Full-Page OCR reference.
SimpleIndex version 7 solves this problem with the incorporation of the FineReader OCR engine. Full text in PDFs will now flow with the formatting of the PDF.
Legacy Versions: SimpleIndex can also be used with other OCR applications and servers to improve accuracy, formatting and performance. Use the OCR applications to convert the scanned images to text or searchable PDF, and SimpleIndex can extract index values from the text and automatically sort and organize the files.
Is there a way to just use part of a bar code or OCR value? For example, extract “50” from the value “124450”
Please refer to the Wiki Documentation for the complete Bar Code Recognition reference.
To do this example, create a barcode field (Field 1 for example) and a 2nd field with type “Fixed”. In the template for the 2nd field, enter %FIELD1[5,2]% to get “50” from “124450”.
%FIELD1% would get the entire value for Field #1, the barcode field. By adding the [5,2] you tell SimpleIndex to start at the 5th character (5) and take 2 characters from the value (50).
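The substring rule can be mimicked in a few lines of code, which may help when working out the right start and length before entering a template. This is only a sketch of the behavior described above, not SimpleIndex's implementation; the function name `expand_template` is hypothetical, and it handles only the `%FIELDn%` and `%FIELDn[start,length]%` forms.

```python
import re

# Matches %FIELDn% or %FIELDn[start,length]%.
_FIELD_REF = re.compile(r"%FIELD(\d+)(?:\[(\d+),(\d+)\])?%")


def expand_template(template, field_values):
    """Expand field references in template.

    field_values maps a field number to its text value.
    start is 1-based, matching the description above.
    """
    def replace(match):
        value = field_values[int(match.group(1))]
        if match.group(2) is None:
            return value  # plain %FIELDn%: the whole value
        start = max(int(match.group(2)) - 1, 0)
        length = int(match.group(3))
        return value[start:start + length]

    return _FIELD_REF.sub(replace, template)
```

For the example above, `expand_template("%FIELD1[5,2]%", {1: "124450"})` returns `"50"`, and `expand_template("%FIELD1%", {1: "124450"})` returns the full value.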
Please refer to the Wiki Documentation for the complete Handprint Recognition reference.
SimpleIndex offers two kinds of ICR (Intelligent Character Recognition) for converting printed and script handwriting to text.
The FineReader OCR engine offers handprint recognition designed for forms processing. It is optimized for hand-filled forms that use letter boxes or combs to ensure each letter is separated. FineReader will also work with underlined text as long as it is printed. For cursive scripts the Cloud OCR option is recommended.
Training has been removed with version 7 due to the addition of the ABBYY FineReader OCR engine.
Please refer to the Wiki Documentation for the complete Database Settings reference.
On the Database tab there is a dropdown in the lower portion of the panel for Full Text OCR Field. Enter the name of the field that will store the full-text data there. This must be configured for both the Insert and Retrieval mode configurations. The database field needs to be of sufficient length to store the entire text of your document.
Of course, the Insert Mode configuration must have “Enable Full Page OCR” checked to generate full text data from images. Text from MS Office documents, PDF files and existing OCR text files can be used without setting this option.
When designing your Retrieval Mode configuration, create a Text field to use for full text search queries. On the Database tab, set the corresponding “Database Field Name” to the full text database field.
When searching on your full text field, SimpleIndex finds the text you enter no matter where it appears in the document. It is able to match partial words. It does not perform boolean or natural language searches. The text entered must match the document text exactly.
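The matching rule described above amounts to a literal substring test. The one-line sketch below illustrates it; the function name `full_text_match` is hypothetical, and case-sensitive comparison is an assumption here, since the actual behavior may depend on the database being searched.

```python
def full_text_match(query, document_text):
    """Return True if query occurs anywhere in document_text as a literal substring."""
    return query in document_text
```

Under this rule a partial word such as "voice" matches a document containing "Invoice 1001 paid", while "Invoice AND paid" does not, because "AND" is treated as part of the text to find rather than as an operator.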
Please refer to the Wiki Documentation for the complete Zones & OCR Settings reference.
MS Office and PDF files generated by software or PDF printer drivers already have the text you need to recognize in the file. Scanned documents need to use OCR to read text from an image of the page. With Office and PDF files, SimpleIndex can just read the text, which is much faster and more accurate than image OCR.
To recognize index fields from the document text, first create OCR fields on the Index tab as you would normally. Next, on the Zones & OCR options tab, check the “Use Full Page OCR for this Field” option for each OCR field. This tells SimpleIndex to process the existing file text.
If the index value is a unique pattern of digits or list of possible values, use Template or Dictionary matching to locate the value within the text. Please see the manual for details on Template and Dictionary matching.
If the value appears in a specific location in each file, coordinates can be used to locate it. When processing text, the X, Y, Width and Height settings correspond to line and column numbers within the file text. This is explained in greater depth in the manual.
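As a rough picture of how line and column coordinates select text, the sketch below cuts a rectangular zone out of a block of text. The mapping is an assumption made for illustration (X as the 1-based starting column, Y as the 1-based starting line, Width in columns, Height in lines), and the function name `extract_text_zone` is hypothetical; the manual remains the authority on how SimpleIndex interprets these settings.

```python
def extract_text_zone(text, x, y, width, height):
    """Return the lines of a rectangular zone of text.

    Assumes x is the 1-based starting column, y the 1-based starting
    line, width a number of columns and height a number of lines.
    """
    lines = text.splitlines()
    zone_lines = lines[y - 1:y - 1 + height]
    return [line[x - 1:x - 1 + width] for line in zone_lines]
```

For a file whose first line is "Invoice No: 12345", `extract_text_zone(text, 13, 1, 5, 1)` returns `["12345"]`.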
SimpleIndex will assume that any TXT file with the same name as a file being processed is the OCR text for that file, so this method can work with any type of file.
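The same-name convention can be checked with a few lines of code before running a job. This sketch assumes "same name" means the same base name with a .txt extension in the same folder (for example, invoice.tif and invoice.txt); the function name `find_ocr_text` and the UTF-8 encoding are also assumptions.

```python
from pathlib import Path


def find_ocr_text(file_path):
    """Return the text of the same-named .txt file next to file_path, or None."""
    text_path = Path(file_path).with_suffix(".txt")
    if text_path.is_file():
        return text_path.read_text(encoding="utf-8", errors="replace")
    return None
```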
Find out more about Optical Character Recognition on the SimpleOCR Guide.
There are several things you can do to improve accuracy for OCR.
- Scan at 300dpi, black & white for best results.
- Adjust the scan settings to remove background noise and improve the definition of characters.
- For Zone OCR, field recognition can often vary based on the surrounding white space and text in the zone. Try varying the size of the zone to achieve optimal results.
- For template matching, make sure all variations of the field format are included in the template list.
- For dictionary matching, add common variations and OCR mistakes to the “thesaurus” list.
- On the Zones & OCR tab (accessed from the Job Options) you can adjust the Max Errors setting to allow for more mistakes in the dictionary matching process.
- Use the Strip Spaces, Strip Characters, Replace Characters and Case Fixing options to standardize the field format prior to matching.
Please refer to the SimpleIndex Wiki for details on how to configure these options.
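To get a feel for how clean-up options and a Max Errors limit interact in dictionary matching, the sketch below strips spaces, ignores case, and accepts the closest dictionary entry within a given number of errors. It assumes an error means one inserted, deleted or substituted character (Levenshtein distance), which may differ from how SimpleIndex counts errors; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_dictionary(ocr_value, dictionary, max_errors=1):
    """Return the closest dictionary entry within max_errors edits, or None.

    Values are compared with spaces stripped and case ignored.
    """
    def normalize(value):
        return value.replace(" ", "").upper()

    cleaned = normalize(ocr_value)
    best_entry, best_distance = None, max_errors + 1
    for entry in dictionary:
        distance = edit_distance(cleaned, normalize(entry))
        if distance < best_distance:
            best_entry, best_distance = entry, distance
    return best_entry
```

For example, the OCR misread "lnvoice" still matches "Invoice" with `max_errors=1`, while an unrelated value such as "Statement" returns `None`. Raising the limit accepts more misreads but also increases the chance of a wrong match.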
- SimpleIndex.com – Zone OCR
- SimpleIndex.com – Dynamic OCR
- SimpleOCR.com – OCR Guide
- SimpleIndex Wiki – OCR
- SimpleIndex Wiki – OCR Options
- SimpleIndex Wiki – Zone OCR
- SimpleIndex Wiki – Full Page OCR
- SimpleIndex Wiki – Zones & OCR Settings
- SimpleIndex Wiki – OCR to Field
- SimpleIndex Wiki – OCR Text View
- SimpleIndex Wiki – Template & Dictionary Matching OCR
- SimpleIndex Wiki – OMR and OCR Document Separation
Yes. On the OCR step of the Job Settings Wizard you can select the text output format needed in the “Full-page OCR file type” drop down. By default it is set to PDF, but it can be changed to Text (txt), Word (docx), Rich Text (rtf), Open Office (odt), Excel (xlsx), PowerPoint (pptx), ePub Zip (epub), FictionBook (fb2), HTML (htm), XML (xml) or Alto XML (alto.xml).
If the output file type is set to PDF, OCR text will be embedded as hidden text in the PDF file.
Yes, it can. You can configure this setting in the Job Settings Wizard by going to the OCR step and checking “Enable full-page OCR”. There are many settings in the OCR step that you can use to customize the output and recognition of images.
SimpleIndex has two different OCR engines (Standard and Professional) that can be used to produce PDF Image + Text files or Searchable PDFs.