OCR Pages - SimpleIndex

Language Pack for Standard/Tesseract OCR

Monday, 01 November 2021 by Alex Stewart

Please refer to the Wiki Documentation for the complete Global Settings Wizard reference.

All versions of the SimpleIndex software include OCR with the Standard/Tesseract OCR engine. The SimpleIndex download only includes a limited set of languages with the installation. If the language you would like to OCR with SimpleIndex isn’t one of the languages included then you can download your required language(s). Once you do this you will be able to pick the language that you want to read with the Standard/Tesseract OCR engine.

Go to the Tesseract Language Download Site.
Select the language you want and download it, or download all the languages.
Copy the language files (unzip if downloading more than one language) to this folder: C:\Program Files (x86)\SimpleIndex\Tesseract\v3.04\tessdata
Close and reopen SimpleIndex. The downloaded languages will now be selectable.
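
The copy step can also be scripted. The following is a minimal Python sketch, not part of SimpleIndex. It assumes the downloaded language files use Tesseract's .traineddata extension (the file pattern is a parameter in case a language pack ships other files), and the function name copy_language_files is hypothetical. The destination is the folder given in the steps above.

```python
import shutil
from pathlib import Path

# Destination folder from the steps above; adjust the version folder
# to match your installation.
TESSDATA_DIR = Path(r"C:\Program Files (x86)\SimpleIndex\Tesseract\v3.04\tessdata")


def copy_language_files(source_dir, dest_dir=TESSDATA_DIR, pattern="*.traineddata"):
    """Copy every file matching pattern from source_dir into dest_dir.

    Returns the sorted list of file names that were copied.
    The default pattern is an assumption of this sketch.
    """
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    copied = []
    for lang_file in sorted(source_dir.glob(pattern)):
        shutil.copy2(lang_file, dest_dir / lang_file.name)
        copied.append(lang_file.name)
    return copied

# Example call (hypothetical download folder):
# copy_language_files(Path.home() / "Downloads" / "tessdata")
```

Writing to a folder under Program Files usually requires running the script from an administrator prompt.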


Languages Supported in SimpleSoftware OCR Engines

Monday, 02 December 2019 by Simple Software

Please refer to the Wiki Documentation for the complete Languages reference.

SimpleSoftware OCR engines use two different systems for language support. The languages available to your OCR depend on the version of SimpleIndex you have installed; add-ons (SimpleIndex Server, SimpleCoversheet, and so on) do not add language support.

All SimpleSoftware products support the Tesseract 3.02 OCR languages. You can learn more about them and download additional language libraries from the Tesseract language download site. You can check which Tesseract language libraries are installed on your workstation, and add more, in this folder:

C:\Program Files (x86)\SimpleIndex\Tesseract\v3.02\tessdata

SimpleIndex Pro and SimpleIndex OCR use the FineReader engine, which has one of the largest libraries of supported OCR languages. You can check the OCR languages supported by FineReader on your workstation in this file:

C:\Program Files (x86)\SimpleIndex\OCRLanguages.txt

Abkhaz
Adyghe
Afrikaans
Agul
Albanian
Altaic
Armenian Eastern
Armenian Grabar
Armenian Western
Awar
Aymara
Azeri Cyrillic
Azeri Latin
Bashkir
Basque
Belarusian
Bemba
Blackfoot
Breton
Bugotu
Bulgarian
Buryat
Catalan
Chamorro
Chechen
Chukcha
Chuvash
Corsican
Crimean Tatar
Croatian
Crow
Czech
Danish
Dargwa
Dungan
Dutch Belgian
Dutch Standard
English
English Australian
English Belize
English Canadian
English Caribbean
English Ireland
English Jamaica
English Law
English Medical
English New Zealand
English Philippines
English South Africa
English Trinidad
English United Kingdom
English United States
English Zimbabwe
Eskimo Cyrillic
Eskimo Latin
Esperanto
Estonian
Even
Evenki
Faeroese
Fijian
Finnish
French
French Belgian
French Canadian
French Luxembourg
French Monaco
French Standard
French Swiss
Frisian
Friulian
Gaelic Scottish
Gagauz
Galician
Ganda
German
German Austrian
German Law
German Liechtenstein
German Luxembourg
German Medical
German New Spelling
German New Spelling Law
German New Spelling Medical
German Standard
German Swiss
Greek
Guarani
Hani
Hausa
Hawaiian
Hungarian
Icelandic
Ido
Indonesian
Ingush
Interlingua
Irish
Italian
Italian Standard
Italian Swiss
Kabardian
Kalmyk
Karachay Balkar
Karakalpak
Kasub
Kawa
Kazakh
Khakas
Khanty
Kikuyu
Kirgiz
Kongo
Koryak
Kpelle
Kumyk
Kurdish

Lak
Lappish
Latin
Latvian
Latvian Gothic
Lezgin
Lithuanian
Lithuanian Classic
Luba
Macedonian
Malagasy
Malay Brunei Darussalam
Malay Malaysian
Malinke
Maltese
Mansi
Maori
Mari
Maya
Miao
Minankabaw
Mohawk
Mongol
Mordvin
Nahuatl
Nenets
Nivkh
Nogay
Norwegian Bokmal
Norwegian Nynorsk
Null
Nyanja
Occidental
Ojibway
Old English
Old French
Old German
Old Italian
Old Spanish
Ossetic
Papiamento
Pidgin English
Polish
Portuguese Brazilian
Portuguese Standard
Provencal
Quechua
Rhaeto Romanic
Romanian
Romanian Moldavia
Romany
Ruanda
Rundi
Russian
Russian Moldavia
Russian Old Spelling
Samoan
Selkup
Serbian Cyrillic
Serbian Latin
Shona
Sioux
Slovak
Slovenian
Somali
Sorbian
Sotho
Spanish
Spanish Argentina
Spanish Bolivia
Spanish Chile
Spanish Colombia
Spanish Costa Rica
Spanish Dominican Republic
Spanish Ecuador
Spanish El Salvador
Spanish Guatemala
Spanish Honduras
Spanish Mexican
Spanish Modern Sort
Spanish Nicaragua
Spanish Panama
Spanish Paraguay
Spanish Peru
Spanish Puerto Rico
Spanish Traditional Sort
Spanish Uruguay
Spanish Venezuela
Sunda
Swahili
Swazi
Swedish
Swedish Finland
Tabassaran
Tagalog
Tahitian
Tajik
Tatar
Tinpo
Tongan
Tswana
Tun
Turkish
Turkmen
Tuvin
Udmurt
Uighur Cyrillic
Uighur Latin
Ukrainian
Uzbek Cyrillic
Uzbek Latin
Visayan
Welsh
Wolof
Xhosa
Yakut
Yiddish
Zapotec
Zulu

Invoice OCR OCR OCR Form Processing OCR Scanning Server OCR Zone OCR


What is Document Imaging?

Wednesday, 31 July 2019 by aaron

Document Imaging was the more commonly used term in the early days of document scanning and OCR and refers to any system used to replicate documents used in business. It evolved from the microfilm days where it was referred to as Document Image Management.

Document Imaging allows for the scanning of paper documents, as well as the processing of files saved electronically. These files are then named and saved for later searching.

Other document imaging terms include automatic imaging software, best digital imaging software, best imaging software, desktop imaging software, digital document imaging, digital imaging software, document imaging download, document imaging PDF, document imaging processing, document imaging products, document imaging software, document imaging solution, document imaging solutions, document imaging systems, document imaging technologies, document imaging technology, document imaging tools, image to database, imaging resource, imaging scanning software, imaging software companies, imaging software download, imaging software for windows, imaging solution, scanner imaging software, scanning and imaging, scanning imaging, and software for imaging.

Automatic Data Capture Automatic Indexing Software Document Automation Document Classification Document Imaging Document Management Software Document Scanning Image Scanning Keyword Indexing Office PDF Document Indexing Personal Document Management QuickBooks Document Management Required Documents Auditing Scanned Document Indexing Workflow


Change the Dictionary Separator Value

Monday, 29 July 2019 by Simple Software

This setting changes the dictionary separator used in thesaurus matching from the default pipe character (“|”, typed as Shift+Backslash) to a character of your choice. This is useful when the values in your list or dictionary may themselves contain the pipe character.

This setting is also used as the delimiter when parsing multiple index field values from bar codes (e.g. field1|field2|field3).

Instructions for changing the dictionary separator value:

Right-click the Job Configuration file that you would like to change and select Open With>Notepad
Search the XML settings text open in Notepad for this term:
<OCR_DICT_SEPARATOR>
Change the value between the opening and closing tags from “|” to any other single character that you want.
For TAB separation use %TAB%
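
To illustrate how the separator acts as a delimiter when a bar code carries several index field values, here is a minimal Python sketch. It is not SimpleIndex code: the function name split_fields is hypothetical, and translating the %TAB% token into a real tab character is an assumption of the sketch.

```python
def split_fields(value, separator="|"):
    """Split a combined value into individual index field values.

    The %TAB% token is translated to a real tab character; that
    translation is an assumption of this sketch.
    """
    if separator == "%TAB%":
        separator = "\t"
    return value.split(separator)


# Default separator, as in the bar code example above.
print(split_fields("field1|field2|field3"))
# A different separator lets the values themselves contain "|".
print(split_fields("A|1;B|2;C|3", separator=";"))
```

The second call shows why changing the separator helps: with ";" as the delimiter, the pipe characters stay inside the field values.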


Bar Code Scanning Bar Codes Barcode OCR Barcode Reading Software Barcode Recognition Software OCR OCR Form Processing OCR Scanning PDF Barcode Recognition Zone OCR


Change the OCR Font or Type

Monday, 29 July 2019 by Simple Software

Please refer to the Wiki Documentation for the complete OCR Options reference.

This setting changes the OCR recognition font or type from the default, which is “To Be Detected”. It can be used to look for a specific type of OCR font and is especially useful for recognizing things like Dotmatrix, OCR A and OCR B.

Instructions for setting OCR Font:

1. Right-click the .sic file, choose Open With, and select a text editor (Notepad, WordPad, etc.)

2. Find <OCR_TEXT_TYPE>. If you can’t find <OCR_TEXT_TYPE> then add the following as the last row in the text file:

<OCR_TEXT_TYPE>#</OCR_TEXT_TYPE>

3. Change the number in between: <OCR_TEXT_TYPE>#</OCR_TEXT_TYPE>

4. Use the number of the desired font:

0 Normal
1 Typewriter
2 Dotmatrix
3 Index
5 OCR A
6 OCR B
7 MICR E13B
8 MICR CMC7
9 Gothic
10 To Be Detected

5. Save and close the file
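
The same edit can be made programmatically. The following is a minimal Python sketch that works on the configuration text as a plain string. The function name set_setting is hypothetical, and the sketch assumes the tag appears at most once and that its value contains no nested tags. When the tag is missing, the sketch appends it as the last line, mirroring step 2.

```python
import re


def set_setting(config_text, tag, value):
    """Return config_text with <tag>...</tag> set to value.

    If the tag is missing, a new line is appended at the end of the
    text, mirroring step 2 above.
    """
    pattern = re.compile(r"<{0}>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    element = "<{0}>{1}</{0}>".format(tag, value)
    if pattern.search(config_text):
        # A function replacement keeps backslashes in value literal.
        return pattern.sub(lambda _match: element, config_text, count=1)
    return config_text.rstrip("\n") + "\n" + element + "\n"


sample = "<OCR_TEXT_TYPE>10</OCR_TEXT_TYPE>\n"
# 5 selects OCR A in the table above.
print(set_setting(sample, "OCR_TEXT_TYPE", 5))
```

Keep a backup copy of the .sic file before writing modified text back to it.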

Clipboard OCR OCR OCR Form Processing OCR Scanning Screen Scraping OCR Screenshot OCR TIFF PDF Annotations Zone OCR


Regular Expression (RegEx) – Syntax or Type

Monday, 29 July 2019 by Simple Software

Please refer to the Wiki Documentation for the complete Regular Expressions reference.

SimpleIndex uses the .NET regular expressions library.

.NET regular expression syntax is largely compatible with Perl 5 and closely resembles the JavaScript/ECMAScript syntax for most common patterns.

For more information see the Regular Expressions Wiki Page.

Barcode OCR Clipboard OCR Invoice OCR OCR OCR Form Processing OCR Scanning Screen Scraping OCR Screenshot OCR TWAIN Scanning Software Unattended Processing Zone OCR


Autonumber Increment Value

Monday, 29 July 2019 by Simple Software

Please refer to the Wiki Documentation for the complete Autonumber reference.

To change the amount by which the Autonumber increments each time from 1 to any number you want, do the following:

1. Right click on the configuration file and “Open With” any text editor, such as Notepad.
2. Search for the following:
AUTONUMBER_COUNT
3. Change the number in this entry to the amount that you want the Autonumber to Increment:
<AUTONUMBER_COUNT>1</AUTONUMBER_COUNT>
4. Save the configuration file.
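
To show the effect of the increment value, here is a minimal Python sketch of a counter that steps by a configurable amount. It only illustrates the arithmetic. The function name autonumber and the starting value of 1 are assumptions, not SimpleIndex behavior.

```python
from itertools import count, islice


def autonumber(start=1, increment=1):
    """Yield start, start + increment, start + 2 * increment, ..."""
    return count(start, increment)


# With an increment of 5 the first four values are 1, 6, 11 and 16.
print(list(islice(autonumber(start=1, increment=5), 4)))  # [1, 6, 11, 16]
```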

Automatic Data Capture Document Automation


I’m using full page OCR. The information is all appearing in the txt file but it is losing format about half way through. Data to the right is ending up at the end of the txt doc. Can this be fixed?

Wednesday, 28 February 2018 by dwilder

Please refer to the Wiki Documentation for the complete Full-Page OCR reference.

SimpleIndex version 7 solves this problem with the incorporation of the FineReader OCR engine. Full text in PDFs will now flow with the formatting of the PDF.

Legacy Versions: SimpleIndex can also be used with other OCR applications and servers to improve accuracy, formatting and performance. Use the OCR applications to convert the scanned images to text or searchable PDF, and SimpleIndex can extract index values from the text and automatically sort and organize the files.

Full Text Indexing OCR OCR Form Processing OCR Scanning Office PDF Text Processing PDF Data Extraction Software Text Processing Unattended Processing Zone OCR

Published in OCR


Is there a way to just use part of a bar code or OCR value? For example, extract “50” from the value “124450”

Wednesday, 28 February 2018 by dwilder

Please refer to the Wiki Documentation for the complete Bar Code Recognition reference.

To do this example, create a barcode field (Field 1 for example) and a 2nd field with type “Fixed”. In the template for the 2nd field, enter %FIELD1[5,2]% to get “50” from “124450”.

%FIELD1% would get the entire value for Field #1, the barcode field. By adding the [5,2] you tell SimpleIndex to start at the 5th character (5) and take 2 characters from the value (50).
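
The [start,length] notation counts characters from 1. The following minimal Python sketch reproduces the same arithmetic so you can check a template before using it. The function name field_slice is hypothetical.

```python
def field_slice(value, start, length):
    """Return length characters of value, counting start from 1."""
    return value[start - 1:start - 1 + length]


# Equivalent of %FIELD1[5,2]% when Field 1 contains "124450".
print(field_slice("124450", 5, 2))  # 50
```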

Find out more about barcode scanning on our Barcode Scanning Guide and read up on Optical Character Recognition on the SimpleOCR scanning solutions guide.

Published in Bar Codes, OCR, Office PDF Text Processing


If I have a form which is filled manually by hand, can SimpleIndex read the data from it?

Wednesday, 28 February 2018 by dwilder

Please refer to the Wiki Documentation for the complete Handprint Recognition reference.

SimpleIndex offers two kinds of ICR (Intelligent Character Recognition) for converting printed and script handwriting to text.

The Cloud OCR feature enables the Amazon AWS Textract OCR engine, which can read unconstrained print and scripted handwriting with surprisingly good accuracy.

The FineReader OCR engine offers handprint recognition designed for forms processing. It is optimized for hand-filled forms that use letter boxes or combs to ensure each letter is separated. FineReader will also work with underlined text as long as it is printed. For cursive scripts the Cloud OCR option is recommended.

OCR Form Processing

Published in OCR


How do you train the OCR engine for better accuracy?

Wednesday, 28 February 2018 by dwilder

Training has been removed with version 7 due to the addition of the ABBYY FineReader OCR engine.

Invoice OCR OCR OCR Form Processing OCR Scanning Screen Scraping OCR Screenshot OCR TWAIN Scanning Software Unattended Processing Zone OCR

Published in OCR


How do you configure full text searching in Retrieval mode?

Wednesday, 28 February 2018 by dwilder

Please refer to the Wiki Documentation for the complete Database Settings reference.

On the Database tab there is a dropdown in the lower portion of the panel for Full Text OCR Field. Enter the name of the field that will store the full-text data there. This must be configured for both the Insert and Retrieval mode configurations. The database field must be long enough to store the entire text of your document.

Of course, the Insert Mode configuration must have “Enable Full Page OCR” checked to generate full text data from images. Text from MS Office documents, PDF files and existing OCR text files can be used without setting this option.

When designing your Retrieval Mode configuration, create a Text field to use for full text search queries. On the Database tab, set the corresponding “Database Field Name” to the full text database field.

When searching on your full text field, SimpleIndex finds the text you enter no matter where it appears in the document. It is able to match partial words. It does not perform boolean or natural language searches. The text entered must match the document text exactly.
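
The matching behavior described above can be pictured as a plain substring test. The following minimal Python sketch is only an illustration, not the actual query SimpleIndex runs against the database. The function name full_text_match and the sample text are hypothetical, and whether the real search is case-sensitive is not stated here.

```python
def full_text_match(query, document_text):
    """True if query occurs verbatim anywhere in document_text."""
    return query in document_text


text = "Invoice 10452 total due 31 March"
print(full_text_match("voice 104", text))          # True: partial words match
print(full_text_match("Invoice AND total", text))  # False: no boolean operators
```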

Published in Database & Retrieval, OCR


How do you configure OCR to read index information from MS Office or PDF documents?

Wednesday, 28 February 2018 by dwilder

Please refer to the Wiki Documentation for the complete Zones & OCR Settings reference.

MS Office and PDF files generated by software or PDF printer drivers already have the text you need to recognize in the file. Scanned documents need to use OCR to read text from an image of the page. With Office and PDF files, SimpleIndex can just read the text, which is much faster and more accurate than image OCR.

To recognize index fields from the document text, first create OCR fields on the Index tab as you would normally. Next, on the Zones & OCR options tab, check the “Use Full Page OCR for this Field” option for each OCR field. This tells SimpleIndex to process the existing file text.

If the index value is a unique pattern of digits or list of possible values, use Template or Dictionary matching to locate the value within the text. Please see the manual for details on Template and Dictionary matching.

If the value appears in a specific location in each file, coordinates can be used to locate it. When processing text, the X, Y, Width and Height settings correspond to line and column numbers within the file text. This is explained in greater depth in the manual.

SimpleIndex will assume that any TXT file with the same name as a file being processed is the OCR text for that file, so this method can work with any type of file.
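
If you produce the text with another tool, the TXT file has to sit next to the file it describes. The following minimal Python sketch shows one way to derive and read such a companion file. It assumes "same name" means the same base name with a .txt extension in the same folder, and that the text is UTF-8. Both are assumptions of the sketch, and the function names are hypothetical.

```python
from pathlib import Path


def sidecar_text_path(file_path):
    """Return the path of the TXT file that shares file_path's base name."""
    return Path(file_path).with_suffix(".txt")


def read_sidecar_text(file_path):
    """Return the companion text, or None when no TXT file exists."""
    txt = sidecar_text_path(file_path)
    if not txt.is_file():
        return None
    return txt.read_text(encoding="utf-8", errors="replace")


# For scans/invoice_001.tif the companion file is scans/invoice_001.txt.
print(sidecar_text_path("scans/invoice_001.tif").name)  # invoice_001.txt
```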

Find out more about Optical Character Recognition on the SimpleOCR Guide.

Microsoft Word Data Extraction MS Office Office PDF Document Indexing Office PDF Text Processing Office to PDF Paperless Office PDF PDF Archive Scanning Software PDF Barcode Recognition PDF Data Extraction Software PDF Forms Text Processing Unattended Processing

Published in OCR, Office PDF Text Processing


How can I improve recognition rates for my OCR fields?

Wednesday, 28 February 2018 by dwilder

There are several things you can do to improve accuracy for OCR.

Scan at 300dpi, black & white for best results.
Adjust the scan settings to remove background noise and improve the definition of characters.
For Zone OCR, field recognition can often vary based on the surrounding white space and text in the zone. Try varying the size of the zone to achieve optimal results.
For template matching, make sure all variations of the field format are included in the template list.
For dictionary matching, add common variations and OCR mistakes to the “thesaurus” list.
On the Zones & OCR tab (accessed from the Job Options) you can adjust the Max Errors setting to allow for more mistakes in the dictionary matching process.
Use the Strip Spaces, Strip Characters, Replace Characters and Case Fixing options to standardize the field format prior to matching.

Please refer to the SimpleIndex Wiki for details on how to configure these options.
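
To picture how a Max Errors tolerance interacts with dictionary matching, here is a minimal Python sketch that normalizes the OCR value and accepts the closest dictionary entry within a given number of character edits. How SimpleIndex actually counts errors is not described here. The use of Levenshtein edit distance, the normalization steps, and the function names are all assumptions of the sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute
            ))
        previous = current
    return previous[-1]


def normalize(value):
    """Strip spaces and upper-case the value before matching."""
    return value.replace(" ", "").upper()


def match_dictionary(ocr_value, dictionary, max_errors=1):
    """Return the closest dictionary entry within max_errors, else None."""
    if not dictionary:
        return None
    cleaned = normalize(ocr_value)
    best = min(dictionary, key=lambda entry: edit_distance(cleaned, normalize(entry)))
    if edit_distance(cleaned, normalize(best)) <= max_errors:
        return best
    return None


print(match_dictionary("INV0ICE", ["INVOICE", "RECEIPT"]))  # INVOICE
print(match_dictionary("XYZ", ["INVOICE", "RECEIPT"]))      # None
```

Raising max_errors accepts more OCR mistakes but also increases the chance of matching the wrong entry, so adding known variations to the thesaurus list is usually the safer first step.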

Can OCR text be saved to Office, Text, HTML or other formats?

Wednesday, 28 February 2018 by dwilder

Yes. On the OCR step of the Job Settings Wizard you can select the text output format you need in the “Full-page OCR file type” drop down. By default it is set to PDF, but can be changed to Text (txt), Word (docx), Rich Text (rtf), Open Office (odt), Excel (xlsx), PowerPoint (pptx), ePub Zip (epub), FictionBook (fb2), HTML (htm), XML (xml) or Alto XML (alto.xml).

If the output file type is set to PDF, OCR text will be embedded as hidden text in the PDF file.

Can SimpleIndex create searchable PDF Image+Text files with hidden text?

Wednesday, 28 February 2018 by dwilder

Yes, it can. You can configure this setting in the Job Settings Wizard by going to the OCR step and checking “Enable full-page OCR”. There are many settings in the OCR step that you can use to customize the output and recognition of images.

SimpleIndex has two different OCR engines (Standard and Professional) that can be used to produce PDF Image + Text files or Searchable PDFs.
