Optical Character Recognition
Languages Supported in SimpleSoftware OCR Engines
SimpleSoftware OCR engines use two different systems for language support. The languages available to your OCR engine depend on the version of SimpleIndex you have installed; add-ons (SimpleIndex Server, SimpleCoversheet, and so on) do not add any language support.
All SimpleSoftware products include Tesseract 3.02 OCR language support. You can learn more about it and download additional language libraries from the Tesseract project. You can check which Tesseract language libraries are installed on your workstation, and add more, in this folder:
C:\Program Files (x86)\SimpleIndex\Tesseract\v3.02\tessdata
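If you would rather check that folder from a script, the following is a minimal sketch. It assumes the language libraries are stored as `<code>.traineddata` files, which is the usual Tesseract 3.x layout, and the function name `installed_tesseract_languages` is hypothetical, not part of SimpleIndex.

```python
from pathlib import Path

# Default folder quoted above; adjust it if SimpleIndex is installed elsewhere.
TESSDATA_DIR = Path(r"C:\Program Files (x86)\SimpleIndex\Tesseract\v3.02\tessdata")

SUFFIX = ".traineddata"  # assumption: usual Tesseract 3.x language file suffix


def installed_tesseract_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file in the folder."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        # Folder missing (for example, a different install path): report nothing.
        return []
    return sorted(
        p.name[: -len(SUFFIX)]
        for p in folder.iterdir()
        if p.is_file() and p.name.endswith(SUFFIX)
    )


if __name__ == "__main__":
    for code in installed_tesseract_languages(TESSDATA_DIR):
        print(code)
```

An empty result means either that the folder was not found at that path or that it contains no language files.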
SimpleIndex Pro and SimpleIndex OCR use the FineReader engine, which has one of the largest libraries of supported OCR languages. You can check which languages FineReader supports on your workstation in this file:
C:\Program Files (x86)\SimpleIndex\OCRLanguages.txt
Abkhaz
Adyghe
Afrikaans
Agul
Albanian
Altaic
Armenian Eastern
Armenian Grabar
Armenian Western
Awar
Aymara
Azeri Cyrillic
Azeri Latin
Bashkir
Basque
Belarusian
Bemba
Blackfoot
Breton
Bugotu
Bulgarian
Buryat
Catalan
Chamorro
Chechen
Chukcha
Chuvash
Corsican
Crimean Tatar
Croatian
Crow
Czech
Danish
Dargwa
Dungan
Dutch Belgian
Dutch Standard
English
English Australian
English Belize
English Canadian
English Caribbean
English Ireland
English Jamaica
English Law
English Medical
English New Zealand
English Philippines
English South Africa
English Trinidad
English United Kingdom
English United States
English Zimbabwe
Eskimo Cyrillic
Eskimo Latin
Esperanto
Estonian
Even
Evenki
Faeroese
Fijian
Finnish
French
French Belgian
French Canadian
French Luxembourg
French Monaco
French Standard
French Swiss
Frisian
Friulian
Gaelic Scottish
Gagauz
Galician
Ganda
German
German Austrian
German Law
German Liechtenstein
German Luxembourg
German Medical
German New Spelling
German New Spelling Law
German New Spelling Medical
German Standard
German Swiss
Greek
Guarani
Hani
Hausa
Hawaiian
Hungarian
Icelandic
Ido
Indonesian
Ingush
Interlingua
Irish
Italian
Italian Standard
Italian Swiss
Kabardian
Kalmyk
Karachay Balkar
Karakalpak
Kasub
Kawa
Kazakh
Khakas
Khanty
Kikuyu
Kirgiz
Kongo
Koryak
Kpelle
Kumyk
Kurdish
Lak
Lappish
Latin
Latvian
Latvian Gothic
Lezgin
Lithuanian
Lithuanian Classic
Luba
Macedonian
Malagasy
Malay Brunei Darussalam
Malay Malaysian
Malinke
Maltese
Mansi
Maori
Mari
Maya
Miao
Minankabaw
Mohawk
Mongol
Mordvin
Nahuatl
Nenets
Nivkh
Nogay
Norwegian Bokmal
Norwegian Nynorsk
Null
Nyanja
Occidental
Ojibway
Old English
Old French
Old German
Old Italian
Old Spanish
Ossetic
Papiamento
Pidgin English
Polish
Portuguese Brazilian
Portuguese Standard
Provencal
Quechua
Rhaeto Romanic
Romanian
Romanian Moldavia
Romany
Ruanda
Rundi
Russian
Russian Moldavia
Russian Old Spelling
Samoan
Selkup
Serbian Cyrillic
Serbian Latin
Shona
Sioux
Slovak
Slovenian
Somali
Sorbian
Sotho
Spanish
Spanish Argentina
Spanish Bolivia
Spanish Chile
Spanish Colombia
Spanish Costa Rica
Spanish Dominican Republic
Spanish Ecuador
Spanish El Salvador
Spanish Guatemala
Spanish Honduras
Spanish Mexican
Spanish Modern Sort
Spanish Nicaragua
Spanish Panama
Spanish Paraguay
Spanish Peru
Spanish Puerto Rico
Spanish Traditional Sort
Spanish Uruguay
Spanish Venezuela
Sunda
Swahili
Swazi
Swedish
Swedish Finland
Tabassaran
Tagalog
Tahitian
Tajik
Tatar
Tinpo
Tongan
Tswana
Tun
Turkish
Turkmen
Tuvin
Udmurt
Uighur Cyrillic
Uighur Latin
Ukrainian
Uzbek Cyrillic
Uzbek Latin
Visayan
Welsh
Wolof
Xhosa
Yakut
Yiddish
Zapotec
Zulu
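To look up a language in OCRLanguages.txt without scrolling through the list, a short script along these lines may help. It is only a sketch: it assumes the file holds one language name per line, as in the list above, and that it can be read as UTF-8 text. The helper names `load_languages` and `is_supported` are hypothetical.

```python
from pathlib import Path

# File location quoted above; adjust it if SimpleIndex is installed elsewhere.
LANGUAGE_FILE = Path(r"C:\Program Files (x86)\SimpleIndex\OCRLanguages.txt")


def load_languages(path):
    """Read the language names, assuming one name per line; blank lines are skipped."""
    # Assumption: UTF-8 text. Undecodable bytes are replaced rather than raising.
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def is_supported(language, languages):
    """Case-insensitive exact match against the loaded names."""
    wanted = language.strip().casefold()
    return any(name.casefold() == wanted for name in languages)


if __name__ == "__main__":
    if LANGUAGE_FILE.is_file():
        names = load_languages(LANGUAGE_FILE)
        print("Spanish Mexican supported:", is_supported("Spanish Mexican", names))
    else:
        print("OCRLanguages.txt not found at", LANGUAGE_FILE)
```

The match is exact, so regional variants such as "English Canadian" must be spelled as they appear in the file.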
What is the point of SimpleQC?
SimpleQC is now SimpleView, with many enhancements. In a nutshell, it is designed to let you quickly browse folders containing multi-page TIFF or PDF documents. The two main uses for this are:
1. Review scanned documents for quality control
A scanned document will sometimes be too light or too dark to read, and with some types of paper this can happen quite often. Use SimpleView to find these pages quickly and rescan them. You can also correct page order, rotation, skew, etc.
2. Use as a document viewer
SimpleIndex and other scanning applications create folders and files on your hard drive or network to store documents. Use SimpleView to quickly browse image thumbnails by folder and filename. Auto-rotate, enhance and OCR images as needed.
SimpleView is different from other thumbnail viewers because:
- It loads multi-page TIFF files very quickly
- It displays thumbnails for files as well as pages within multi-page files on the same screen
- It has many functions for document QC, such as auto-selecting even and odd pages or files for rotation, and rescanning pages
- It displays thumbnails for PDF files and opens them in the Acrobat viewer
- With Acrobat Standard or Pro, you can enable editing and signing of PDF files
- Viewing of office documents and other electronic formats is also available
- Published in SimpleView
Can OCR text be saved to Office, Text, HTML or other formats?
Yes. On the OCR step of the Job Settings Wizard, you can select the text output format you need in the “Full-page OCR file type” drop-down. By default it is set to PDF, but it can be changed to Text (txt), Word (docx), Rich Text (rtf), Open Office (odt), Excel (xlsx), PowerPoint (pptx), ePub Zip (epub), FictionBook (fb2), HTML (htm), XML (xml) or Alto XML (alto.xml).
If the output file type is set to PDF, OCR text will be embedded as hidden text in the PDF file.
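If you post-process the OCR output with your own scripts, it can help to map file names back to the format names listed above. The sketch below does this by extension. The `.pdf` extension for the default PDF output is an assumption, and the names `OCR_OUTPUT_EXTENSIONS` and `output_format` are hypothetical, not part of SimpleIndex.

```python
# Format names and extensions as listed in the answer above.
OCR_OUTPUT_EXTENSIONS = {
    "PDF": ".pdf",  # assumption: extension not stated in the text above
    "Text": ".txt",
    "Word": ".docx",
    "Rich Text": ".rtf",
    "Open Office": ".odt",
    "Excel": ".xlsx",
    "PowerPoint": ".pptx",
    "ePub Zip": ".epub",
    "FictionBook": ".fb2",
    "HTML": ".htm",
    "XML": ".xml",
    "Alto XML": ".alto.xml",
}


def output_format(filename):
    """Return the format name for a file name, or None if the extension is not listed."""
    name = filename.lower()
    # Check longer extensions first so ".alto.xml" is not mistaken for ".xml".
    by_length = sorted(
        OCR_OUTPUT_EXTENSIONS.items(), key=lambda item: len(item[1]), reverse=True
    )
    for label, extension in by_length:
        if name.endswith(extension):
            return label
    return None


if __name__ == "__main__":
    for sample in ["invoice001.pdf", "invoice001.alto.xml", "invoice001.tif"]:
        print(sample, "->", output_format(sample))
```

Checking the longest extension first matters only for Alto XML, whose extension ends in the same characters as plain XML.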
- Published in Licensing & Installation, OCR
Can SimpleIndex create searchable PDF Image+Text files with hidden text?
Yes, it can. You can configure this in the Job Settings Wizard by going to the OCR step and checking “Enable full-page OCR”. The OCR step has many settings that you can use to customize the output and the recognition of images.
SimpleIndex has two different OCR engines (Standard and Professional) that can be used to produce PDF Image + Text files, also known as searchable PDFs.
- Published in Export, OCR, Office PDF Text Processing