OCR Options

From Simple Wiki
Revision as of 23:07, 6 January 2022 by Cattieb88
SimpleIndex Simple Setup Configuration Wizard OCR Jobs Steps

The settings on this page determine how full-page OCR text is processed and define the Zone OCR settings that apply to all OCR fields. Screen Shot OCR is also configured here.

Enable Full-Page OCR

Selecting this option will cause all image files in the batch to be OCR’ed. The entire file is processed, generating full-text data that can be used for auto-indexing and full-text searching.

It is also possible to have the OCR results output to MS Word or HTML formats that can be edited in a word processor.

Skip OCR if Text Exists

When enabled, any file that already has embedded text will skip full-page OCR. This includes native and Image+Text PDF files, and images with an accompanying TXT file.

Fast OCR

Check this option to use a faster but less accurate OCR analysis on images. For high-quality images using standard fonts, Fast OCR provides comparable accuracy with much faster processing speed. In many cases, using this option saves time overall, even if a few more files require manual correction.

Full-Page OCR File Type

Select the type of file that is output by the full-page OCR engine. If you have OCR zones defined, this must be set to TEXT. The other options are WORD and HTML.

OCR Engine

The Standard license includes the Tesseract OCR engine. The FineReader OCR engine is available with the Professional license or as an add-on. Though FineReader performs better in virtually all cases, you may need to select Tesseract manually when using a Professional workstation to develop jobs that will run under a Standard license.

If a job is configured for FineReader but is run on a Standard license, the OCR engine will switch to Tesseract automatically.
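The fallback rule can be pictured as a small decision function. This is only an illustrative sketch: `effective_engine` and the `finereader_licensed` flag are hypothetical names, not part of SimpleIndex, and the flag is assumed to be true when FineReader is available through a Professional license or the add-on.

```python
def effective_engine(configured_engine, finereader_licensed):
    """Return the engine a job would actually use (hypothetical helper)."""
    # A FineReader job without a FineReader license falls back to Tesseract.
    if configured_engine == "FineReader" and not finereader_licensed:
        return "Tesseract"
    # Otherwise the configured engine is used as-is.
    return configured_engine


print(effective_engine("FineReader", finereader_licensed=False))
print(effective_engine("Tesseract", finereader_licensed=True))
```

Under this reading, manually selecting Tesseract on a Professional workstation lets you test a job against the same engine a Standard license would end up using.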

Output Full-Page OCR Files

When this option is checked, full-page OCR text is written to text files using the same folder and filename scheme as the images. If unchecked, no text files are created. Text from MS Office and PDF files is also saved as text when this option is selected.
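As a rough illustration of "same folder and filename scheme", the sketch below maps an image path to the text file that would sit beside it. The `.txt` extension, the sample path, and the helper name `ocr_text_path` are assumptions made for this example rather than documented SimpleIndex behavior.

```python
from pathlib import PureWindowsPath


def ocr_text_path(image_path):
    """Return a sibling path with the same folder and base name, ending in .txt."""
    # with_suffix swaps only the extension; folder and file name stay the same.
    return PureWindowsPath(image_path).with_suffix(".txt")


# Hypothetical batch file used only for illustration.
print(ocr_text_path(r"C:\Scans\Batch1\00000001.tif"))
```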

OCR Language

Select the default language for OCR text. By default, only English, French, Spanish, Italian, and German are installed. Additional languages can be provided on request.

Append during OCR to Field

By default, the OCR to Field option automatically advances to the next field after you draw a zone. Select this option to keep the cursor in the selected field so additional text can be added. This is useful for capturing data from multiple lines or regions into the same field.

Zoom to Zone when Field is Selected

Zoom locking causes the image to zoom in on the zone automatically when a field is selected. A field with no zone coordinates will not zoom; OCR fields, however, must always have coordinates. Zoom locking can be helpful when reviewing OCR results or keying in handprint data. Disable zoom locking if you prefer to keep your selected zoom level on each field.

Page Break Text

This value is output after each OCR page to mark page breaks in the text file. The same string is also used to parse the OCR output when creating searchable PDF files. For that reason, the value should be something that will never occur naturally within the text of your documents.
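To see why the marker must be unique, consider how a consumer of the text file might split it back into pages. The sketch below is a minimal illustration, not SimpleIndex's own parser; the marker `<<<PAGE_BREAK>>>` and the function `split_pages` are hypothetical.

```python
# Hypothetical marker; substitute whatever value is entered in Page Break Text.
PAGE_BREAK = "<<<PAGE_BREAK>>>"


def split_pages(ocr_text, marker=PAGE_BREAK):
    """Split full-page OCR output into one string per page."""
    pages = ocr_text.split(marker)
    # The marker follows every page, so the final piece is normally empty.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return [page.strip() for page in pages]


sample = "Invoice 1001\n" + PAGE_BREAK + "\nTerms and conditions\n" + PAGE_BREAK + "\n"
print(split_pages(sample))  # ['Invoice 1001', 'Terms and conditions']
```

If the marker also appeared inside a page's own text, a split like this would produce an extra, incorrect page boundary at that point.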

Spaces to Strip

This option allows you to modify the default behavior for space trimming on OCR fields. It affects all OCR fields where "Strip spaces from result" is selected. Add the numbers for all space trimming options you want to enable and enter the total. For example, enter 3 (1 + 2) to remove spaces and line feeds from both the beginning and the end.

1 - Remove spaces & line feeds from beginning
2 - Remove spaces & line feeds from end
4 - Remove all spaces and tabs
8 - Remove all line feeds
16 - Convert all blank space to single spaces
32 - Remove all non-alphanumeric characters except spaces
64 - Replace line feeds with <lf>
128 - Remove all non-alpha
256 - Remove all non-numbers
16384 - Run Trim function after template matching
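Because each option is a distinct power of two, the total is effectively a set of bit flags. The sketch below shows one way to compute a total and to check which options a given total enables. The numeric values come from the list above; the option names and both helper functions are hypothetical labels invented for this example.

```python
# Values are taken from the option list; the key names are made up for readability.
STRIP_FLAGS = {
    "trim_start": 1,
    "trim_end": 2,
    "remove_spaces_tabs": 4,
    "remove_line_feeds": 8,
    "collapse_blank_space": 16,
    "remove_non_alphanumeric": 32,
    "line_feeds_to_lf_tag": 64,
    "remove_non_alpha": 128,
    "remove_non_numbers": 256,
    "trim_after_template_match": 16384,
}


def spaces_to_strip_total(*names):
    """Add the values of the chosen options (duplicates are counted once)."""
    return sum(STRIP_FLAGS[name] for name in set(names))


def decode_total(total):
    """List the options that a given total enables."""
    return [name for name, value in STRIP_FLAGS.items() if total & value]


total = spaces_to_strip_total("trim_start", "trim_end", "collapse_blank_space")
print(total)              # 19
print(decode_total(19))   # ['trim_start', 'trim_end', 'collapse_blank_space']
```

Decoding an existing total this way can help when reviewing a job someone else configured.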

Screen Shot OCR Window Name

Screen Shot OCR lets you index documents using data that appears on the screen of another application. If this value is set, a screen capture of the specified window will be taken automatically when you select OCR the Clipboard from the Process menu. The screen capture is then processed using the OCR job settings.

To enable this feature, enter a unique portion of the window title (the text that appears at the top of the window with the application name). Many applications modify the title when you open a document or perform certain functions, so be sure to use the portion of the title that remains constant and uniquely identifies the window among all open applications.
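The sketch below illustrates how a title fragment can be checked for uniqueness against the windows that are open. It is an assumption-laden example: the window titles are invented, `find_matching_windows` is a hypothetical helper, and case-insensitive substring matching is assumed rather than confirmed SimpleIndex behavior.

```python
def find_matching_windows(title_fragment, open_window_titles):
    """Return every open window title that contains the fragment."""
    fragment = title_fragment.lower()  # case-insensitive matching is an assumption
    return [title for title in open_window_titles if fragment in title.lower()]


# Invented titles standing in for the windows open on a workstation.
titles = [
    "Invoice 1001 - Acme Billing",
    "Acme Billing Help - Web Browser",
    "Untitled - Notepad",
]

print(find_matching_windows("Acme Billing", titles))    # two matches: ambiguous
print(find_matching_windows("- Acme Billing", titles))  # one match: unique
```

In this example, "Invoice 1001" would also match a single window, but only until another document is opened, which is why the constant part of the title is the better choice.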

Use Clipboard/Screen Shot OCR Only

Check this box to disable OCR during batch processing and use OCR settings only for clipboard, screen shot, or manual OCR.