OCR Options

The settings on this page determine how full-page OCR text is processed and define the Zone OCR settings that apply to all OCR fields. Screen Shot OCR is also configured here.

Enable Full-Page OCR

Selecting this option will cause all image files in the batch to be OCR’ed. The entire file is processed, generating full-text data that can be used for auto-indexing and Full-Text Searching.

It is also possible to output the OCR results to a number of additional File Formats, such as word processor, spreadsheet, or e-book formats.

Skip OCR if Text Exists

When enabled, any file that already has embedded text skips full-page OCR (native and Image+Text PDF files, and images with a TXT file). This prevents SimpleIndex from running lengthy full-page OCR on files imported from the Input folder or Email that have already been OCRed or were generated with electronic text.
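The skip decision for the "images with a TXT file" case can be sketched roughly as follows. This is only an illustration, not SimpleIndex's actual logic: the function names are hypothetical, and it assumes the TXT file sits in the same folder as the image and shares its base name. Detecting embedded text inside PDF files is not shown.

```python
from pathlib import Path


def has_sidecar_text(image_path: str) -> bool:
    """Return True if a non-empty .txt file with the same base name exists.

    Assumption: the text file lives next to the image (scan001.tif -> scan001.txt).
    """
    sidecar = Path(image_path).with_suffix(".txt")
    return sidecar.is_file() and sidecar.stat().st_size > 0


def should_run_full_page_ocr(image_path: str, skip_if_text_exists: bool) -> bool:
    """Hypothetical decision helper mirroring the 'Skip OCR if Text Exists' option."""
    if skip_if_text_exists and has_sidecar_text(image_path):
        return False  # text already exists, so full-page OCR is skipped
    return True
```

With the option turned off, the helper always returns True, so every image is processed regardless of existing text.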

Fast OCR

Check this option to use a faster but less accurate OCR analysis on images. For high-quality images using standard fonts, Fast OCR provides comparable accuracy with much faster processing speed. In many cases, it is much faster to use this option even if a few more files require manual correction. This is only available with the Professional OCR (FineReader) engine.

Enable MRC Compression

Significantly reduces the size of PDF files with minimal impact on image quality. The Professional/Advance OCR is required to use this function. For more details about MRC technology, see the MRC Wikipedia page.

Full-Page OCR File Type

Select the type of file that is output by the full-page OCR engine.

The available OCR File Formats depend on whether you are using Tesseract, FineReader, SimpleOCR or Cloud OCR.

OCR Engine

The Standard license includes the Tesseract OCR engine. The FineReader OCR engine is available with the Professional license or as an add-on.

Though FineReader performs better in virtually all cases, it may be necessary to select Tesseract manually to develop jobs for use with the Standard license on a Professional workstation.

If a job is configured for FineReader but is run on a Standard license, the OCR engine will switch to Tesseract automatically.
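The fallback rule described above can be pictured with a small sketch. The function and parameter names are hypothetical and only model the behavior stated here: a job configured for FineReader runs with Tesseract when FineReader is not licensed on the workstation.

```python
def resolve_engine(configured_engine: str, finereader_available: bool) -> str:
    """Return the engine a job would actually run with.

    Assumption: 'finereader_available' is True for a Professional license
    or a Standard license with the FineReader add-on.
    """
    if configured_engine == "FineReader" and not finereader_available:
        return "Tesseract"  # automatic fallback on a Standard license
    return configured_engine
```

A job explicitly set to Tesseract is unaffected by the license, which is why selecting Tesseract manually is a reliable way to preview Standard-license results on a Professional workstation.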

The AWSText, AWSForms, and AWSInvoice engines all use the Cloud OCR feature to provide enhanced text, handwriting, and field extraction using the Amazon AWS Textract service.

AWS Creds

This button opens a dialog for entering the Amazon credentials that connect your Amazon account to the AWSText, AWSForms, and AWSInvoice engines in SimpleIndex, enabling Textract processing. Page usage is tracked on that account, and a monthly fee is charged for the pages used.

Amazon Credential Requirements:

AWS Region
AWS Access Key ID
AWS Secret Access Key

You can find more about Textract and how to connect the Amazon account on the Cloud OCR page.

Output Full-Page OCR Files

When this option is checked, full-page OCR text is written to text files using the same folder and filename scheme as the images.

If unchecked, no text files are created. When this option is selected, text from MS Office and PDF files is also saved as text files.

OCR Language

Select the default language for OCR text. The languages that can be selected depend on whether you are using Tesseract, FineReader, SimpleOCR or Cloud OCR.

Output zone OCR data to text files

When this setting is checked, the Zone OCR data extracted from the pages is written to a text (.txt) file and saved to the Output folder.

Append during OCR to Field

By default, the OCR to Field option automatically advances to the next field after you draw a zone. Select this option to keep the cursor in the selected field so additional text can be added. This is useful for capturing data from multiple lines or regions into the same field.

Zoom to Zone when Field is Selected

Zoom locking causes the image to be zoomed in on the zone automatically when a field is selected. If no coordinates are indicated, a field will not zoom. However, OCR fields must always have coordinates. Zoom locking can be helpful when reviewing OCR results or keying in handprint data. Disable zoom locking if you prefer to keep your selected zoom on each field.

Page Break Text

The following value is output after each OCR page to indicate page breaks in the text file. This string is also used to parse the OCR output when creating searchable PDF files. In this case, the value should be something that will never occur naturally within the text of your documents.
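To see why the page break value should never occur naturally in your documents, consider this minimal sketch of writing and re-parsing multi-page OCR text. The marker string and function names are hypothetical, and the exact layout SimpleIndex uses around the marker is an assumption.

```python
PAGE_BREAK = "<<<PAGE-BREAK>>>"  # hypothetical marker, not a SimpleIndex default


def join_pages(pages, page_break=PAGE_BREAK):
    """Write each page's text followed by the page break marker."""
    return "".join(page + "\n" + page_break + "\n" for page in pages)


def split_pages(text, page_break=PAGE_BREAK):
    """Recover the per-page text by splitting on the marker."""
    parts = text.split(page_break)
    # The marker follows every page, so the piece after the last marker is empty.
    if parts and parts[-1].strip() == "":
        parts.pop()
    return [part.strip("\n") for part in parts]
```

If a page happened to contain the marker text itself, `split_pages` would return more pages than were written, which is the failure an unusual marker string avoids.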

Spaces to Strip

This option allows you to modify the default behavior for space trimming on OCR fields. It affects all OCR fields where Strip spaces from result is selected.

The most common options are 'Remove all spaces and tabs' combined with 'Remove all line feeds' to remove all whitespace from the result.

'Convert all blank space to single spaces' is useful if the space is needed to distinguish between values but there can be a variable number of spaces between elements, such as a label and field value on a form.

Click the 'Set' button to display the 'Replace Spaces and Line Breaks' dialog. Select one or more from the following options:

Remove spaces & line feeds from beginning
Remove spaces & line feeds from end
Remove all spaces and tabs
Remove all line feeds
Convert all blank space to single spaces
Remove all non-alphanumeric characters except spaces
Replace line feeds with <lf>
Remove all non-alpha
Remove all non-numbers
Run Trim function after template matching
Perform replacements AFTER template matching
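As a rough illustration of what a few of these options do to an OCR result, the sketch below approximates them with regular expressions. The function names are hypothetical, and SimpleIndex's exact handling (for example of Unicode whitespace or the order in which options are applied) may differ.

```python
import re


def remove_spaces_and_tabs(text: str) -> str:
    """Approximates 'Remove all spaces and tabs'."""
    return re.sub(r"[ \t]", "", text)


def remove_line_feeds(text: str) -> str:
    """Approximates 'Remove all line feeds' (carriage returns included)."""
    return re.sub(r"[\r\n]", "", text)


def collapse_blank_space(text: str) -> str:
    """Approximates 'Convert all blank space to single spaces'."""
    return re.sub(r"\s+", " ", text)


def keep_alphanumeric_and_spaces(text: str) -> str:
    """Approximates 'Remove all non-alphanumeric characters except spaces'."""
    return re.sub(r"[^A-Za-z0-9 ]", "", text)


raw = "  Invoice No:\t 10 42\r\n  "
print(remove_line_feeds(remove_spaces_and_tabs(raw)))  # InvoiceNo:1042
print(collapse_blank_space("Name:    John\n  Smith"))  # Name: John Smith
print(keep_alphanumeric_and_spaces("PO# 12-345"))       # PO 12345
```

The first call combines the two most common options to strip all whitespace; the second shows how collapsing keeps a single separator between a label and its value.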

Screen Shot OCR Window Name

Screen Shot OCR lets you index documents using data that appears on the screen of another application. If this value is set, a screen capture of the specified window will be taken automatically when you select 'OCR the Clipboard' from the Process menu. The screen capture is then processed using the OCR job settings.

To enable this feature, enter a unique portion of the window title (the text that appears at the top of the window with the application name). Many applications modify the title when you open a document or perform certain functions, so be sure to use the portion of the title that remains constant and uniquely identifies the window among all open applications.
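Choosing a good title fragment can be checked with a small sketch like the one below. It is not how SimpleIndex locates the window; the function name is hypothetical and the case-insensitive comparison is an assumption. It only shows why the fragment should match exactly one open window.

```python
def matching_windows(titles, fragment):
    """Return every window title that contains the fragment.

    Assumption: matching ignores letter case.
    """
    needle = fragment.lower()
    return [title for title in titles if needle in title.lower()]


open_titles = [
    "Invoice 1001 - Acme Billing",  # document name changes, application name does not
    "Untitled - Notepad",
    "Report.docx - Word",
]

print(matching_windows(open_titles, "Acme Billing"))  # one match: a usable value
print(len(matching_windows(open_titles, " - ")))      # 3 matches: too generic
```

A fragment such as "Invoice 1001" would stop matching as soon as another document is opened, so the constant application portion of the title is the safer choice.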

Use Clipboard/Screen Shot OCR Only

Check this box to disable OCR during batch processing and use OCR settings only for clipboard, screen shot, or manual OCR.

Creating OCR Configurations Training Video

This video takes a look under the hood of the Zone OCR sample job to see how it is configured. Learn to draw OCR zones and create basic templates.

Next Step: Barcode Options