OCR Options

Setup Job Configuration OCR Options Screen

The settings on this page determine how full-page OCR text is processed and define the Zone OCR settings that apply to all OCR fields. Screen Shot OCR is also configured here.

Enable Full-Page OCR

Selecting this option causes all image files in the batch to be OCRed. The entire file is processed, generating full-text data that can be used for auto-indexing and Full-Text Searching.

The OCR results can also be output to a number of additional File Formats, such as word processor, spreadsheet, or e-book formats.

Skip OCR if Text Exists

When enabled, any file that already has embedded text skips full-page OCR (native and Image+Text PDF files, and images with an accompanying TXT file). This prevents SimpleIndex from running lengthy full-page OCR on files imported from the Input folder or Email that have already been OCRed or were generated with electronic text.
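
The skip decision can be pictured roughly as follows. This is a minimal sketch, not SimpleIndex's actual logic: the function names are hypothetical, and only the "image with a TXT file" case is checked (detecting embedded text in a PDF would need a PDF library, which is left out here).

```python
from pathlib import Path


def has_sidecar_text(image_path: Path) -> bool:
    """Return True if a non-empty .txt file sits next to the image (hypothetical check)."""
    txt_path = image_path.with_suffix(".txt")
    return txt_path.is_file() and txt_path.stat().st_size > 0


def should_run_full_page_ocr(image_path: Path, skip_if_text_exists: bool) -> bool:
    """Decide whether full-page OCR is needed for one file."""
    if skip_if_text_exists and has_sidecar_text(image_path):
        return False  # text is already available, so the lengthy OCR pass is skipped
    return True
```

With the option turned off, every file is OCRed regardless of any existing text.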

Fast OCR

Check this option to use a faster but less accurate OCR analysis on images. For high-quality images using standard fonts, Fast OCR provides comparable accuracy with much faster processing speed. In many cases, it is much faster to use this option even if a few more files require manual correction. This is only available with the Professional OCR (FineReader) engine.

Enable MRC Compression

This option significantly reduces the size of PDF files with minimal impact on image quality. The Professional/Advanced OCR engine is required to use this function. For more details about MRC (Mixed Raster Content) technology, see the MRC Wikipedia page.

Full-Page OCR File Type

Select the type of file that is output by the full-page OCR engine.

The available OCR File Formats depend on whether you are using Tesseract, FineReader, SimpleOCR or Cloud OCR.

OCR Engine

The Standard license includes the Tesseract OCR engine. The FineReader OCR engine is available with the Professional license or as an add-on.

Though FineReader performs better in virtually all cases, it may be necessary to select Tesseract manually to develop jobs for use with the Standard license on a Professional workstation.

If a job is configured for FineReader but is run on a Standard license, the OCR engine will switch to Tesseract automatically.
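
The automatic fallback described above amounts to a simple rule. The sketch below is illustrative only; the function name and the `finereader_available` flag are hypothetical and do not correspond to any documented SimpleIndex setting.

```python
def effective_engine(configured_engine: str, finereader_available: bool) -> str:
    """Return the engine that would actually run for a job (illustrative only)."""
    if configured_engine == "FineReader" and not finereader_available:
        return "Tesseract"  # fallback when the license does not include FineReader
    return configured_engine
```

This is why a job developed on a Professional workstation should be tested with Tesseract selected if it will ultimately run on a Standard license.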

The AWSText, AWSForms, and AWSInvoice engines all use the Cloud OCR feature to provide enhanced text, handwriting, and field extraction using the Amazon AWS Textract service.

AWS Creds

This button lets you enter the Amazon credentials that connect your Amazon account to the AWSText, AWSForms, and AWSInvoice engines in SimpleIndex, enabling Textract processing. The number of pages processed is tracked on that account, and a monthly fee is charged for the pages used.

Amazon Credential Requirements:

  • AWS Region
  • AWS Access Key ID
  • AWS Secret Access Key

For more about Textract and how to connect the Amazon account, see the Cloud OCR page.
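
For readers who want to check the same three values outside SimpleIndex, the sketch below gathers them into the keyword arguments that the boto3 library's `client` function accepts. The helper name `textract_client_kwargs` is hypothetical, and this is not how SimpleIndex stores or uses the credentials internally.

```python
def textract_client_kwargs(region: str, access_key_id: str, secret_access_key: str) -> dict:
    """Collect the three credential values into boto3-style keyword arguments."""
    values = {
        "region_name": region,
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
    }
    # Reject blank entries early, since all three values are required.
    missing = [name for name, value in values.items() if not value.strip()]
    if missing:
        raise ValueError("Missing credential values: " + ", ".join(missing))
    return values


# With boto3 installed, the result could be passed on like this (not run here):
#   import boto3
#   client = boto3.client("textract", **textract_client_kwargs(region, key_id, secret))
```

The placeholder arguments in the comment stand for your own region, Access Key ID, and Secret Access Key.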

Output Full-Page OCR Files

When this option is checked, full-page OCR text is written to text files using the same folder and filename scheme as the images.

If unchecked, no text files are created. When this option is selected, text from MS Office and PDF files is also saved as text.

OCR Language

Select the default language for OCR text. The languages that can be selected depend on whether you are using Tesseract, FineReader, SimpleOCR or Cloud OCR.

Output zone OCR data to text files

When checked, this setting writes the Zone OCR data extracted from the pages to a text (.txt) file and saves it to the Output folder.

Append during OCR to Field

By default, the OCR to Field option automatically advances to the next field after you draw a zone. Select this option to keep the cursor in the selected field so additional text can be added. This is useful for capturing data from multiple lines or regions into the same field.

Zoom to Zone when Field is Selected

Zoom locking causes the image to be zoomed in on the zone automatically when a field is selected. If no coordinates are indicated, a field will not zoom. However, OCR fields must always have coordinates. Zoom locking can be helpful when reviewing OCR results or keying in handprint data. Disable zoom locking if you prefer to keep your selected zoom on each field.

Page Break Text

This value is written after each OCR page to mark page breaks in the text file. The same string is also used to parse the OCR output when creating searchable PDF files, so it should be something that will never occur naturally within the text of your documents.
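
The reason the page break string must be unique can be seen in a small sketch. The marker value and function names below are hypothetical, and the exact way SimpleIndex writes and parses the text is an assumption.

```python
PAGE_BREAK = "<<<PAGE BREAK>>>"  # hypothetical marker value


def join_pages(pages, page_break=PAGE_BREAK):
    """Write each page's text followed by the page break marker."""
    return "".join(f"{page}\n{page_break}\n" for page in pages)


def split_pages(text, page_break=PAGE_BREAK):
    """Recover the per-page text by splitting on the marker."""
    parts = text.split(page_break)
    # The marker follows every page, so the piece after the last marker is empty.
    if parts and not parts[-1].strip():
        parts.pop()
    return [part.strip("\n") for part in parts]
```

If the marker also appears inside a page's own text, `split_pages` returns more pages than were written, which is the failure a unique value avoids.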

Spaces to Strip

This option allows you to modify the default behavior for space trimming on OCR fields. It affects all OCR fields where 'Strip spaces from result' is selected.

The most common options are 'Remove all spaces and tabs' combined with 'Remove all line feeds' to remove all whitespace from the result.

'Convert all blank space to single spaces' is useful if the space is needed to distinguish between values but there can be a variable number of spaces between elements, such as a label and field value on a form.

Click the 'Set' button to display the 'Replace Spaces and Line Breaks' dialog. Select one or more from the following options:

  • Remove spaces & line feeds from beginning
  • Remove spaces & line feeds from end
  • Remove all spaces and tabs
  • Remove all line feeds
  • Convert all blank space to single spaces
  • Remove all non-alphanumeric characters except spaces
  • Replace line feeds with <lf>
  • Remove all non-alpha
  • Remove all non-numbers
  • Run Trim function after template matching
  • Perform replacements AFTER template matching
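
A few of the options listed above can be approximated with regular expressions. This is only a sketch: the function names are hypothetical, and the exact characters SimpleIndex treats as spaces or line feeds (for example, whether carriage returns count as line feeds) are assumptions.

```python
import re


def remove_spaces_and_tabs(text: str) -> str:
    """Approximates 'Remove all spaces and tabs'."""
    return re.sub(r"[ \t]+", "", text)


def remove_line_feeds(text: str) -> str:
    """Approximates 'Remove all line feeds' (carriage returns included by assumption)."""
    return re.sub(r"[\r\n]+", "", text)


def collapse_blank_space(text: str) -> str:
    """Approximates 'Convert all blank space to single spaces'."""
    return re.sub(r"\s+", " ", text)


raw = "  Invoice No:\t 12 345\r\n"
no_whitespace = remove_line_feeds(remove_spaces_and_tabs(raw))  # "InvoiceNo:12345"
single_spaced = collapse_blank_space("Total:     42\n  USD")     # "Total: 42 USD"
```

The first combination suits values such as account numbers, while the single-space conversion keeps a separator between a label and its value.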

Screen Shot OCR Window Name

Screen Shot OCR lets you index documents using data that appears on the screen of another application. If this value is set, a screen capture of the specified window is taken automatically when you select 'OCR the Clipboard' from the Process menu. The screen capture is then processed using the OCR job settings.

To enable this feature, enter a unique portion of the window title (the text that appears at the top of the window with the application name). Many applications modify the title when you open a document or perform certain functions, so be sure to use the portion of the title that remains constant and uniquely identifies the window among all open applications.
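
One way to check a candidate value is to test it against the titles of the windows currently open. The sketch below assumes simple case-sensitive substring matching, which may differ from what SimpleIndex does; the function names and window titles are made up for illustration.

```python
def matching_windows(fragment: str, open_titles: list[str]) -> list[str]:
    """Return every window title that contains the fragment."""
    return [title for title in open_titles if fragment in title]


def is_unique_fragment(fragment: str, open_titles: list[str]) -> bool:
    """A usable fragment identifies exactly one open window."""
    return len(matching_windows(fragment, open_titles)) == 1


titles = ["Invoice 1001 - Acme Billing", "Inbox - Mail", "Invoice draft - Notepad"]
```

Here "Acme Billing" is a good choice because it stays constant and matches one window, whereas "Invoice" matches two windows and "Invoice 1001" would stop matching as soon as a different invoice is opened.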

Use Clipboard/Screen Shot OCR Only

Check this box to disable OCR during batch processing and use OCR settings only for clipboard, screen shot, or manual OCR.

OCR Training Video

The video was recorded in a previous version of SimpleIndex. Refer to the wiki documentation for the latest updates.
