Index Field Wizard

Revision as of 02:58, 7 January 2022

Use the Add button to create a new index field, or select an existing field and click Edit to modify its settings.

Field Type

The field type determines which of the following screens will be displayed for advanced settings. Field types determine which data will be accepted by the field and which automation will be used to read the index value from documents.

Autofill

All fields of this type are automatically populated with values from your database once the Key Field has been entered. The Template setting for this field must be set to the name of the corresponding field in your database.

Autonumber

Allows you to have a field with a numeric value that will increment automatically under certain conditions. The Template value for this field determines the seed number, which can be any combination of letters and numbers, as long as the last character is numeric. Based on the value of the Autonumber Increment setting, the Autonumber can be set to increment every page, every blank page, every barcode, or at the end of each batch.

Barcode

If a barcode is recognized, the value is inserted into this field. Use the Template setting to force the field to accept only barcodes that match the specified pattern. This also allows you to match multiple barcodes to their appropriate fields, and ignore barcodes that are not meant to be used as indexes. Use the Barcode tab to configure other barcode settings.

Date

Field is formatted as a date in YYYY-MM-DD format by default. This allows for use of dates in folders and filenames and for proper sorting. For more information see the Template and Date Formatting options. Valid dates from Fixed field templates can also be used.

Filename

Field is automatically populated with the original filename of the image from the Input folder. Does not include the input file path.

Fixed

The calculated value from the Template setting is used. Many variables are available to insert values automatically, such as file property settings, all or part of the file and folder names, combinations of other field values, and system settings like the user ID and computer name.

With a Fixed field the user cannot change the calculated value. Use a Text, Numbers or Date field to allow the user to modify a calculated value.

List

Possible index values are displayed in a drop-down list, allowing the user to select one or automatically fill in the field with matching records as they type. The list may be populated using either a text file or database.

To populate the list with a text file, create a file in Notepad that has a single entry on each line and enter the path to this text file in the List File/Field setting. If no text file is specified and you have a database configured, the list for this field is populated automatically with the values from the corresponding database field.

Numbers

Only numeric values are accepted. Valid numbers from Fixed field templates can also be used.

OCR

If an OCR value is recognized, it is inserted into this field. Use the Template setting with this field type to search the OCR region for the first string that matches the pattern. Use the List File/Field option to match OCR text against a list of possible values. Use the Zones & OCR tab to configure other OCR settings.

OMR

Use this type for check-box fields. Field is considered “checked” if the number of black pixels in the region is greater than the number entered in the Template setting.

OMR fields can also be used to extract a region from an image and save it to a separate file. Enter a negative number in the Template setting to save the region to a separate file if the number of black pixels is greater than the absolute value of this number.

Template

Forces the user to enter an index value that matches the pattern specified in the Template setting for this field. See Template Control for the formatting instructions.

Text

User may enter any text into the index field. Template setting is used as a default value. Fixed field templates can also be used to use a calculated value for the default.

Index Field

Enter the name or label to use to identify the field. File naming options can be selected here, but these options are more easily configured from the Index & File Naming screen so you can see how they interact with the other index fields.

For OCR and Barcode fields, the Text Matching Type option will be displayed. Select the desired option to display the corresponding wizard page in the following step.

When you select Both, the template will be matched first and then the dictionary list is matched against the template search result. This can prevent false positives when dictionary terms can appear in other places on the document.

If a data source has been configured, the Database Mapping options will be displayed. Select the corresponding field in the database to use for data export.

Required

When this option is selected, the user will not be able to finalize a batch unless all images have been saved with a value for this field.

Folder

This option uses the index value to create subfolders in the Output folder. If multiple folder fields are selected, nested subfolders are created in order from top to bottom.

Filename

When this option is selected, the image files are renamed using this index field value. If multiple fields have this option checked, the filename will contain all the values in order, separated by the Field Separator character.
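How the Folder and Filename options combine can be sketched as follows. This is only an illustration, not SimpleIndex's own code: the function name, the tuple layout, the "_" separator default, and the ".tif" extension are all assumptions made for the example.

```python
from pathlib import PurePosixPath


def build_output_path(output_folder, fields, separator="_", extension=".tif"):
    """Build an output path from index fields.

    fields: (value, is_folder, is_filename) tuples in top-to-bottom order.
    The separator and extension defaults are placeholders for this sketch.
    """
    # Folder fields become nested subfolders, in field order.
    folders = [value for value, is_folder, _ in fields if is_folder]
    # Filename fields are joined with the Field Separator character.
    name_parts = [value for value, _, is_filename in fields if is_filename]
    filename = separator.join(name_parts) + extension
    return str(PurePosixPath(output_folder, *folders, filename))


fields = [
    ("Acme", True, False),     # Folder option checked
    ("2022", True, False),     # Folder option checked
    ("INV001", False, True),   # Filename option checked
    ("Page1", False, True),    # Filename option checked
]
print(build_output_path("Output", fields))  # Output/Acme/2022/INV001_Page1.tif
```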

Forward

This option “carries forward” the field value to subsequent images until a new saved value is encountered. Use this to index multi-page documents without having to re-type the index data for each page. When unchecked, each page must be indexed individually.

When using coversheets created with SimpleCoversheet or another barcode application, the forward option will automatically apply the barcode values to all the pages between the coversheets.
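The carry-forward behavior can be pictured with a short sketch. It assumes each page either has a saved value or an empty string; the function name is hypothetical and the real application may track state differently.

```python
def carry_forward(page_values):
    """Fill empty page values with the most recent saved value."""
    filled = []
    current = ""
    for value in page_values:
        if value:              # a new saved value replaces the carried one
            current = value
        filled.append(current)
    return filled


# A coversheet barcode on pages 1 and 4 is applied to the pages that follow.
print(carry_forward(["A100", "", "", "B200", ""]))
# ['A100', 'A100', 'A100', 'B200', 'B200']
```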

Database Mapping

Use these settings to map the index field to a field in your database. Depending on the selected Database Mode, records will be added, modified or searched, and List fields will be populated with unique records from this field.

Database Field Name

Select the database fields that correspond to the fields you define under the Index tab. If there is no corresponding database field, then leave this value blank.

Editable

This option is only used in Update mode. For each field, select this option if you want to allow the user to edit the values in this field. Leave it unchecked if you want to use the existing values for reference or file naming only and not allow the user to modify its value.

Filter

This option allows you to define default search criteria for Retrieval and Update modes. Whenever the search screen is displayed, the values entered here are displayed in the search criteria for that field. This makes it possible to add default filters to automatically search a certain subset of documents or make it easier to perform searches by partially filling search fields.

Zones & Advanced OCR Settings

Zones can be used for OCR, OMR and Barcode fields to define a region on the image that contains the field data. Zones can also be used to automatically zoom in on a region of the image when the field is selected.

Setting Zone Coordinates

Click the Set Zone button to set the zone coordinates for this field. This will show the Draw Field Zone window.

To set the zone, click the Open or Scan button to obtain a sample image. Click and drag the mouse to draw a box around the region you want to use for this field.

For multi-page files, use the Page buttons to change to the page you want or enter the page number in the box.

It is also possible to perform zone OCR on the last page of each document by entering a negative number for the Page on the wizard screen. Set to -1 to OCR the last page, or -2 for the next to last page, etc.

When finished, click Save to keep the new zone coordinates or Cancel to discard.
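The negative Page values described above count back from the end of the document. A minimal sketch of that arithmetic, assuming 1-based page numbers and a hypothetical helper name:

```python
def resolve_page(page_setting: int, page_count: int) -> int:
    """Translate a Page setting into a 1-based page number.

    Negative settings count from the end: -1 is the last page,
    -2 the next to last, and so on.
    """
    if page_setting < 0:
        return page_count + 1 + page_setting
    return page_setting


print(resolve_page(-1, 10))  # 10
print(resolve_page(-2, 10))  # 9
```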

Text Source

Zone coordinates can indicate pixel coordinates in an image, or row and column numbers in a text file. Set the Text Source to Use Full Page Text to use existing text from PDF files, MS Office documents and full page OCR as the source text for this field.

You can also pick another field from the list to use that field's value as the source text for this field. This lets you capture a large block of text like an address block with Zone OCR, then set up fields for Name, Address, City, State, etc. that use the address block field as the Text Source.

Use the X, Y coordinates to indicate a row and column within the source text. Use Width and Height to indicate the number of columns and rows to capture. Entering 0 for all four values will search the entire file.
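To make the row and column idea concrete, here is a rough sketch of cutting a rectangular block out of plain text. It assumes X is the starting column and Y the starting row, both counted from zero; the page does not say whether the application counts from 0 or 1, and the function name is made up for the example.

```python
def capture_text_zone(text, x, y, width, height):
    """Return the block of text at column x, row y, sized width by height.

    Coordinates are zero-based here, which is an assumption of this sketch.
    All zeros means "use the entire text".
    """
    if x == 0 and y == 0 and width == 0 and height == 0:
        return text
    rows = text.splitlines()[y : y + height]
    return "\n".join(row[x : x + width] for row in rows)


sample = "Name: Ann\nCity: Rome\nZip: 00100"
print(capture_text_zone(sample, 6, 0, 4, 2))
# Ann
# Rome
```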

Advanced OCR Field Settings

These settings let you format the OCR results prior to dictionary and template matching. This allows you to perform various text replacements, remove invalid characters, and standardize spacing and letter case.

Pages to Process

Using this option, you may limit the OCR to only certain pages within the batch. This option greatly speeds up the OCR process if you know the location of those pages in the batch that contain the index information you need. The options are:

Every Page – all pages are processed.
First Page Only – only the first page in the batch is processed.
Pages with Barcodes – only a page where a barcode is detected is processed. Use the Template and zone features to prevent detection of stray barcodes.
Pages After Barcode – use this option with separator sheets, like the ones created by SimpleCoversheet, where the first page of the document comes after a barcode separator sheet.
Pages After Blank – use this option with blank page separators to indicate the start of a new document on the following page.
Odd Pages – OCR only odd-numbered pages (1, 3, 5, etc.)
Even Pages – OCR only even-numbered pages (2, 4, 6, etc.)
Pages without Barcodes – only pages where a barcode is not detected are processed. Useful for capturing the same field value with OCR when a barcode is not present or unreadable.
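The options above can be read as simple page filters. The sketch below is one way to express them, assuming you already know which pages carry a barcode and which are blank; the mode names and the function are invented for illustration.

```python
def select_pages(mode, has_barcode, is_blank):
    """Return the 1-based page numbers that a Pages to Process mode would OCR.

    has_barcode and is_blank are lists of booleans, one entry per page.
    """
    rules = {
        "every": lambda i: True,
        "first": lambda i: i == 0,
        "with_barcodes": lambda i: has_barcode[i],
        "after_barcode": lambda i: i > 0 and has_barcode[i - 1],
        "after_blank": lambda i: i > 0 and is_blank[i - 1],
        "odd": lambda i: (i + 1) % 2 == 1,
        "even": lambda i: (i + 1) % 2 == 0,
        "without_barcodes": lambda i: not has_barcode[i],
    }
    keep = rules[mode]
    return [i + 1 for i in range(len(has_barcode)) if keep(i)]


barcodes = [True, False, False, True, False]  # separator sheets on pages 1 and 4
blanks = [False] * 5
print(select_pages("after_barcode", barcodes, blanks))     # [2, 5]
print(select_pages("without_barcodes", barcodes, blanks))  # [2, 3, 5]
```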

Case Fixing

This option allows you to automatically case fix the OCR results, forcing the results to be all UPPER CASE, lower case, or Title Case (first letter of each word). If a Dictionary File is specified, the case used in that file will override this setting.

Strip Spaces from Result

This option strips any spaces from the OCR result. This is very useful when using template matching or dictionary lookups, because spaces are sometimes recognized by mistake, causing the match to not be found. The Spaces to Strip option can be used to modify the behavior of this function to strip other classes of characters.

Strip Characters from Result

Enter a list of characters that you want to remove from OCR results prior to template and dictionary matching. You can also use this in place of templates by removing all unwanted characters from your OCR zone and keeping whatever remains. This technique gives you a partial result when recognition mistakes occur, whereas templates or dictionaries would leave the field blank.

This setting can also be used with non-OCR fields to remove unwanted characters from barcodes, database fields, dates, etc.

Here are several helpful hints for using this setting:

Enter the values %LF% and %TAB% to remove line breaks and tab characters, since these cannot be typed.
There are several examples available in the drop-down menu with common lists of characters that can be selected automatically.
You can manually type or copy/paste values into this field.
A good technique to use is to copy and paste any extra characters that appear in that field during OCR until only valid characters remain.
Use Notepad to edit a long list of special characters or to save lists for later use.
Use the Character Map (in Start Menu/Accessories) to find special characters.
Enter %##% to strip the character whose ASCII code is ##. For example, %10% removes line feeds and %13% removes carriage returns.
Set the Replace Character option to replace stripped characters with another.

Replace Character

Enter a character or characters here that will replace those stripped using the Strip Characters From Result option. This allows you to replace common mistakes, such as I and 1 or O and 0, or substitute a space or dash for line feeds and other unwanted characters.

ASCII character codes may be entered in this field to allow special characters to be used for replacement. For example, the single space character can be entered as %32%, Line Feeds are %10% and Tabs are %9%. A full list of ASCII character codes can be found if you search the web for “ASCII Table”.
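The interaction between Strip Characters from Result and Replace Character can be sketched as below. This is an approximation under stated assumptions: the helper names are hypothetical, and only the %LF%, %TAB% and numeric %##% codes mentioned on this page are decoded.

```python
import re

NAMED_CODES = {"LF": "\n", "TAB": "\t"}


def expand_codes(setting: str) -> str:
    """Turn %LF%, %TAB% and %##% tokens into the characters they stand for."""
    def decode(match):
        name = match.group(1)
        if name in NAMED_CODES:
            return NAMED_CODES[name]
        return chr(int(name))  # numeric ASCII code, e.g. %32% is a space
    return re.sub(r"%(LF|TAB|[0-9]+)%", decode, setting)


def strip_characters(text, strip_setting, replace_setting=""):
    """Remove every character listed in strip_setting, or swap it for the replacement."""
    unwanted = set(expand_codes(strip_setting))
    replacement = expand_codes(replace_setting)
    return "".join(replacement if ch in unwanted else ch for ch in text)


print(strip_characters("12-34\t56\n", "-%TAB%%LF%"))   # 123456
print(strip_characters("Line1\nLine2", "%10%", "%32%"))  # Line1 Line2
```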

Character Substitution

This option allows you to define several specific 'find and replace' operations on the OCR text that will take place before template and dictionary matching. This is useful for correcting common OCR errors automatically, such as a "1" being recognized as "I". Substitutions can be single characters or whole words and phrases. It is also possible to replace unprintable characters such as tabs and line feeds by entering their ASCII character code (e.g. %10% for line feeds).

In previous versions, replacements were set globally and applied to all OCR fields. In version 8.1 the replacements are set on the field level so you can do different replacements for each field. For example, replacing all I’s with 1 is useful for a numeric field but not text.

Template Control

The Template Control screen lets you create and test pattern matching templates used to extract data from OCR zones. Templates are also used in other fields to indicate pre-defined field values.

A list of valid templates for each field type is shown at the top. Select a template value and click Add to add it to the template.

You can type or use copy/paste to enter the Sample Text used to test pattern matching templates. Click the Test button to compare the template to the Sample Text. The first matching value will be displayed in the Result text box.

Use Regular Expressions

Check this option to enable Regular Expressions (RegEx), which allow you to define much more complex pattern matching templates using a standardized description language. Regular Expressions are a widely used standard, similar to “grep” for those familiar with UNIX.

It is possible to mix templates, having some use Regular Expressions and others use the SimpleIndex template format. Simply precede any template with ^^^ to indicate that template is a regular expression. This prefix will be added to the template automatically when the box is checked.

A complete description of regular expressions is beyond the scope of this document. However, you can search the web for the term “Regular Expression” to find many reference sites and samples for common data elements. Searching for the type of data you want with the term “Regular Expression” will usually take you right to an example of what you need. You will find there are often many ways to define the same pattern with RegEx. There are several RegEx formats available. SimpleIndex uses the "JavaScript" RegEx format, so keep this in mind when using third party RegEx tools.

Here are some example searches to help you find several common fields that are hard to capture with SimpleIndex templates but possible with Regular Expressions.

Email address regular expression
Phone number regular expression
Street address regular expression
City state zip regular expression
US Canada zip code regular expression
UK zip code regular expression
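As a small taste of what such a search might turn up, the sketch below finds a US ZIP code (with optional +4 suffix) and returns the first match, much as the Test button does. It is written with Python's re module for illustration; the pattern sticks to syntax that Python and JavaScript share, but the two dialects do differ in places, so test any expression in the Template Control screen before relying on it.

```python
import re

# Five digits, optionally followed by a hyphen and four more digits.
ZIP_PATTERN = re.compile(r"\b[0-9]{5}(?:-[0-9]{4})?\b")


def first_match(pattern, text):
    """Return the first match of pattern in text, or an empty string."""
    found = pattern.search(text)
    return found.group(0) if found else ""


print(first_match(ZIP_PATTERN, "Ship to: Austin, TX 78701-1234 USA"))  # 78701-1234
```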

Strip Fixed Characters in Front of OCR Template

This option allows you to use text “markers” to determine the position of a field when there is no unique template or dictionary lookup possible. For instance, an invoice number may always follow the word “INVOICE” on certain documents. Checking this option will allow you to enter “INVOICE ####” as the OCR template, but only have the invoice number and not the word “INVOICE” show up in the field.

Strip Fixed Characters at End of OCR Template

Same as the previous option, but strips fixed characters from the end of the template instead of the beginning, in case the marker appears after the text you are trying to recognize.
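The effect of the front-stripping option can be approximated as follows. The sketch treats everything before the first wildcard character (*, #, A, X or ?, as listed under Barcode, OCR and Template Fields) as the fixed marker; the function names are hypothetical, and the end-of-template option would mirror the same logic from the other side.

```python
WILDCARDS = set("*#AX?")


def fixed_prefix(template: str) -> str:
    """Return the literal text that precedes the first wildcard in a template."""
    prefix = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "\\" and i + 1 < len(template):
            prefix.append(template[i + 1])  # escaped character is literal
            i += 2
            continue
        if ch in WILDCARDS or ch == "|":
            break
        prefix.append(ch)
        i += 1
    return "".join(prefix)


def strip_front(matched_text: str, template: str) -> str:
    """Drop the fixed marker from the front of a matched value."""
    prefix = fixed_prefix(template)
    if matched_text.startswith(prefix):
        return matched_text[len(prefix):]
    return matched_text


print(strip_front("INVOICE 4821", "INVOICE ####"))  # 4821
```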

Barcode, OCR and Template Fields

For Barcode, OCR, and Template field types, the Template setting represents a series of specific letter and number combinations that the field value must match. The possible values for the template are:

*: any character
#: numbers only
A: letters only
X: any letter or number
?: optional characters. When several ? characters are placed at the end of a template, SimpleIndex will accept any letters, numbers, or the characters ()-&%@, until a non-matching character is reached.
Other: character must match exactly
\: enter backslash before *, #, A, X, ? or \ to indicate an exact match for this character in the template instead of the variable value
|: use the pipe character to separate multiple search templates to allow searching of many variations on the field format

Some example templates are:

Invoice \#: #######: the phrase “Invoice #:” followed by a 7-digit invoice number
###-##-####: social security number
##/##/####|#/#/####|##/#/####|#/##/####: date with 4-digit year and 1 or 2-digit month and day
ABC**##??: any letter, then the literal letters B and C, any 2 characters, 2 numbers, and up to 2 optional characters
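One way to get a feel for the template characters is to translate them into an equivalent regular expression. The sketch below does that in Python as an illustration only; it is not how SimpleIndex itself is implemented, the function names are invented, and the handling of ? (one optional character from the accepted set) is an assumption based on the description above.

```python
import re

# Template character -> regular expression fragment.
TOKENS = {
    "*": ".",                       # any character
    "#": "[0-9]",                   # numbers only
    "A": "[A-Za-z]",                # letters only
    "X": "[A-Za-z0-9]",             # any letter or number
    "?": "[A-Za-z0-9()&%@-]?",      # optional character (assumed set)
}


def template_to_regex(template: str) -> str:
    """Translate a template such as ###-##-#### into a regular expression."""
    parts = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "\\" and i + 1 < len(template):
            parts.append(re.escape(template[i + 1]))  # escaped: exact match
            i += 2
            continue
        if ch == "|":
            parts.append("|")           # separates alternative templates
        elif ch in TOKENS:
            parts.append(TOKENS[ch])
        else:
            parts.append(re.escape(ch))  # any other character matches exactly
        i += 1
    return "".join(parts)


def first_template_match(template: str, text: str) -> str:
    """Return the first text that matches the template, or an empty string."""
    found = re.search(template_to_regex(template), text)
    return found.group(0) if found else ""


print(first_template_match(r"Invoice \#: #######", "Ref Invoice #: 1234567 due"))
# Invoice #: 1234567
print(first_template_match("##/##/####|#/#/####|##/#/####|#/##/####", "Due 3/15/2024"))
# 3/15/2024
```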

Enter the template and click the Test button to see what match results from the text in the Sample Text box. There is some generic sample text provided that has examples of many common data elements like names, dates and numbers. To test the template with your own documents, copy and paste the OCR text into this window. The new sample text will be used next time you open the Template Editor.

There are also several built-in templates available to make it easy to find several common data elements:

%DATE% - find a date in any valid date format, including forms where the month is spelled out or abbreviated, with 2-digit or 4-digit years.
%DATE2% - find a date with a 2-digit year.
%DATE4% - find a date with a 4-digit year.
%MONEY% - find an amount of money.
%PHONEUS% - find a US phone number in many common formats.
%SSN% - find a US social security number in ###-##-#### format.
%FIELD#% - get the template from another field value, usually an Autofill field that lets you associate different templates with different documents, such as an invoice number template that is associated with a specific vendor name.

Text, Numbers and Date Fields

For Text, Numbers, and Date field types, the Template represents a default value that will appear automatically as the field value, but may be changed by the user if necessary.

For Date field types, you may also enter a Template for automatic date formatting. Enter %MM/DD/YYYY% to format dates in Month/Day/Year format. Use %YYYY-MM-DD% to format for proper sort order in filenames. Any of the date format masks used in Microsoft Office applications like Excel and Access may be used. There is also a global date formatting option in the Advanced Options screen that will reformat all date values in any field type, including Barcode and OCR.
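The two masks mentioned above can be illustrated with a short sketch. It handles only the YYYY, MM and DD tokens, whereas the application accepts the wider set of Office-style masks, and the function name is made up for the example. The last line shows why the year-first form sorts correctly as plain text.

```python
from datetime import date


def format_with_mask(value: date, template: str) -> str:
    """Apply a %...% date mask made of YYYY, MM and DD tokens."""
    mask = template.strip("%")
    return (
        mask.replace("YYYY", f"{value.year:04d}")
        .replace("MM", f"{value.month:02d}")
        .replace("DD", f"{value.day:02d}")
    )


d = date(2022, 1, 7)
print(format_with_mask(d, "%MM/DD/YYYY%"))  # 01/07/2022
print(format_with_mask(d, "%YYYY-MM-DD%"))  # 2022-01-07

# Year-first strings sort chronologically without any date parsing.
print(sorted(["2022-01-07", "2021-12-31", "2021-02-15"]))
# ['2021-02-15', '2021-12-31', '2022-01-07']
```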

These field types also accept the same constant values that Fixed fields use, such as %TODAY% for today's date. You should use these instead of a Fixed field if you want to allow the user to edit the calculated value. See below for the complete list.

For search configurations, you can enter <, >, <=, or >= in the Template for Date or Numbers fields to enable date or number range searches. To create a minimum and maximum search field, create 2 fields that are linked to the same database field and enter >= for the minimum value and <= for the maximum.

List Fields

For List field types, the path to the text file containing the list values is entered in the List File/Field setting. The Template field should be left blank.

Autofill Fields

For Autofill fields, the name of the corresponding database field for use in the lookup should be entered.

Autonumber Fields

For Autonumber fields, you may enter any letter and number combination, as long as the last character is numeric. The trailing digits are used as the numeric value to increment, with the other characters remaining constant. It is recommended that you prefix the numeric value with enough 0’s to ensure all numbers are the same length and preserve their sort order.
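The increment rule can be sketched in a few lines. This is an illustration of the behavior described above rather than the application's code; the function name and the step parameter are assumptions.

```python
import re


def next_autonumber(value: str, step: int = 1) -> str:
    """Increment the trailing digits of value, keeping the prefix and zero padding."""
    match = re.search(r"([0-9]+)$", value)
    if match is None:
        raise ValueError("the seed must end in a digit")
    digits = match.group(1)
    prefix = value[: match.start()]
    # zfill keeps the original width; the number only grows wider on overflow.
    return prefix + str(int(digits) + step).zfill(len(digits))


print(next_autonumber("INV-0009"))  # INV-0010
print(next_autonumber("A99"))       # A100
```

The second call shows why leading zeros are recommended: without them the value changes length, and "A100" sorts before "A99" as text.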

Fixed Fields

For Fixed fields, the template represents a pre-set value that cannot be changed. There are several variables that may also be used to substitute a calculated value based on system settings, input file path, file properties, and other field values.

OMR Fields

For OMR fields, enter the minimum number of black pixels in the zone for it to be considered "checked". Keep in mind that a typical 300dpi image will have 300x300 or 90,000 pixels per square inch.

Enter a negative number in the template to have the OMR region extracted to a separate image file whenever the threshold is met. This feature is useful for verifying and capturing signatures on documents. Use the Saved Region Filename setting to set the name for the extracted region files.
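The threshold logic can be sketched as below, assuming the zone has already been reduced to rows of 0/1 pixels where 1 means black. The function name and the pixel representation are assumptions for the example.

```python
def evaluate_omr(zone, template):
    """Return (checked, extract_region) for an OMR zone.

    zone: rows of 0/1 pixel values, 1 = black.
    template: the Template setting as an integer; a negative value means
    "also save the region to a separate file when the threshold is met".
    """
    black_pixels = sum(sum(row) for row in zone)
    checked = black_pixels > abs(template)
    extract_region = template < 0 and checked
    return checked, extract_region


zone = [
    [1, 1, 0],
    [0, 1, 0],
]  # 3 black pixels
print(evaluate_omr(zone, 2))   # (True, False)
print(evaluate_omr(zone, -2))  # (True, True)
print(evaluate_omr(zone, 3))   # (False, False)
```

On a real 300dpi scan the threshold would be far larger, since a one-inch square zone alone holds 90,000 pixels.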

Dictionary Matching

Use Dictionary Matching to compare a list of possible field values to the field text to find a matching value. For List fields the dictionary is used to create a selection list for the field.

The Dictionary File or Field setting can be the path to a text file that contains the list of values for the dictionary, or the name of a field in one of the data source tables defined on the Database or Autofill settings.

To select a text file, use the Browse... button to select a file from your computer.

To use a database field, type the field name exactly as it appears in your database table or query. To use a field from a table or query other than the one used by Autofill or Export, enter TableName|FieldName.

You can test the dictionary settings by clicking the Test button. The first dictionary item found in the Sample Text will be shown in the Result box.

List File/Field

Enter the path to a text file containing a list of values, one on each line. The OCR zone is searched for each of these values until a match is found. This is the best way to automatically index files where the field will come from a list of known possible values that will appear somewhere on the page, but whose location may vary. This may also be used to correct for skew and other factors that can cause an OCR zone to move. Use the Set button to open a browse window to allow selection of a dictionary file.

When connected to a database, a field name may be entered here, and the unique values from that field are used for a dictionary. This may be a field from either the Data Source setting on the Database tab or the Match Data Source setting in the Autofill settings, allowing you to define a separate database for lists and export. You may also specify an alternate table for the list by using the form “TABLE|FIELD” for this setting.

It is also possible to specify multiple search values for each dictionary entry (“Thesaurus Matching”). This allows you to search for many possible matching variations on a field label and have a standard value inserted in the field. This is done by creating a pipe-separated list (“|”) of search values on each line. If any of these values is found in the search area, the first one in the list is inserted in the field.

For example, this list entry will find the correct state if the name, abbreviation, or any of the major cities from that state appear in the search text:

California| CA |San Diego|Los Angeles|San Francisco

This entry will put the value “California” in the field if any of the words “California”, “ CA ”, “San Diego”, “Los Angeles” or “San Francisco” are found in the search text. Adding the space before and after “CA” ensures that the word “CAT” or “CAR” will not produce a false positive. This particular example was used to automatically classify tax documents coming from various municipalities by their state.

Another way to avoid false positives is to indicate negative keywords in the list. These are preceded by a “^”. For example:

North Carolina|Charlotte|Raleigh|Asheville|^Nashville

will prevent the word “Nashville” from matching on “Asheville” and giving a false positive. Dictionary terms are read from right to left, so place the negative keywords on the end of the line to search for the negative terms before a match is found.

When designing dictionaries, it is important to note that the values in the dictionary are searched in order until the first match is found. In order to minimize false positives, the most distinctive values should be placed first in the list, and values that may appear in other documents should be placed at the end.

For example, a vendor list for an invoice processing job may contain the entry “Microsoft”. However, it is likely that other invoices may contain this word as part of an item description. Hence, “Microsoft” should be placed towards the end of the vendor list.
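The matching rules in this section (first entry wins, pipe-separated variations, "^" negative keywords) can be sketched as follows. This is a simplified model: it uses exact, case-sensitive substring tests and a hypothetical function name, and it ignores the Max Errors tolerance.

```python
def match_dictionary(lines, text):
    """Return the standard value of the first dictionary line that matches text."""
    for line in lines:
        terms = line.split("|")
        negatives = [term[1:] for term in terms if term.startswith("^")]
        positives = [term for term in terms if not term.startswith("^")]
        # Negative keywords are checked first and veto the whole entry.
        if any(term in text for term in negatives):
            continue
        if any(term in text for term in positives):
            return positives[0]  # first value on the line is the one inserted
    return ""


dictionary = [
    "California| CA |San Diego|Los Angeles|San Francisco",
    "North Carolina|Charlotte|Raleigh|Asheville|^Nashville",
]
print(match_dictionary(dictionary, "City of San Diego tax notice"))  # California
print(match_dictionary(dictionary, "Mailed from Charlotte"))         # North Carolina
```

Because entries are tried in order, moving a common word such as a well-known vendor name to the end of the list gives the more distinctive entries the first chance to match.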

Only Allow Values in List

Check this option to prevent users from manually entering a value that is not in the list of pre-defined values from the list file.

Max Errors

This feature lets you automatically correct for mistakes in the OCR when using dictionaries. This setting is a decimal value, usually between 0.05 and 0.30. This value is multiplied by the number of characters in the dictionary entry to determine the number of incorrect characters the field will accept.

For example, if Max Errors is set to 0.20 and the current dictionary entry is:

Simple Software

The dictionary entry has 15 characters (including the space), and 15 x 0.20 = 3, so up to 3 non-matching characters will be accepted for this value. This means that “5imp1e S0ftware” will also be recognized correctly as “Simple Software”. If the dictionary entry had 4 or fewer characters, all would have to be correct to consider the value a match. Be careful not to set the Max Errors value too high, or false positives will result.
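The arithmetic above can be checked with a short sketch. Two things here are assumptions rather than documented behavior: that the product is rounded down (inferred from the statement that entries of 4 or fewer characters allow no errors at 0.20), and that errors are counted by comparing characters position by position. The function names are hypothetical.

```python
import math


def allowed_errors(entry: str, max_errors: float) -> int:
    """Number of wrong characters tolerated for a dictionary entry."""
    # The small epsilon guards against floating-point results like 2.9999999.
    return math.floor(len(entry) * max_errors + 1e-9)


def fuzzy_find(entry: str, text: str, max_errors: float) -> str:
    """Return entry if some same-length slice of text is within the error budget."""
    budget = allowed_errors(entry, max_errors)
    size = len(entry)
    for start in range(len(text) - size + 1):
        window = text[start : start + size]
        errors = sum(a != b for a, b in zip(entry, window))
        if errors <= budget:
            return entry
    return ""


print(allowed_errors("Simple Software", 0.20))  # 3
print(allowed_errors("Acme", 0.20))             # 0
print(fuzzy_find("Simple Software", "Bill to: 5imp1e S0ftware Inc", 0.20))
# Simple Software
```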