Dictionary Matching

Use Dictionary Matching to compare a list of possible values to the field OCR text to find a matching value.

For List fields, the dictionary is used to create a selection list for the field.

Selecting a List

The Dictionary File or Field setting can be either the path to a text file that contains the list of values for the dictionary, or the name of a field in one of the data source tables defined in the Database or Autofill settings.

To select a text file, use the Browse... button to select a file from your computer.

When connected to a database, a field name may be entered here, and the unique values from that field are used for the dictionary or list. This may be a field from either the Data Source setting on the Database tab or the Match Data Source setting in the Autofill settings, allowing you to define a separate database for lists and export. You may also specify an alternate table for the list by using the form “TABLE|FIELD” for this setting.

List Formatting

The list file or field should contain a list of values, one on each line or row. The OCR zone is searched for each of these values until a match is found. This is the best way to automatically index files where the field will come from a list of known possible values that will appear somewhere on the page, but whose location may vary.

It is also possible to specify multiple search values for each dictionary entry (“Thesaurus Matching”). This allows you to search for many possible variations on a field label and have a standard value inserted in the field. To do this, put a pipe-separated list (“|”) of search values on each line. If any of these values is found in the search area, the first value on that line is inserted in the field.

For example, this list will find the correct state if the name, abbreviation, or any of the major cities from that state appear in the search text:

California| CA |San Diego|Los Angeles|San Francisco
New York| NY |Albany|Niagara|White Plains
Texas| TX |Dallas|Houston|San Antonio
Georgia| GA |Atlanta|Macon|Savannah

The first entry puts the value “California” in the field if any of “California”, “ CA ”, “San Diego”, “Los Angeles” or “San Francisco” is found in the search text. The space before and after “CA” ensures that a word such as “CAT” or “CAR” does not produce a false positive. This particular example was used to automatically classify tax documents from various municipalities by state.
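The thesaurus lookup described above can be sketched in a few lines of Python. This is only an illustration, not SimpleIndex's actual implementation: the function name `match_thesaurus` is hypothetical, and the sketch assumes plain case-sensitive substring matching with no OCR error tolerance.

```python
# Illustrative sketch only; not SimpleIndex's actual implementation.
STATE_LINES = [
    "California| CA |San Diego|Los Angeles|San Francisco",
    "Texas| TX |Dallas|Houston|San Antonio",
    "Georgia| GA |Atlanta|Macon|Savannah",
]


def match_thesaurus(lines, text):
    """Return the first term of the first line that has any term in text."""
    for line in lines:
        # Split on the pipe only, so the spaces around " CA " are preserved.
        terms = line.split("|")
        # Assumption: a term matches when it occurs as a case-sensitive substring.
        if any(term in text for term in terms):
            return terms[0]
    return None


print(match_thesaurus(STATE_LINES, "Remit to: City of Houston Tax Office"))  # Texas
print(match_thesaurus(STATE_LINES, "Sacramento, CA 95814"))                  # California
print(match_thesaurus(STATE_LINES, "CARGO AND CATERING"))                    # None
```

The last call returns no match because “ CA ” requires a space on both sides, which neither “CARGO” nor “CATERING” provides.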

Another way to avoid false positives is to indicate negative keywords in the list. These are preceded by a “^”. For example:

North Carolina|Charlotte|Raleigh|Asheville|^Nashville

prevents “Nashville” in the search text from being matched as “Asheville” and giving a false positive.

The terms on each line are read from right to left, so place negative keywords at the end of the line. That way the negative terms are checked before any other term can produce a match.
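The effect of keyword position can be illustrated with a small Python sketch. It rests on several assumptions that the text above does not confirm: that a negative keyword found in the text rejects the whole entry, that evaluation stops at the first term that decides the outcome, and that matching is a plain substring test. The function name `entry_matches` is hypothetical.

```python
# Illustrative sketch only; the veto semantics are an assumption.
def entry_matches(line, text):
    """Walk the terms of one dictionary line from right to left."""
    for term in reversed(line.split("|")):
        if term.startswith("^"):
            # Negative keyword: reject the entry if the word is present.
            if term[1:] in text:
                return False
        elif term in text:
            # Positive term found before any negative keyword fired.
            return True
    return False


negative_last = "North Carolina|Charlotte|Raleigh|Asheville|^Nashville"
negative_early = "North Carolina|^Nashville|Charlotte|Raleigh|Asheville"
text = "Shipped from Nashville to Raleigh"

print(entry_matches(negative_last, text))   # False: ^Nashville is checked first
print(entry_matches(negative_early, text))  # True: Raleigh matches before ^Nashville is reached
```

Under these assumptions, a negative keyword placed early in the line is never consulted once a term to its right has matched.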

Just as the "|" separator in the dictionary can match a list entry based on one phrase or another, the "&&" can be used to match one phrase AND another. For example:

New York City|Bronx&&Manhattan&&Brooklyn&&Queens&&Staten Island

will only match New York City if all five boroughs (Bronx, Brooklyn, Manhattan, Queens and Staten Island) are found.
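A minimal Python sketch of the AND rule for a single term follows. The helper `term_matches` is hypothetical, matching is assumed to be a plain substring test, and the sketch deliberately says nothing about how the first value on the line is treated.

```python
# Illustrative sketch only; not SimpleIndex's actual implementation.
def term_matches(term, text):
    """A term joined with '&&' matches only if every phrase is in text."""
    return all(phrase in text for phrase in term.split("&&"))


boroughs = "Bronx&&Manhattan&&Brooklyn&&Queens&&Staten Island"

all_five = "Offices in Manhattan, Brooklyn, Queens, the Bronx and Staten Island"
only_two = "Offices in Manhattan and Brooklyn"

print(term_matches(boroughs, all_five))  # True
print(term_matches(boroughs, only_two))  # False
```

A term without “&&” splits into a single phrase, so the same helper also covers ordinary terms.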

Dictionary entries are searched in order from the first line or row, and the search stops as soon as a match is found. To minimize false positives, place the most distinctive values first in the list, and place values that may also appear in other documents at the end.

For example, a vendor list for an invoice processing job may contain the entry “Microsoft”. However, it is likely that other invoices may contain this word as part of an item description. Hence, “Microsoft” should be placed towards the end of the vendor list.
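The following Python sketch shows why the order matters when the search stops at the first match. The vendor name “Contoso”, the sample invoice text, and the function `first_match` are made up for illustration, and matching is assumed to be a plain substring test.

```python
# Illustrative sketch only; vendor names and invoice text are hypothetical.
def first_match(entries, text):
    """Return the first entry, in list order, that occurs in text."""
    for entry in entries:
        if entry in text:
            return entry
    return None


invoice_text = "Contoso Supplies  Invoice 1001  Item: Microsoft Office license"

print(first_match(["Microsoft", "Contoso"], invoice_text))  # Microsoft (false positive)
print(first_match(["Contoso", "Microsoft"], invoice_text))  # Contoso
```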

With a text file, you can simply move the lines around to put them in the desired order. To order a list that comes from a database, add a numeric column that indicates the sort order and sort the view by that column.

Only Allow Values in List

Check this option to prevent users from manually entering a value that is not in the list of pre-defined values from the list file.

Max Errors

This feature lets you automatically correct for mistakes in the OCR when using dictionaries. This setting is a decimal value, usually between 0.05 and 0.30. This value is multiplied by the number of characters in the dictionary entry to determine the number of incorrect characters the field will accept.

For example, if Max Errors is set to 0.20 and the current dictionary entry is:

Simple Software

The dictionary entry has 15 characters, and 15 x 0.20 = 3, so up to 3 non-matching characters are accepted for this value. This means that “5imp1e S0ftware” will also be recognized correctly as “Simple Software”. If the dictionary entry had 4 or fewer characters, all of them would have to be correct for the value to match, because 4 x 0.20 = 0.8 is less than one whole character. Be careful not to set the Max Errors value too high, as this leads to false positives.
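The arithmetic above can be checked with a short Python sketch. The function names are hypothetical. Rounding down to a whole number of characters is an assumption inferred from the 4-character case, and the comparison only counts substituted characters in strings of equal length, which may be simpler than what SimpleIndex actually does.

```python
# Illustrative sketch only; rounding and error counting are assumptions.
import math


def allowed_errors(entry, max_errors):
    """Number of wrong characters tolerated for one dictionary entry."""
    # The small epsilon guards against floating-point results such as 2.9999999.
    return math.floor(len(entry) * max_errors + 1e-9)


def fuzzy_equal(entry, candidate, max_errors):
    """Compare equal-length strings, tolerating a few substituted characters."""
    if len(entry) != len(candidate):
        return False
    wrong = sum(a != b for a, b in zip(entry, candidate))
    return wrong <= allowed_errors(entry, max_errors)


print(allowed_errors("Simple Software", 0.20))                   # 3
print(allowed_errors("Test", 0.20))                              # 0
print(fuzzy_equal("Simple Software", "5imp1e S0ftware", 0.20))   # True
print(fuzzy_equal("Simple Software", "5imp1e S0ftwar3", 0.20))   # False
```

The last call fails because the candidate contains four wrong characters, one more than the three allowed.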

Testing Dictionary Matching

You can test the dictionary settings by clicking the Test button. The first dictionary item found in the Sample Text will be shown in the Result box.