Duplicate Page Detection

SimpleIndex can combine Full-Page OCR with the Autofill feature to perform advanced, content-based duplicate detection.

Background

Most applications that detect duplicate files only find exact copies of a digital file. They use something similar to the %FILEHASH% Fixed Field to generate a unique "fingerprint" value for each file, which can be used to quickly detect any exact copies.
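
As a rough illustration of the fingerprint idea, the sketch below hashes file contents with SHA-256. The algorithm behind %FILEHASH% is not stated here, so SHA-256 and the function name file_fingerprint are assumptions made only for this example.

```python
import hashlib


def file_fingerprint(data: bytes) -> str:
    """Return a hex digest that identifies the exact byte content.

    SHA-256 is an assumption; the hash used by %FILEHASH% is not documented here.
    """
    return hashlib.sha256(data).hexdigest()


original = b"Invoice 1001 scanned page bytes"
exact_copy = bytes(original)
# One extra byte stands in for the small variations a second scan introduces.
rescanned = original + b"\x00"

print(file_fingerprint(original) == file_fingerprint(exact_copy))  # True
print(file_fingerprint(original) == file_fingerprint(rescanned))   # False
```

Any change to the bytes, however small, produces a different digest, which is why a hash can only find exact copies.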

But what if the original page was copied and scanned a second time? In this case the digital fingerprint of each file will be different, because every copy or scan introduces slight variations. The Full-Page OCR text will not be an exact match either, since stray marks and other variations can be read differently. In these cases, a Fuzzy Matching algorithm is necessary to identify duplicate pages while accounting for the minor differences in the Full-Page OCR text.

Setup and Processing Workflow

In summary, the steps to configure and process a Job File for duplicate detection are:

  1. Create a Job File that performs Full-Page OCR on single-page files
  2. Create key fields that identify unique keywords or phrases on each page
  3. Configure the Database Export to create a table with a record for each page containing the key values and page text
  4. Configure the Autofill feature to match on the exported key values
  5. Set the Fuzzy Text Matching % value to indicate the maximum percentage of differences in the text
  6. Create an Autofill field and map it to the IMAGEPATH field, or some other unique field

When you run the job, pages with duplicate text will populate the Autofill field with a value. If no duplicate is found, the value will be blank. If you use the IMAGEPATH field as the value, it will indicate the path to the duplicate file for reference. Operators can then review and delete duplicate pages as part of their workflow, or duplicates can be automatically sorted into a separate folder for review.
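
The workflow above can be sketched outside SimpleIndex as follows. This is a minimal illustration, not the product's implementation: the table name pages, its columns, and the helper functions are hypothetical, and Python's difflib stands in for the fuzzy comparison.

```python
import sqlite3
from difflib import SequenceMatcher


def percent_different(a: str, b: str) -> float:
    """Approximate percentage of text that differs (difflib is a stand-in)."""
    return (1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()) * 100.0


def find_duplicate(conn, key, text, max_diff=10.0):
    """Return the stored image path of the first fuzzy match, or "" if none."""
    rows = conn.execute(
        "SELECT imagepath, pagetext FROM pages WHERE dupkey = ? ORDER BY rowid",
        (key,),
    ).fetchall()
    for imagepath, stored_text in rows:
        if percent_different(text, stored_text) <= max_diff:
            return imagepath
    return ""


def export_page(conn, key, imagepath, text):
    """Insert one record per page, mirroring the database export step."""
    conn.execute(
        "INSERT INTO pages (dupkey, imagepath, pagetext) VALUES (?, ?, ?)",
        (key, imagepath, text),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (dupkey TEXT, imagepath TEXT, pagetext TEXT)")

# Hypothetical pages: (key values, image path, OCR text).
pages = [
    ("Payment-Total-Amount", "batch1/0001.tif",
     "Invoice 1001 Total amount due 250.00 Payment terms net 30"),
    ("Payment-Total-Amount", "batch2/0007.tif",
     "Invoice 1001 Total arnount due 250.00 Payment terms net 30"),
    ("Meeting-Notes-Agenda", "batch2/0008.tif",
     "Agenda and meeting notes for the quarterly review"),
]

results = {}
for key, path, text in pages:
    results[path] = find_duplicate(conn, key, text)  # blank means no duplicate
    export_page(conn, key, path, text)

for path, duplicate_of in results.items():
    print(path, "->", duplicate_of or "(no duplicate)")
```

The second page differs from the first by a single OCR misread ("arnount"), so its lookup returns the path of the first page, while the other two lookups return a blank value.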

Performance

Duplicate detection is most important when there are a lot of pages. However, running a fuzzy text match for each page against thousands or millions of other pages would take several minutes per page.

If each page has some unique value or combination of values that can be extracted with Template or Dictionary matching, then content-based duplicate detection is not really necessary. You can just extract these values from each page and Autofill match them against the previously exported values. Content-based detection is necessary when many different types of files are mixed together, as in backfile conversions or litigation support.

To reduce the number of fuzzy matching comparisons needed to identify duplicate pages, capture multiple terms to an Autofill key field. The terms should come from generic Template matches that pick out some arbitrary word or phrase on each page, one that will be the same for duplicate pages but different for distinct ones. SimpleIndex now has a Template preset, %DUPID%, that finds the third seven-letter word, the fourth five-letter word, and the third six-letter word from the bottom. These terms are joined with dashes and stored in a Matching field in a database, for example "Accountant-Order-Credit". As subsequent batches are scanned, the same reference terms are captured and compared to the Matching field, and if a match is present, the user is notified.
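
The key-building idea can be sketched as below. This is not the %DUPID% implementation: the function names are hypothetical, and the sketch assumes that a "word" is a run of letters and that all three terms are counted from the end of the page text.

```python
import re


def nth_word_from_bottom(words, length, n):
    """Return the n-th word of the given length counting from the end, or ""."""
    matches = [w for w in reversed(words) if len(w) == length]
    return matches[n - 1] if len(matches) >= n else ""


def dup_key(page_text: str) -> str:
    """Build a dash-separated key from three arbitrary but repeatable words."""
    words = re.findall(r"[A-Za-z]+", page_text)  # assumption: letters only
    parts = [
        nth_word_from_bottom(words, 7, 3),  # third seven-letter word
        nth_word_from_bottom(words, 5, 4),  # fourth five-letter word
        nth_word_from_bottom(words, 6, 3),  # third six-letter word
    ]
    return "-".join(parts)


page = "Account Balance Summary\nOrder Total Price Terms\nCredit Amount Number"
rescan = page + "\n. ,"  # stray marks that OCR read as punctuation

print(dup_key(page))                     # Account-Order-Credit
print(dup_key(page) == dup_key(rescan))  # True
```

Because the key depends only on a few words, small OCR differences elsewhere on the page usually leave it unchanged, so duplicates still land on the same key.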

If your documents include a lot of standard forms that generate the same values on every page, you may adjust this to use more complex patterns. The goal is to ensure that most pages will have fewer than 100 matches on any unique combination of key values across the full dataset, which allows the fuzzy matching to be performed without a noticeable difference in processing speed.
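
One way to check that goal against exported data is to count how many pages share each key. The sketch below is a hypothetical helper run over a list of key values; it is not a SimpleIndex feature, and the sample keys are made up.

```python
from collections import Counter


def oversized_keys(keys, limit=100):
    """Return keys shared by `limit` or more pages, with their page counts."""
    counts = Counter(keys)
    return {key: count for key, count in counts.items() if count >= limit}


# Hypothetical export: one key value per page.
exported_keys = ["Form-Name-Date"] * 150 + ["Account-Order-Credit"] * 3

print(oversized_keys(exported_keys))  # {'Form-Name-Date': 150}
```

Keys that show up in the result are candidates for a more complex pattern, since every page carrying them would need a fuzzy comparison against all of the others.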

Configuration Details

To correctly configure the duplicate detection function, consider these options:

  • If there are multiple files with matching text, a dialog asks which matching record to use. If you do not care which file is referenced, check the Automatically select first matching record option on the Autofill settings to suppress this dialog and select the first match automatically.
  • In the initial release of this feature, the Fuzzy Text Matching % setting is not available on the Autofill settings screen. To configure this setting, open the Job File in Notepad and search for the MATCH_TEXT value. Set the value to a decimal number between 0 and 100, indicating the percentage of non-matching characters that will be allowed to consider the page text a match. A value of 10 will generally allow for most typical OCR differences without producing false positive matches, but a lower value may be needed if you have many pages using standard forms that only differ by a few field values.
  • The Database settings must be configured to use Insert mode to export a record for each page. The Full Text OCR field and Output File Field should both be mapped, as well as the key fields.
  • The duplicate detection does not have to be performed in the same job as the Import and OCR steps. When it runs as a separate job, the duplicate detection job can set the Database mode to Disabled, but the Full Text OCR field must still be configured. Temporarily set the Database mode to Insert to enter a value in this setting, then set it back to Disabled when finished.
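
To show how a percentage of non-matching characters might be interpreted, the sketch below measures the edit distance between two texts and compares it with a threshold of 10. The comparison algorithm SimpleIndex uses is not described here, so Levenshtein distance and the function names are assumptions for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def percent_mismatch(a: str, b: str) -> float:
    """Edit distance as a percentage of the longer text's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 100.0 * edit_distance(a, b) / longest


def is_text_match(a: str, b: str, match_text: float = 10.0) -> bool:
    """True when the mismatch stays within the allowed percentage."""
    return percent_mismatch(a, b) <= match_text


# An OCR misread of "m" as "rn" costs two edits out of 25 characters.
print(percent_mismatch("Total amount due: 250.00", "Total arnount due: 250.00"))  # 8.0
# Two forms that differ only in their field values are not treated as duplicates.
print(is_text_match("Name: Smith Account: 1234", "Name: Jones Account: 9876"))  # False
```

On short texts a few field values make up a large share of the characters, but on a full page of standard form text the same few values are a much smaller share, which is why a lower threshold may be needed for forms.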

SimpleIndex Video

SimpleIndex is updated frequently. Refer to the wiki documentation for the latest updates.