SimpleIndex - Duplicate Page Detection Video


SimpleIndex: Duplicate Page Detection

Most applications that detect duplicate files will only find exact copies of a digital file. They use something similar to the %FILEHASH% Fixed Field to generate a unique "fingerprint" value for each file, which can be used to quickly detect any exact copies.

But what if the original page was copied and scanned a second time? In that case the digital fingerprint will be unique, because every copy or scan introduces slight variations. The Full-Page OCR text will not be an exact match either, since stray marks and other variations can be read differently. In these cases, a fuzzy matching algorithm is necessary to identify duplicate pages while accounting for the minor differences in the Full-Page OCR text.
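The idea can be sketched with Python's standard-library difflib; SimpleIndex's actual matching algorithm is not documented here, and the sample invoice text and 10% threshold below are illustrative only:

```python
# A minimal sketch of fuzzy text matching between two OCR'd pages.
# difflib stands in for SimpleIndex's internal matcher; the threshold
# plays the role of the Fuzzy Text Matching % setting.
from difflib import SequenceMatcher

def is_fuzzy_duplicate(text_a: str, text_b: str, max_diff_pct: float = 10.0) -> bool:
    """Return True if the two texts differ by no more than max_diff_pct percent."""
    similarity = SequenceMatcher(None, text_a, text_b).ratio()  # 0.0 to 1.0
    return (1.0 - similarity) * 100 <= max_diff_pct

original  = "Invoice 10423 Acme Supply Co Total Due 512.40"
rescanned = "Invoice 10423 Acme Supp1y Co Tota1 Due 512.40"  # OCR misread 'l' as '1'
print(is_fuzzy_duplicate(original, rescanned))  # → True
```

An exact hash comparison would reject the rescanned page outright; the ratio-based check tolerates the two misread characters because they amount to well under 10% of the text.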

Setup and Processing Workflow: In summary, the steps to configure and process a Job File for duplicate detection are:

1) Create a Job File that performs Full-Page OCR on single page files

2) Create key fields that identify unique keywords or phrases on each page

3) Configure the Database Export to create a table with a record for each page containing the key values and page text

4) Configure the Autofill feature to match on the exported key values

5) Set the Fuzzy Text Matching % value to the maximum percentage of difference allowed between two pages' text

6) Create an Autofill field and map it to the IMAGEPATH field, or some other unique field

7) When you run the job, pages with duplicate text will populate the Autofill field with a value; if no duplicate is found, the field will be blank. If you use the IMAGEPATH field as the value, it will hold the path to the duplicate file for reference. Operators can then review and delete duplicate pages as part of their workflow, or duplicates can be automatically sorted into a separate folder for review.

This guide outlines how to set up and use SimpleIndex's duplicate page detection feature to identify and manage redundant scans in your archive.
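The steps above can be sketched end to end in Python, with SQLite standing in for the exported database. The `pages` table and its `match_value` and `image_path` columns are assumed names, not SimpleIndex's real schema, and for brevity this sketch uses an exact match on the match value where SimpleIndex would apply fuzzy matching:

```python
# Hedged sketch of the export-and-autofill loop: each processed page's
# match value is looked up in the database; a hit fills the autofill
# field with the prior file's path, a miss inserts a new record.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (match_value TEXT, image_path TEXT)")

def process_page(match_value: str, image_path: str) -> str:
    """Return the path of a previously seen duplicate, or '' if the page is new."""
    row = conn.execute(
        "SELECT image_path FROM pages WHERE match_value = ?", (match_value,)
    ).fetchone()
    if row:
        return row[0]          # duplicate: autofill field gets the prior file's path
    conn.execute("INSERT INTO pages VALUES (?, ?)", (match_value, image_path))
    return ""                  # no duplicate: autofill field stays blank

print(process_page("INVOICE-ACME-10423", r"C:\scans\batch1\0001.tif"))  # blank: new page
print(process_page("INVOICE-ACME-10423", r"C:\scans\batch2\0007.tif"))  # prior file's path
```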

1. Access SimpleIndex and Job Settings [02:39]

  • Open SimpleIndex and load the "dupe detection" job.
  • Navigate to the settings to configure the job.

2. Configure File Handling [02:52]

  • Split Multi-Page Files: Ensure this setting is enabled, as detection analyzes single pages.
  • Separate File Output: Confirm pages are output separately so that each has its own keywords for comparison.

3. Database Setup [03:22]

  • Database Connection: Connect to a database (e.g., Microsoft Access, SQL).
  • Output File Field: Set this to "image path" to store the file's location for duplicate referencing.

4. Indexing and File Naming [03:58]

  • Document Type: Include a document type for file naming (customizable).
  • Unique Identifiers: Use a batch ID and auto-number field to ensure unique file names for the output documents.
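The way a batch ID plus an auto-number guarantee unique output names can be illustrated with a short sketch; the "Invoice"/"B001" naming pattern is hypothetical and does not reflect SimpleIndex's actual file-naming template syntax:

```python
# Illustrative only: combining a document type, batch ID, and an
# incrementing counter so no two output files share a name.
import itertools

def file_namer(doc_type: str, batch_id: str):
    counter = itertools.count(1)
    def next_name() -> str:
        return f"{doc_type}_{batch_id}_{next(counter):04d}.tif"
    return next_name

name = file_namer("Invoice", "B001")
print(name())  # → Invoice_B001_0001.tif
print(name())  # → Invoice_B001_0002.tif
```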

5. OCR and Keyword Extraction [04:35]

  • Full Page OCR: Enable full-page OCR to read and extract text from every page.
  • Define Keyword Template: Use a template (e.g., %dup ID%) with a regular expression to find specific keywords (e.g., the third seven-letter word, the fourth five-letter word, and the first six-letter word from the bottom). These keywords combine to form a unique page identifier.
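The positional keyword extraction described above can be approximated with regular expressions. This Python sketch is an illustration only; SimpleIndex's template syntax and the exact patterns it uses differ, and the sample text and hyphen-joined format are assumptions:

```python
# Sketch of building a page identifier from positional keywords:
# the third 7-letter word, the fourth 5-letter word, and the first
# 6-letter word from the bottom (i.e. the last one in reading order).
import re

def page_keywords(text: str) -> str:
    seven = re.findall(r"\b[A-Za-z]{7}\b", text)  # all 7-letter words, in order
    five  = re.findall(r"\b[A-Za-z]{5}\b", text)  # all 5-letter words
    six   = re.findall(r"\b[A-Za-z]{6}\b", text)  # all 6-letter words
    parts = [
        seven[2] if len(seven) > 2 else "",  # third seven-letter word
        five[3]  if len(five)  > 3 else "",  # fourth five-letter word
        six[-1]  if six        else "",      # first six-letter word from the bottom
    ]
    return "-".join(p.upper() for p in parts)

sample = "alpha bravo charlie delta foxtrot gamma echoes golfing"
print(page_keywords(sample))  # → GOLFING-GAMMA-ECHOES
```

Because the same words reappear in the same positions when the same page is rescanned, the combined value is stable enough to serve as a page identifier even when other parts of the OCR text vary.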

6. Autofill and Duplicate Comparison [05:51]

  • Match Value: The system generates a "match value" from the extracted keywords.
  • Database Comparison: This match value is compared against a "match field" in your database.
  • Possible Duplicate Notification: If a match is found, the path to the existing duplicate file in the archive is displayed to the operator.
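The fuzzy comparison of a new page's match value against stored match-field values can be sketched with `difflib.get_close_matches`; the stored values, Windows paths, and 10% threshold below are illustrative, not taken from SimpleIndex:

```python
# Sketch of the match-field lookup: find the closest stored match value
# within the allowed difference percentage and report its file path.
from difflib import get_close_matches

stored = {
    "INVOICE-ACME-10423":  r"C:\scans\batch1\0001.tif",
    "RECEIPT-GLOBEX-9981": r"C:\scans\batch1\0002.tif",
}

def find_possible_duplicate(match_value: str, max_diff_pct: float = 10.0) -> str:
    """Return the path of the closest stored duplicate, or '' if none qualifies."""
    cutoff = 1.0 - max_diff_pct / 100.0
    hits = get_close_matches(match_value, stored.keys(), n=1, cutoff=cutoff)
    return stored[hits[0]] if hits else ""

# One OCR-misread digit still resolves to the previously scanned file:
print(find_possible_duplicate("INVOICE-ACME-10428"))
```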

7. Running the Duplicate Detection Job [06:44]

  • Start Job: Initiate the SimpleIndex job.
  • OCR Processing: SimpleIndex performs full-page OCR and extracts keywords from each file.
  • Review and Save: Enter a document type, review the extracted match value, and save processed files.

8. Processing Subsequent Batches and Identifying Duplicates [10:13]

  • Add New Files: Place new files into the input folder.
  • Run Job Again: Execute the SimpleIndex job.
  • Identify Duplicates: If a "possible duplicate" is found, the path to the previously scanned file is displayed.
  • Operator Action: The operator can compare the documents, cancel the job for all duplicates, or flag specific duplicates for deletion.
  • Release Batch: Release the batch once decisions are made.
  • Database Update: The database updates with new unique records, ensuring unique files in the output folder.