Document Classification

SimpleIndex performs complex document classification using a keyword matching system that identifies similar documents with different formatting, without requiring any training.

Most classification systems use artificial intelligence to train a classification model based on many sample documents. While these systems can be easy to configure and very fast, they have a few drawbacks:

  • New examples must be added to the training model before they can be recognized
  • Most documents must be separated into individual files before classification
  • Attachments and miscellaneous pages are not easily handled
  • Long documents such as contracts, appraisals, and reports are handled poorly

SimpleIndex performs document separation and classification simultaneously, making it particularly well suited to documents like mortgage loans, which often arrive as one large PDF but contain hundreds of individual documents.

SimpleIndex also has a unique Post-Process workflow that lets you automatically execute document-specific workflows after classification.

Classifying Documents

Document Classification in SimpleIndex can be performed in a number of ways, depending on how many document types you have and how their layouts vary.

If you have control over the document layout, adding a barcode corresponding to the document type is the fastest and most reliable way to classify those documents after scanning. Creating an OCR zone with a text-based document ID is a bit slower and less accurate with scanned images, but just as fast when processing electronic files like PDF and Office documents.

Use Dictionary Matching when you have many different document types to classify. This method matches unique key phrases in each document to a master list. For best results, use the following guidelines:

  • Use unique phrases with 3-5 words that will never appear on other document types.
  • Identify the first page of each document only. Unidentified pages can be appended automatically.
  • Use the && operator to match on multiple short keywords if longer phrases are not available.
  • Use the ^ operator to specify "negative" keywords when false positive matches can't be avoided.
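
The guidelines above can be illustrated with a minimal Python sketch. It is not SimpleIndex's implementation: the rule strings, the way && and ^ are combined in a single entry, the case-insensitive substring comparison, and the document type names are all assumptions made for illustration. It shows a first page being identified by its key phrases and unidentified pages being appended to the current document.

```python
def parse_rule(rule):
    """Split a rule into required and negative keywords.

    Assumed (hypothetical) syntax: required terms joined by '&&',
    followed by optional negative terms, each introduced by '^'.
    """
    positive_part, *negatives = rule.split("^")
    required = [t.strip().lower() for t in positive_part.split("&&") if t.strip()]
    excluded = [t.strip().lower() for t in negatives if t.strip()]
    return required, excluded


def matches(rule, page_text):
    """True if every required term appears and no negative term does."""
    text = page_text.lower()
    required, excluded = parse_rule(rule)
    if not required:
        return False
    return all(t in text for t in required) and not any(t in text for t in excluded)


def classify_page(page_text, dictionary):
    """Return the first document type whose rule matches, else None."""
    for doc_type, rule in dictionary.items():
        if matches(rule, page_text):
            return doc_type
    return None


def separate(pages, dictionary):
    """Group pages into documents: a matched page starts a new document,
    and unmatched pages are appended to the current one."""
    documents = []
    for text in pages:
        doc_type = classify_page(text, dictionary)
        if doc_type is not None or not documents:
            documents.append({"type": doc_type, "pages": [text]})
        else:
            documents[-1]["pages"].append(text)
    return documents


# Hypothetical master list: document type -> key phrase rule.
DICTIONARY = {
    "Appraisal": "Uniform Residential Appraisal Report",
    "Note": "Promissory&&Borrower^Rider",
}

# Hypothetical OCR text, one string per scanned page.
PAGES = [
    "Uniform Residential Appraisal Report - Subject",
    "Comparable sales, continued",
    "PROMISSORY NOTE. The Borrower promises to pay",
    "Signature page",
]

for doc in separate(PAGES, DICTIONARY):
    print(doc["type"], len(doc["pages"]))
```

In this sketch the "Note" rule requires both short keywords and rejects any page that also mentions "Rider", which mirrors the intent of the && and ^ guidelines.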


Processing Classified Documents

Classification is usually just the first step when processing documents. Once you have identified each document's type, you can perform more complex data extraction specific to that type.

In the Post-Process setting, enter the keyword %CLASSIFY% to automatically execute a job for each document type once classification is complete. This lets a single operator classify and index various types of documents in one workflow, and it greatly simplifies unattended processing of many document types.

The post process will search the Output folder for subfolders that contain files, then execute a job file that matches the folder name if it exists. The job files can be stored in the main configuration file folder or the root of the Output folder, as long as the job file name matches the subfolder name.
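
The lookup described above can be sketched in Python as follows. This is an illustration of the described behavior, not SimpleIndex code: the function name, the `.job` extension, and the order in which the two folders are searched are assumptions, so substitute the extension your job files actually use.

```python
from pathlib import Path


def find_jobs_to_run(output_dir, config_dir, job_ext=".job"):
    """Pair each non-empty subfolder of output_dir with a same-named job file.

    job_ext is a placeholder extension (an assumption of this sketch).
    Returns a list of (subfolder, job_file) tuples, sorted by subfolder.
    """
    output_dir, config_dir = Path(output_dir), Path(config_dir)
    jobs = []
    for sub in sorted(p for p in output_dir.iterdir() if p.is_dir()):
        # Skip subfolders that contain no files to process.
        if not any(f.is_file() for f in sub.iterdir()):
            continue
        # Look in the configuration folder first, then the Output root.
        # The lookup order is assumed; the source only names both locations.
        for folder in (config_dir, output_dir):
            candidate = folder / (sub.name + job_ext)
            if candidate.is_file():
                jobs.append((sub, candidate))
                break
    return jobs
```

Calling `find_jobs_to_run("Output", "Config")` with an `Output/Invoices` subfolder that contains files and a `Config/Invoices.job` file would return that single pair. Subfolders that are empty, or that have no matching job file, are skipped.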

Steps to Configure Automatic Classification and Indexing

In summary, follow these steps to create and process multiple document types in a single workflow:

  1. Create a classification job file that identifies document types and saves them to subfolders of the Output folder
  2. Enter %CLASSIFY% in the Post-Process setting for that job
  3. Create a document processing job file for each subfolder/document type
  4. Run the classification job. Any identified documents will be saved to subfolders and then the corresponding job for each will launch in sequence.

Shared Job Settings Files

You can also create a "Default" job settings file that contains the settings shared between all of the jobs, making it easy to update common paths or connection strings in dozens of jobs at once. If there is a job with the same name as the Output folder root, it will be used as the default job. The other job files only need to contain the XML for the settings that are different, usually the index fields. Edit the job files in Notepad to remove the shared settings elements.
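
The idea of layering job-specific settings over a shared default can be sketched in Python. This is only an illustration of the concept: the element names (`Job`, `OutputFolder`, `ConnectionString`, `IndexFields`, `Field`) are hypothetical and do not reflect the actual job file schema, and the shallow merge by top-level tag is an assumption about how overrides might be resolved.

```python
import xml.etree.ElementTree as ET


def merge_job_settings(default_xml, job_xml):
    """Return the default settings with job-specific elements layered on top.

    Assumed behavior: a shallow merge in which any top-level element
    present in the job file replaces all same-tag elements of the default.
    """
    merged = ET.fromstring(default_xml)
    job = ET.fromstring(job_xml)
    overridden = {child.tag for child in job}
    # Drop default elements that the job file redefines.
    for child in list(merged):
        if child.tag in overridden:
            merged.remove(child)
    # Append the job-specific elements.
    merged.extend(list(job))
    return merged


# Hypothetical shared settings: common paths and connection strings.
DEFAULT_XML = (
    "<Job>"
    "<OutputFolder>D:/Output</OutputFolder>"
    "<ConnectionString>server=db1</ConnectionString>"
    "</Job>"
)

# Hypothetical job file that keeps only the settings that differ.
JOB_XML = (
    "<Job><IndexFields>"
    "<Field>InvoiceNumber</Field><Field>InvoiceDate</Field>"
    "</IndexFields></Job>"
)

merged = merge_job_settings(DEFAULT_XML, JOB_XML)
print(ET.tostring(merged, encoding="unicode"))
```

Under these assumptions, changing a path or connection string in the default file changes it for every job that does not define its own value.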

Managing Multiple Server Jobs

When processing jobs in Server mode, each job needs to be added to the list and given its own schedule, and only a limited number can be run concurrently.

This limitation can be overcome by using a single job with the %CLASSIFY% post-process command. This job does not need to have any files to process; it only needs to point to an Output folder with a subfolder for each job that needs to run. Add this job to the Server manager on your desired schedule and it will execute all of the other jobs automatically each time it runs.

If the subfolders are used as the Input folder for each job, then the jobs will only run when the subfolder contains files to process. If the Input folder uses another path, you must place some file in the subfolder to ensure the job is launched each time. You can also remove the file to disable that job.
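
For jobs whose Input folder is elsewhere, placing and removing the placeholder file can be scripted. The following is a minimal Python sketch under stated assumptions: the function name and the marker file name `run.flag` are hypothetical, and any file in the subfolder would serve the same purpose.

```python
from pathlib import Path

MARKER_NAME = "run.flag"  # hypothetical placeholder file name


def set_job_enabled(output_dir, job_name, enabled):
    """Create or remove a placeholder file in the job's subfolder.

    Returns True if the placeholder exists after the call.
    """
    sub = Path(output_dir) / job_name
    marker = sub / MARKER_NAME
    if enabled:
        # Create the subfolder if needed, then an empty placeholder file.
        sub.mkdir(parents=True, exist_ok=True)
        marker.touch()
    else:
        # Removing the file leaves the subfolder empty, so the job is skipped.
        marker.unlink(missing_ok=True)
    return marker.exists()
```

For example, `set_job_enabled("Output", "Invoices", False)` removes the placeholder so that, if the subfolder holds no other files, the corresponding job is not launched on the next run.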

Configuring your Server environment this way makes processing more efficient and easier to manage. Only the jobs that need to run are launched on each processing interval, and jobs can be added or removed from the process by creating or deleting the corresponding subfolder.

SimpleIndex Document Classification Video