Document Classification

From Simple Wiki
Revision as of 08:07, 22 October 2022 by Aaron

SimpleIndex performs complex document classification tasks using a unique keyword matching system that identifies similar documents with different formatting, without any training.

Most classification systems use artificial intelligence to train a classification model based on many sample documents. While these can be easy to configure and very fast, there are a few drawbacks:

  • New examples need to be added to the training model before they can be recognized
  • Most documents must be separated into individual files before classification
  • Attachments and miscellaneous pages are not easily handled
  • Long documents such as contracts, appraisals, and reports are handled poorly

SimpleIndex can perform document separation and classification simultaneously, making it particularly well suited to documents like mortgage loans that often arrive as one large PDF but contain hundreds of individual documents.
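The idea of separating and classifying in a single pass can be illustrated with a generic sketch. This is not SimpleIndex's actual matching engine, which this page does not describe; the keyword lists, the type names, and the rule that a keyword match starts a new document are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical keyword phrases per document type (not taken from SimpleIndex).
KEYWORDS: Dict[str, List[str]] = {
    "Appraisal": ["uniform residential appraisal report"],
    "Note": ["promissory note", "promise to pay"],
}


def classify_page(text: str) -> Optional[str]:
    """Return the first document type whose keyword phrase occurs in the page text."""
    lowered = text.lower()
    for doc_type, phrases in KEYWORDS.items():
        if any(phrase in lowered for phrase in phrases):
            return doc_type
    return None


def separate_and_classify(pages: List[str]) -> List[Tuple[str, List[int]]]:
    """Group page indexes into typed documents in one pass over the pages.

    Simplifying assumption: a page that matches a keyword starts a new
    document, and pages without a match are appended to the current one.
    """
    documents: List[Tuple[str, List[int]]] = []
    for index, text in enumerate(pages):
        doc_type = classify_page(text)
        if doc_type is not None or not documents:
            documents.append((doc_type or "Unclassified", [index]))
        else:
            documents[-1][1].append(index)
    return documents
```

Given the OCR text of each page of one large PDF, `separate_and_classify` returns a list of (type, page indexes) pairs, so unmatched attachment pages stay with the document they follow.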

SimpleIndex also has a unique Post-Process workflow that lets you automatically execute document-specific workflows after classification.

Classifying Documents

Processing Classified Documents

Classification is usually just the first step when processing documents. Once each document's type has been identified, you can perform more complex data extraction specific to that type.

In the Post-Process setting, enter the keyword %CLASSIFY% to automatically execute a job for each document type once classification is complete.

The post process will search the Output folder for subfolders that contain files, then execute a job file that matches the folder name if it exists. The job files can be stored in the main configuration file folder, the root of the Output folder, or in each subfolder, as long as the job file name matches the subfolder name.
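The lookup described above can be sketched in Python as a way to preview which job would run for each subfolder. This is only an illustration and not SimpleIndex code: the `.sic` extension, the search order (subfolder, then Output root, then configuration folder), and the choice to ignore job files when deciding whether a subfolder "contains files" are assumptions.

```python
from pathlib import Path
from typing import Dict, Optional

JOB_EXTENSION = ".sic"  # assumed job file extension; adjust to your installation


def find_job_file(subfolder: Path, output_root: Path, config_dir: Path,
                  extension: str = JOB_EXTENSION) -> Optional[Path]:
    """Return the job file named after the subfolder, or None if none exists."""
    job_name = subfolder.name + extension
    # Assumed search order; the source only lists the three allowed locations.
    for location in (subfolder, output_root, config_dir):
        candidate = location / job_name
        if candidate.is_file():
            return candidate
    return None


def plan_post_process(output_root: Path, config_dir: Path,
                      extension: str = JOB_EXTENSION) -> Dict[str, Path]:
    """Map each non-empty Output subfolder to its matching job file."""
    plan: Dict[str, Path] = {}
    for subfolder in sorted(p for p in output_root.iterdir() if p.is_dir()):
        # Assumption: job files themselves do not count as classified documents.
        has_documents = any(
            child.is_file() and child.suffix.lower() != extension.lower()
            for child in subfolder.iterdir()
        )
        if not has_documents:
            continue
        job_file = find_job_file(subfolder, output_root, config_dir, extension)
        if job_file is not None:
            plan[subfolder.name] = job_file
    return plan
```

Calling `plan_post_process(Path("Output"), Path("Config"))` with your own folder paths returns a dictionary of subfolder names and the job file that matches each one. Subfolders that are empty or have no matching job are left out.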

You can also create a "Default" job settings file that contains the settings shared between all of the jobs, making it easy to update common paths or connection strings in dozens of jobs at once. If there is a job with the same name as the Output folder root, it will be used as the default job. The other job files only need to contain the XML for the settings that are different, usually the index fields. Edit the job files in Notepad to remove the shared settings elements.
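To picture how a trimmed job file relates to the Default job, the following Python sketch overlays the top-level elements of an override file onto a copy of the default settings. It is not how SimpleIndex is known to merge the files internally; the element names used in the usage example (`Job`, `OutputPath`, `Fields`) are hypothetical, and the real job file schema may differ.

```python
import xml.etree.ElementTree as ET


def merge_job_settings(default_xml: str, override_xml: str) -> ET.Element:
    """Return the default settings with top-level elements replaced by overrides.

    Assumption: a setting is overridden as a whole top-level element, so any
    default element with the same tag as an override element is dropped.
    """
    merged = ET.fromstring(default_xml)
    override = ET.fromstring(override_xml)
    # Remove every default element that the override file redefines.
    for element in override:
        for existing in merged.findall(element.tag):
            merged.remove(existing)
    # Append the override elements in their original order.
    for element in override:
        merged.append(element)
    return merged
```

For example, with hypothetical element names:

```python
default_job = "<Job><OutputPath>C:/Out</OutputPath><Fields><Field>Name</Field></Fields></Job>"
invoice_job = "<Job><Fields><Field>InvoiceNo</Field></Fields></Job>"
merged = merge_job_settings(default_job, invoice_job)
print(ET.tostring(merged, encoding="unicode"))
```

The shared `OutputPath` is kept from the default job, while the index fields come from the type-specific job file.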