Document Classification

From Simple Wiki

Revision as of 08:07, 22 October 2022

SimpleIndex performs complex document classification using a unique keyword matching system that identifies similar documents with different formatting, without any training.

Most classification systems use artificial intelligence to train a classification model based on many sample documents. While these can be easy to configure and very fast, there are a few drawbacks:

  • New examples need to be added to the training model before they can be recognized
  • Most documents must be separated into individual files before classification
  • Attachments and miscellaneous pages are not easily handled
  • Long documents such as contracts, appraisals, and reports are handled poorly

SimpleIndex is able to perform document separation and classification simultaneously, making it particularly suited to documents like mortgage loans that often arrive as one large PDF but contain hundreds of individual documents.
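
To illustrate the idea of separating and classifying in one pass, here is a minimal Python sketch. It is not SimpleIndex's actual matching logic, which this page does not describe. It assumes the OCR text of each page is already available, that each document type is defined by a list of keywords, and that a page matching a keyword starts a new document while unmatched pages stay attached to the document before them. The function name `split_and_classify` and the sample keywords are hypothetical.

```python
from typing import Dict, List, Tuple


def split_and_classify(
    pages: List[str],
    keywords: Dict[str, List[str]],
    unknown: str = "Unclassified",
) -> List[Tuple[str, List[int]]]:
    """Group page indexes into documents and label each document.

    Assumption: a page containing any keyword of a type begins a new
    document of that type; pages with no match continue the previous one.
    """
    documents: List[Tuple[str, List[int]]] = []
    for index, text in enumerate(pages):
        lowered = text.lower()
        # First document type with at least one keyword found on this page.
        doc_type = next(
            (
                name
                for name, words in keywords.items()
                if any(word.lower() in lowered for word in words)
            ),
            None,
        )
        if doc_type is not None or not documents:
            # A keyword hit (or the very first page) starts a new document.
            documents.append((doc_type or unknown, [index]))
        else:
            # No keyword: treat the page as an attachment or continuation.
            documents[-1][1].append(index)
    return documents


if __name__ == "__main__":
    sample_pages = [
        "Promissory Note for the loan",
        "continued terms and signatures",
        "Uniform Residential Appraisal Report",
        "property photos",
    ]
    sample_keywords = {
        "Note": ["promissory note"],
        "Appraisal": ["appraisal report"],
    }
    for doc_type, page_numbers in split_and_classify(sample_pages, sample_keywords):
        print(doc_type, page_numbers)
```

In this sketch, attachments and miscellaneous pages simply inherit the type of the document they follow, which is one way a single large PDF can be split without separating it into individual files first.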

SimpleIndex also has a unique Post-Process workflow that lets you automatically execute document-specific workflows after classification.

Classifying Documents

Processing Classified Documents

Classification is usually just the first step when processing documents. Once each document has been identified, you can perform more complex data extraction specific to its type.

In the Post-Process setting, enter the keyword %CLASSIFY% to automatically execute a job for each document type once classification is complete.

The post-process searches the Output folder for subfolders that contain files, then executes the job file whose name matches the folder name, if one exists. Job files can be stored in the main configuration file folder, the root of the Output folder, or in each subfolder, as long as the job file name matches the subfolder name.
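
The lookup described above can be sketched in Python as follows. This is only an illustration of the folder-to-job matching, not SimpleIndex code. The function name `find_job_files`, the `.xml` extension, and the search order (configuration folder, then Output root, then the subfolder itself) are assumptions; the page lists the three locations but does not state a file extension or which location takes precedence.

```python
from pathlib import Path
from typing import Dict, Optional


def find_job_files(
    output_root: Path,
    config_dir: Path,
    job_ext: str = ".xml",  # assumed extension, adjust to your job files
) -> Dict[str, Optional[Path]]:
    """Map each non-empty Output subfolder to its matching job file.

    The value is None when no job file with the subfolder's name exists.
    """
    output_root = Path(output_root)
    config_dir = Path(config_dir)
    matches: Dict[str, Optional[Path]] = {}
    for subfolder in sorted(p for p in output_root.iterdir() if p.is_dir()):
        job_name = subfolder.name + job_ext
        # Only subfolders holding files are processed; a job file stored
        # inside the subfolder does not count as a document here.
        has_documents = any(
            f.is_file() and f.name != job_name for f in subfolder.iterdir()
        )
        if not has_documents:
            continue
        # Assumed precedence: config folder, Output root, then the subfolder.
        candidates = [
            config_dir / job_name,
            output_root / job_name,
            subfolder / job_name,
        ]
        matches[subfolder.name] = next(
            (c for c in candidates if c.is_file()), None
        )
    return matches
```

Calling `find_job_files(Path("Output"), Path("Config"))` on a hypothetical layout returns one entry per subfolder that contains documents, which makes it easy to see which document types have no job file yet.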

You can also create a "Default" job settings file that contains the settings shared by all of the jobs, making it easy to update common paths or connection strings in dozens of jobs at once. If a job has the same name as the Output folder root, it is used as the default job. The other job files then only need to contain the XML for the settings that differ, usually the index fields. To set this up, open each job file in a text editor such as Notepad and remove the shared settings elements.
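
To make the default-plus-override idea concrete, the Python sketch below layers a job-specific XML fragment over a default one. It only illustrates the concept; how SimpleIndex actually combines the files is not documented on this page. The function name `merge_job_settings` and the element names `Job`, `OutputPath`, and `Fields` are hypothetical, and the sketch assumes overrides happen at the level of the root element's direct children.

```python
import copy
import xml.etree.ElementTree as ET


def merge_job_settings(default_xml: str, job_xml: str) -> str:
    """Return the default settings with the job's own elements layered on top.

    Assumption: any direct child element present in the job file replaces
    the element(s) with the same tag in the default file.
    """
    default_root = ET.fromstring(default_xml)
    job_root = ET.fromstring(job_xml)
    merged = copy.deepcopy(default_root)
    for override in job_root:
        # Drop the shared version of this setting, if the default has one.
        for existing in merged.findall(override.tag):
            merged.remove(existing)
        merged.append(copy.deepcopy(override))
    return ET.tostring(merged, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical element names, for illustration only.
    default_job = (
        "<Job><OutputPath>D:/Archive</OutputPath>"
        "<Fields><Field>DocType</Field></Fields></Job>"
    )
    invoice_job = "<Job><Fields><Field>InvoiceNumber</Field></Fields></Job>"
    print(merge_job_settings(default_job, invoice_job))
```

Here the invoice job keeps the shared output path from the default job and supplies only its own index fields, which mirrors the trimmed job files described above.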