OMR and OCR Document Separation

Standard Document Separation

For most jobs, document separation happens automatically as you index documents. When unique values are read via OCR or barcodes and assigned to index fields, a unique export filename is generated and the documents are separated automatically. This works as long as a unique value can be read on the first page of each document and no false-positive values appear on the other pages.

However, in some cases it is more efficient to separate the documents into files before processing them, particularly when documents can contain a variable number of attachments whose content is unknown and the data being extracted does not follow a unique pattern.

OMR Based Separation

SimpleIndex offers a different approach to determining where the first page of a new document starts. Traditionally, barcode separator sheets are inserted during document prep to mark the start of a new document. Inserting a sheet before every file is wasteful and time-consuming, especially if the files are only a few pages long.

SimpleIndex takes advantage of OMR technology to provide an easier solution to this problem. Simply take a felt pen and make a black mark in the upper-left corner of the first page of each new document. SimpleIndex then scans directly to numbered multi-page files, starting a new file each time it detects a mark (a separation event). These files can then be indexed and exported with a second SimpleIndex job.
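
The idea behind corner-mark detection can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SimpleIndex's actual OMR engine: the function name `has_corner_mark`, the 10% corner region, and both thresholds are hypothetical, and the page is assumed to be a plain 2D list of grayscale values (0 = black, 255 = white).

```python
def has_corner_mark(page, region_percent=10, dark_threshold=128, min_dark_ratio=0.5):
    """Return True if the upper-left corner of a grayscale page is mostly dark.

    page: 2D list of pixel values, 0 (black) to 255 (white).
    All parameter defaults are illustrative assumptions, not SimpleIndex settings.
    """
    height = len(page)
    width = len(page[0]) if height else 0
    if height == 0 or width == 0:
        return False

    # Size of the corner zone to inspect, at least one pixel in each direction.
    rows = max(1, height * region_percent // 100)
    cols = max(1, width * region_percent // 100)

    # Count pixels in the zone that are darker than the threshold.
    dark = sum(
        1
        for r in range(rows)
        for c in range(cols)
        if page[r][c] < dark_threshold
    )
    return dark / (rows * cols) >= min_dark_ratio
```

A page for which this check returns True would be treated as the first page of a new file. Requiring a minimum ratio of dark pixels, rather than any dark pixel at all, keeps stray specks and punch holes from triggering a separation.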

OCR Based Separation

SimpleIndex can also use OCR to locate the first page of a new document by finding unique keywords or patterns of text on the page. If every document starts with the same standard page, this method can identify it without any additional document preparation.

If the first page is not standardized, you can create a list of possible keyword combinations that trigger the separation.

You can also use OCR based Document Classification to separate documents based on type.

An example where this could be useful is invoices that have a lot of attachments. If fields like Date and Total are read on every page, the attachments will produce many false positive values and processing will be slower. By triggering separation based on keywords like "Invoice" or "Factura" you can separate them into multipage files first, then read the invoice details from only the first page of each file.
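
The keyword trigger described above can be sketched as follows. This is a simplified illustration, not SimpleIndex's implementation: `split_on_keywords` is a hypothetical name, the OCR text of each page is assumed to be available already as a string, and the first page is assumed to always start a document even if no keyword is found on it.

```python
import re


def split_on_keywords(page_texts, keywords=("Invoice", "Factura")):
    """Group 1-based page numbers into documents, starting a new document
    whenever a page's OCR text contains one of the trigger keywords."""
    # Match any keyword as a whole word, ignoring case.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    documents = []
    for number, text in enumerate(page_texts, start=1):
        # A keyword hit (or the very first page) opens a new document.
        if pattern.search(text) or not documents:
            documents.append([])
        documents[-1].append(number)
    return documents


pages = [
    "Invoice 1001  Total 50.00",
    "Packing slip",
    "Terms and conditions",
    "Factura 2002  Total 75.00",
    "Receipt copy",
]
print(split_on_keywords(pages))  # [[1, 2, 3], [4, 5]]
```

Note that an attachment containing the word "Invoice" would also trigger a separation in this sketch, so the keywords or patterns should be ones that appear only on first pages.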

Creating Automatic Separation Jobs

The Autonumber page describes how to configure OCR and OMR based automatic document separation.

The separation event triggers an increment in the Autonumber field, which results in a unique numbered multi-page file when exporting.
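
As a rough model of this behavior, the sketch below turns a list of per-page separation flags into export filenames. It is an assumption-laden illustration only: the function name `autonumber_filenames`, the six-digit zero padding, and the `.pdf` extension are hypothetical and do not reflect SimpleIndex's actual naming settings.

```python
def autonumber_filenames(separation_flags, start=1, width=6, extension="pdf"):
    """Assign each page the filename of the document it belongs to.

    separation_flags: one boolean per page, True where a separation
    event (OMR mark or OCR keyword) was detected on that page.
    """
    counter = start - 1
    names = []
    for index, is_first_page in enumerate(separation_flags):
        # Each separation event increments the counter; the first page
        # always starts a document even without a detected event.
        if is_first_page or index == 0:
            counter += 1
        names.append(f"{counter:0{width}d}.{extension}")
    return names


flags = [True, False, False, True, False]
print(autonumber_filenames(flags))
# ['000001.pdf', '000001.pdf', '000001.pdf', '000002.pdf', '000002.pdf']
```

Pages that receive the same filename end up in the same multi-page file, which is how the incrementing number produces one file per document.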

Use the "Combine pages into documents after processing" option to merge the pages into multi-page files before the indexing step starts, so that separation and indexing happen in a single job workflow. Otherwise, use the Post-Process setting to run a second job file that processes the separated documents.

Checkbox Recognition with OMR Video

This video was recorded in a previous version of SimpleIndex. Refer to the wiki documentation for the latest updates.
