Zone OCR and Dynamic OCR

Many document scanning solutions use Zone OCR to obtain index data from the page. SimpleIndex improves upon this time-tested but ultimately limited model with its Dynamic OCR feature. Let's look at the difference between the two methods:

Zone OCR

Zone OCR is used to read document indexes or tags from text on the page. It is a great way to automate the data entry associated with scanning documents.

However, Zone OCR has several limitations that must be overcome:

  • Index information must be in the exact same place on every page
  • Documents shift and skew during scanning, causing the zones to not line up
  • If surrounding lines or text on the document are too close, they can encroach on the zone

Dynamic OCR

SimpleIndex overcomes these limitations by using Dynamic OCR technology to locate the desired text even when it moves around on the page. Our simplified version of Dynamic OCR works great for many types of documents at a fraction of the cost of other solutions.

  • Index information can appear anywhere on any page
  • Unwanted characters are automatically ignored
  • Find unique patterns of letters and numbers using Template Matching
    (Social Security #, Date, etc.)
  • Use Dictionary Matching to find a value from a list of possible values
    (Vendor Name, Document Type, etc.)

 

Download document scanning and OCR software.

Dynamic OCR Examples

Check out our demos page to see several videos showing examples of how Dynamic OCR is applied to different types of documents.

In the video, SimpleIndex handles a typical Zone OCR example. With SimpleIndex you can use large zones that leave a wide margin for error. Template and Dictionary Matching are then used to extract the 7-digit Account Number, 6-digit Order Number and Company Name. SimpleIndex discards the surrounding text and keeps the correct value.

Another common example is finding a unique identifier, for example a social security number, that could appear anywhere on the page. Simply enter the template ###-##-#### and SimpleIndex will search the full OCR text until it finds a match. Since only one social security number is likely to appear on the page, a match on this pattern is almost certainly the required value.
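One way to picture template matching is as a translation from the template into a regular expression. The sketch below is an illustration only: it assumes that # stands for a single digit and that every other character must match literally, which is inferred from the ###-##-#### example rather than taken from the SimpleIndex documentation. The names template_to_regex and find_first_match are hypothetical helpers, not SimpleIndex functions.

```python
import re

def template_to_regex(template: str) -> str:
    """Translate a '#'-style template into a regex.

    Assumed semantics: '#' means one digit, anything else is literal.
    """
    parts = []
    for ch in template:
        parts.append(r"\d" if ch == "#" else re.escape(ch))
    # Lookarounds keep the pattern from matching inside a longer run of digits.
    return r"(?<!\d)" + "".join(parts) + r"(?!\d)"

def find_first_match(template: str, ocr_text: str):
    """Return the first substring of ocr_text that fits the template, or None."""
    match = re.search(template_to_regex(template), ocr_text)
    return match.group(0) if match else None

# Hypothetical full-page OCR text with unrelated numbers around the target.
sample_text = "Employee: J. Smith  Phone 555-0100  SSN 123-45-6789  Dept 42"
print(find_first_match("###-##-####", sample_text))
```

Because the whole page is searched, the position of the value does not matter; only its shape does.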

With dictionary matching, you can give SimpleIndex a list of possible values and it will automatically search the zone or page for each value until it finds a match.
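Dictionary matching can be sketched in a few lines. The example below assumes a case-insensitive substring search that ignores line breaks in the OCR output; the actual matching rules in SimpleIndex may differ. The function name dictionary_match and the vendor list are made up for illustration.

```python
def dictionary_match(candidates, ocr_text):
    """Return the first candidate found in ocr_text, or None.

    Assumption: matching is case-insensitive and whitespace-insensitive.
    """
    # Collapse whitespace so a line break in the OCR output does not split a value.
    normalized = " ".join(ocr_text.split()).lower()
    for value in candidates:
        if " ".join(value.split()).lower() in normalized:
            return value
    return None

# Hypothetical list of possible values and OCR text from one page.
vendors = ["Acme Supply Co", "Globex Corporation", "Initech"]
page_text = "INVOICE\nBill from: Globex\nCorporation\nTotal due: $1,204.00"
print(dictionary_match(vendors, page_text))
```

Returning the value from the list, rather than the text as it was read, means the index always holds the canonical spelling.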

Many data capture applications can be implemented using these simple algorithms. This makes SimpleIndex far more versatile than other desktop OCR solutions that require the index value to be in the exact same location on every page. You get the capabilities normally reserved for Enterprise Data Capture Systems, yet SimpleIndex costs only a fraction of the price!

SimpleIndex's Dynamic OCR can greatly speed up data entry by eliminating a good percentage of indexing work. For many organizations, this brings the labor cost of scanning within reach.

Amazon AWS Textract Cloud OCR

With Textract you can capture data from almost any type of form, including handwritten ones! Textract identifies labeled text anywhere on the document and returns the label text along with the corresponding value. Map the labels to index fields in SimpleIndex and you are ready to capture that data no matter where it appears on the page.
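To make the label-to-field mapping concrete, here is a minimal sketch that walks a Textract-style list of blocks and pairs each label with its value. It assumes the block layout of a Textract forms analysis response (KEY_VALUE_SET blocks linked to WORD blocks through VALUE and CHILD relationships). The sample blocks are hand-written rather than real Textract output, and the index field name AccountNo is hypothetical. This is not a description of how SimpleIndex performs the mapping internally.

```python
def _text_of(block, by_id):
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)

def key_value_pairs(blocks):
    """Return {label text: value text} for every KEY block in the list."""
    by_id = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(_text_of(by_id[i], by_id) for i in rel["Ids"])
        pairs[_text_of(block, by_id)] = value_text
    return pairs

def map_to_index_fields(pairs, label_map):
    """Map form labels to index field names, ignoring case and a trailing colon."""
    normalized = {k.strip().rstrip(":").lower(): v for k, v in pairs.items()}
    return {field: normalized.get(label.lower(), "")
            for label, field in label_map.items()}

# Hand-written sample in the assumed block layout.
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Account"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "1234567"},
]

pairs = key_value_pairs(sample_blocks)
print(map_to_index_fields(pairs, {"Account Number": "AccountNo"}))
```

Since the pairing follows the relationships between blocks and not page coordinates, the same mapping applies wherever the label appears on the form.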

Textract uses machine learning with a huge model based on the billions of pages processed using Textract to provide the most accurate OCR and form field extraction solution available.

By default, Textract is only available as an API and requires custom coding to integrate it into your document workflows. SimpleIndex turns it into a fully featured batch document and data processing app that is ready to use out of the box.

Since there are no templates to configure or train, setup can be done in hours instead of the days, weeks or months required by other enterprise data capture solutions.

Pay-as-you-go pricing makes SimpleIndex with Textract the most affordable way to batch process forms for projects with fewer than 50,000 pages per year, especially if you need to read handwriting or have forms with many layout variations.

Wiki: How to configure AWS Textract OCR in SimpleIndex

Supports PDF and MS Office Documents

Every year, more and more "digital born" documents are used in business workflows without the need for paper and scanning. SimpleIndex's Dynamic OCR works perfectly with digital documents. PDF files and MS Office documents that have digitized text are detected automatically. Template and Dictionary pattern matching can then be used to capture data from the text automatically, without the need for OCR.

Processing digital text is lightning fast and 100% accurate. SimpleIndex will even detect scanned PDF files and OCR them automatically, so you can process any kind of file with the same workflow. Find out more!

MS Office Document OCR Text Parsing Video

 

Dynamic OCR can also be applied to MS Office and PDF files, creating a fully automated process for intelligently indexing and reorganizing electronic documents.

 

Supports Regular Expressions

Use Regular Expressions to extract index data from OCR text, PDF and Office documents.

SimpleIndex OCR has a simple built-in template format, as well as support for Regular Expressions. Regular Expressions (RegEx for short) let you define complex search patterns to extract matching values from the text.  This greatly enhances the functionality of the dynamic OCR in SimpleIndex, making it capable of finding variable-length fields with no distinct pattern.
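As an illustration of a variable-length field, the sketch below pulls an invoice number and a vendor name out of some text by anchoring on their labels. It uses Python's re syntax; the regular expression dialect accepted by SimpleIndex may differ in details, and the patterns, field names and sample text here are assumptions made up for the example.

```python
import re

# Label, optional separator, then a run of letters, digits and hyphens of any length.
INVOICE_NO = re.compile(
    r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
)
# Everything after "Vendor:" up to the end of that line.
VENDOR = re.compile(r"^Vendor\s*:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)

def extract_fields(text):
    """Return the fields that could be found in the text."""
    fields = {}
    invoice = INVOICE_NO.search(text)
    if invoice:
        fields["invoice_number"] = invoice.group(1)
    vendor = VENDOR.search(text)
    if vendor:
        fields["vendor"] = vendor.group(1)
    return fields

# Hypothetical OCR or digital text from one document.
sample = "Vendor: Northwind Traders Ltd.\nInvoice No. INV-2024-00317\nDate: 03/14/2024"
print(extract_fields(sample))
```

Neither value has a fixed length or shape, so a fixed template would not fit; the label in front of the value is what the pattern relies on.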

Regular Expressions are commonly used in text parsing applications. The Perl programming language makes extensive use of RegEx, as do UNIX utilities like "grep". Many programmers and IT personnel are already familiar with RegEx and can create complex expressions without specific training.

Click here for a reference guide to Regular Expressions

New OCR Features in Version 10

SimpleIndex 10 includes major upgrades to the OCR and Bar Code engines:

  • Amazon Textract Cloud OCR option added, with settings for Text, Forms and Invoice & Receipt extraction.
  • FineReader Engine has been upgraded to version 11, offering improved accuracy and speed when processing large documents.
  • Full-page OCR to Word (docx), Rich Text (rtf), Open Office (odt), Excel (xlsx), PowerPoint (pptx), ePub Zip (epub), FictionBook (fb2), HTML (htm), XML (xml), Alto XML (alto.xml).
  • MRC Compression for PDF files (Mixed Raster Content).
  • OCR language pack includes all available Tesseract languages including Hindi, Tamil, Arabic, Chinese, Thai, Vietnamese, Japanese, Korean, Indonesian, Hebrew and many more.

How to Configure SimpleIndex OCR

The wiki documentation has a number of pages describing how to configure OCR for various types of documents and data capture scenarios.

 

Watch this Simple Software University training video to see how to configure and run an OCR job with SimpleIndex.

Please note: this video was recorded with a previous version. Consult the wiki documentation for the latest updates.
