SimpleIndex - Introduction to OCR and Barcode Recognition Video

Barcodes and OCR in SimpleIndex: A Step-by-Step Guide

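This guide introduces barcode recognition, Optical Character Recognition (OCR), and reading PDF text layers in SimpleIndex. For each, it covers the basic concept, job configuration, available options, and how the recognized values are used in output. Bracketed timestamps refer to the accompanying video.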

Part 1: Barcodes

  1. Introduction to Barcodes
    • Barcodes offer high accuracy in document processing. SimpleIndex either reads a barcode correctly or does not read it at all, so it is always clear whether a document was processed successfully [00:30].
  2. Examples of Barcode Use
    • Using existing barcodes on documents for indexing [01:00].
    • Using cover sheets with barcodes for batch processing, which is especially useful for archiving back files [01:17].
  3. Job Configuration for Barcodes
    • Configure SimpleIndex to read barcodes from the entire page. Use templates (e.g., three digits and three letters) to ensure that each barcode value is assigned to the correct index field [02:31].
    • Alternatively, define specific areas (zones) in which to look for barcodes. Templates are generally faster and more flexible, particularly when barcode positions vary from page to page [03:12].
  4. Barcode Options
    • SimpleIndex includes multiple barcode engines and a "voting" capability to improve accuracy for low-quality barcodes by sampling from different engines [03:44].
    • Specify barcode types (e.g., Code 39) to speed up processing and ignore unwanted barcode types (e.g., QR codes) [04:29].
  5. Output with Barcodes
    • Use barcode values to create folder structures and file names for organized document storage [05:44].
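
The template and voting ideas above can be sketched in Python. This is an illustration only, not SimpleIndex configuration: the regular expressions stand in for SimpleIndex templates (whose actual syntax is not shown here), the field names `CaseID` and `BatchDate` are hypothetical, and the strict-majority rule in `vote` is an assumption about how results from several engines might be combined.

```python
import re
from collections import Counter

# Hypothetical field templates. These regexes are stand-ins for
# SimpleIndex templates, not its real template syntax.
FIELD_TEMPLATES = {
    "CaseID": re.compile(r"\d{3}[A-Z]{3}"),  # three digits then three letters (order assumed)
    "BatchDate": re.compile(r"\d{8}"),       # eight digits
}

def assign_fields(barcode_values):
    """Give each barcode value to the first unfilled field whose template it fully matches."""
    fields = {}
    for value in barcode_values:
        for name, pattern in FIELD_TEMPLATES.items():
            if name not in fields and pattern.fullmatch(value):
                fields[name] = value
                break  # values matching no template are simply ignored
    return fields

def vote(readings):
    """Return the value a strict majority of engines agree on, else None.

    `readings` holds one entry per engine; None means that engine read nothing.
    """
    counts = Counter(r for r in readings if r is not None)
    if not counts:
        return None
    value, n = counts.most_common(1)[0]
    return value if n > len(readings) / 2 else None
```

Because values are matched by pattern rather than by position, `assign_fields(["20240115", "QR-JUNK", "123ABC"])` fills both fields no matter where each barcode sits on the page, and the unmatched value is dropped.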

Part 2: Optical Character Recognition (OCR)

  1. Introduction to OCR
    • OCR reads text directly from the page [07:55].
  2. Job Configuration for OCR
    • Define zones to specify areas on the page for OCR to extract information like account numbers, order numbers, and company names [08:51].
    • Use templates to extract specific data patterns (e.g., 7-digit account numbers, 6-digit order numbers) [09:10].
    • Use dictionary matching to compare extracted text against a predefined list (e.g., a database of company names) to return an index value [09:34].
  3. OCR Options
    • SimpleIndex offers several OCR engines, including the ABBYY FineReader engine (the professional engine), which handles standard OCR and specialized handprint recognition [10:30].
    • Integration with Amazon Textract, an Amazon Web Services (AWS) service, provides advanced OCR capabilities, including handwriting recognition (even cursive) and form/invoice processing that identifies common fields [11:38].
  4. Output with OCR
    • Use OCR-extracted values to create folder structures and file names for organizing documents [13:16].
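
As a rough illustration of template extraction and dictionary matching, the Python sketch below pulls 7-digit and 6-digit numbers out of zone text and matches a noisy company name against a known list. It is not SimpleIndex's implementation: the company list is made up, the fuzzy comparison uses Python's standard `difflib`, and the `cutoff` value is an arbitrary assumption.

```python
import re
from difflib import get_close_matches

# Regex stand-ins for the templates described above.
ACCOUNT_TEMPLATE = re.compile(r"\b\d{7}\b")  # 7-digit account number
ORDER_TEMPLATE = re.compile(r"\b\d{6}\b")    # 6-digit order number

# Hypothetical dictionary; in practice this might come from a database.
KNOWN_COMPANIES = ["Acme Supply", "Globex Corporation", "Initech"]

def extract(template, zone_text):
    """Return the first substring of the zone text that fits the template, or None."""
    match = template.search(zone_text)
    return match.group(0) if match else None

def match_company(ocr_text, cutoff=0.8):
    """Return the known company name closest to the OCR text, or None if none is close enough."""
    matches = get_close_matches(ocr_text.strip(), KNOWN_COMPANIES, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Matching against a dictionary means a small OCR error, such as `match_company("Globex Corporatlon")`, still resolves to the clean value `"Globex Corporation"` instead of producing a misspelled index value.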

Part 3: Processing PDFs with Text Layers

  1. Direct Value Reading
    • SimpleIndex can directly read values from PDFs that have a text layer (born-digital documents) without needing OCR conversion [13:29].
  2. Templating for PDFs
    • Use regular expressions for more complex pattern matching (e.g., document numbers) [14:51].
    • Handle various date formats using "OR" conditions in templates [15:10].
    • Embed choices directly into the template field for document types (e.g., "estimate" or "invoice") [15:35].
    • Use database matching for customer names [15:56].
  3. Output for PDFs
    • Use extracted values from PDFs to create multi-level folder structures and detailed file names [16:28].
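
The pattern-matching steps above can be approximated with ordinary regular expressions. In this Python sketch the document-number format (`XX-00000`), the two date layouts (with `MM/DD/YYYY` order assumed for slash dates), and the folder layout are all assumptions chosen for illustration. The text is taken as an already-extracted string, and the customer name is passed in as if it had come from a database match.

```python
import re
from pathlib import PurePosixPath

# "OR" between two date layouts: 03/15/2024 or 2024-03-15.
DATE = re.compile(r"\b(?:(\d{2})/(\d{2})/(\d{4})|(\d{4})-(\d{2})-(\d{2}))\b")
# Choices embedded directly in the pattern.
DOC_TYPE = re.compile(r"\b(estimate|invoice)\b", re.IGNORECASE)
# Hypothetical document-number format: two letters, a dash, five digits.
DOC_NUMBER = re.compile(r"\b[A-Z]{2}-\d{5}\b")

def index_text(text):
    """Extract type, number, and an ISO-formatted date; None if any is missing."""
    date = DATE.search(text)
    doc_type = DOC_TYPE.search(text)
    number = DOC_NUMBER.search(text)
    if not (date and doc_type and number):
        return None
    if date.group(1):  # slash layout matched (month first, by assumption)
        month, day, year = date.group(1, 2, 3)
    else:              # dashed layout matched
        year, month, day = date.group(4, 5, 6)
    return {
        "type": doc_type.group(1).lower(),
        "number": number.group(0),
        "date": f"{year}-{month}-{day}",
    }

def output_path(fields, customer):
    """Build a multi-level folder path and a descriptive file name."""
    year = fields["date"][:4]
    file_name = f'{fields["number"]}_{fields["date"]}.pdf'
    return PurePosixPath(customer, year, fields["type"], file_name)
```

For example, the text `"INVOICE IN-00421 issued 03/15/2024"` with customer `"Acme Supply"` yields the path `Acme Supply/2024/invoice/IN-00421_2024-03-15.pdf`.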

Conclusion

Barcodes and OCR are powerful tools for automating document capture. SimpleIndex version 11 includes advanced features like duplicate document identification [16:46].