SimpleIndex - Introduction to OCR and Barcode Recognition Video
Barcodes and OCR in SimpleIndex: A Step-by-Step Guide
This guide introduces barcodes and Optical Character Recognition (OCR) in SimpleIndex, focusing on their concepts and applications.
Part 1: Barcodes
- Introduction to Barcodes
- Barcodes offer high accuracy in document processing. SimpleIndex either reads a barcode correctly or does not read it at all, so each document has a clear processing status [00:30].
- Examples of Barcode Use
- Job Configuration for Barcodes
- Configure SimpleIndex to read barcodes from an entire page. Use templates (e.g., three digits and three letters) so that each barcode value is assigned to the correct index field [02:31].
- Alternatively, define specific areas (zones) in which to look for barcodes. Templates are generally faster, and they are more flexible when a barcode's position varies from page to page [03:12].
- Barcode Options
- Output with Barcodes
- Use barcode values to create folder structures and file names for organized document storage [05:44].
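The template and output steps above can be sketched in Python. This is only an illustration of the idea, not SimpleIndex's own template syntax or configuration: the pattern (three digits followed by three uppercase letters), the function names `assign_field` and `output_path`, and the folder layout are all assumptions made for the example.

```python
import re
from pathlib import PurePosixPath

# Hypothetical template: three digits followed by three uppercase letters.
TEMPLATE = re.compile(r"[0-9]{3}[A-Z]{3}")

def assign_field(barcodes):
    """Return the first barcode value on the page that fits the template, or None."""
    for value in barcodes:
        if TEMPLATE.fullmatch(value):
            return value
    return None

def output_path(root, value):
    """Build a folder and file name from a matched value (assumed layout)."""
    # Folder = the three-digit prefix, file name = the full barcode value.
    return PurePosixPath(root) / value[:3] / f"{value}.pdf"

# A page may carry several barcodes; only the one matching the template is used.
page_barcodes = ["9781234567897", "123ABC", "456DEF"]
field = assign_field(page_barcodes)
print(field, output_path("archive", field))
```

Because the whole page is scanned and the value is picked by its pattern, the barcode can sit anywhere on the page, which is the flexibility the template approach is meant to provide.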
Part 2: Optical Character Recognition (OCR)
- Introduction to OCR
- OCR reads text directly from the page [07:55].
- Job Configuration for OCR
- Define zones to specify areas on the page for OCR to extract information like account numbers, order numbers, and company names [08:51].
- Use templates to extract specific data patterns (e.g., 7-digit account numbers, 6-digit order numbers) [09:10].
- Use dictionary matching to compare extracted text against a predefined list (e.g., a database of company names) to return an index value [09:34].
- OCR Options
- SimpleIndex offers several OCR engines, including the ABBYY FineReader engine (the professional engine) for standard OCR and specialized handprint recognition [10:30].
- Integration with Amazon Web Services (AWS) Textract provides advanced OCR capabilities, including handwriting (even cursive) and form/invoice processing to identify common fields [11:38].
- Output with OCR
- Use OCR-extracted values to create folder structures and file names for organizing documents [13:16].
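As a rough illustration of the template and dictionary-matching steps above, the Python sketch below pulls a 7-digit account number and a 6-digit order number out of zone text, then compares a noisy OCR result against a list of known company names. It does not reproduce how SimpleIndex does this internally: the company list, the `match_company` helper, and the 0.8 similarity cutoff are assumptions, and the fuzzy comparison uses the standard library's `difflib` as a stand-in for whatever matching the product applies.

```python
import difflib
import re

# Templates from the examples above: 7-digit account, 6-digit order number.
ACCOUNT = re.compile(r"\b\d{7}\b")
ORDER = re.compile(r"\b\d{6}\b")

# Hypothetical dictionary; in practice this could come from a database.
COMPANIES = ["Acme Corporation", "Globex Industries", "Initech LLC"]

def match_company(ocr_text, names=COMPANIES, cutoff=0.8):
    """Return the closest known company name for OCR text, or None."""
    lookup = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(
        ocr_text.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None

zone_text = "Account No: 1234567  Order 654321"
print(ACCOUNT.search(zone_text).group())  # account number field
print(ORDER.search(zone_text).group())    # order number field
print(match_company("Acme Corporatlon"))  # OCR misread "i" as "l"
```

Returning the dictionary entry rather than the raw OCR text means a small misread still yields a clean, consistent index value, while text that resembles nothing in the list returns no match at all.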
Part 3: Processing PDFs with Text Layers
- Direct Value Reading
- SimpleIndex can directly read values from PDFs that have a text layer (born-digital documents) without needing OCR conversion [13:29].
- Templating for PDFs
- Use regular expressions for more complex pattern matching (e.g., document numbers) [14:51].
- Handle various date formats using "OR" conditions in templates [15:10].
- Embed choices directly into the template field for document types (e.g., "estimate" or "invoice") [15:35].
- Use database matching for customer names [15:56].
- Output for PDFs
- Use extracted values from PDFs to create multi-level folder structures and detailed file names [16:28].
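The templating ideas above (regular expressions, "OR" conditions for dates, and embedded choices for document type) can be approximated with ordinary regular expressions. The sketch below is written in Python and is not SimpleIndex template syntax; the three date formats, the document number pattern (two or three letters, a hyphen, then at least four digits), and the `extract` helper are assumptions chosen for the example.

```python
import re

# "OR" condition: any one of three assumed date formats may appear.
DATE = re.compile(
    r"\b(?:\d{1,2}/\d{1,2}/\d{4}"      # 3/14/2024
    r"|\d{4}-\d{2}-\d{2}"              # 2024-03-14
    r"|[A-Z][a-z]+ \d{1,2}, \d{4})\b"  # March 14, 2024
)

# Choices embedded directly in the pattern.
DOC_TYPE = re.compile(r"\b(estimate|invoice)\b", re.IGNORECASE)

# Hypothetical document number format, e.g. INV-00421.
DOC_NUMBER = re.compile(r"\b[A-Z]{2,3}-\d{4,}\b")

def extract(text):
    """Return the first date, document type, and document number found in text."""
    date = DATE.search(text)
    doc_type = DOC_TYPE.search(text)
    number = DOC_NUMBER.search(text)
    return {
        "date": date.group() if date else None,
        "doc_type": doc_type.group().lower() if doc_type else None,
        "number": number.group() if number else None,
    }

# Text as it might be read from a PDF text layer.
print(extract("INVOICE INV-00421 dated March 14, 2024 for Acme"))
print(extract("Estimate EST-1001, 2024-03-14"))
```

Each field falls back to `None` when nothing matches, so a document with a missing value can be flagged for review instead of being filed under a wrong name.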
Conclusion
Barcodes and OCR are powerful tools for automating document capture. SimpleIndex version 11 includes advanced features like duplicate document identification [16:46].