Screen Scraping is a commonly used method for transferring data from one application to another by using OCR to read text from the application window.
The <OCR_TEXT_TYPE> setting changes the OCR recognition font or type from its default, “To Be Detected”. Use it to look for a specific type of OCR font; it is especially useful for recognizing fonts such as Dotmatrix, OCR A and OCR B.
Instructions for setting OCR Font:
1. Right-click the .sic file and choose Open With, then select a text editor (Notepad, WordPad, etc.).
2. Find <OCR_TEXT_TYPE>. If you can’t find <OCR_TEXT_TYPE>, add the following as the last row in the text file: <OCR_TEXT_TYPE>#</OCR_TEXT_TYPE>
3. Change the number between the tags (shown as # here): <OCR_TEXT_TYPE>#</OCR_TEXT_TYPE>
4. Use the number of the desired font:
- 0 Normal
- 1 Typewriter
- 2 Dotmatrix
- 3 Index
- 5 OCR A
- 6 OCR B
- 7 MICR E13B
- 8 MICR CMC7
- 9 Gothic
- 10 To Be Detected
5. Save and close the file.
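For anyone who prefers to script the edit, here is a minimal Python sketch of the steps above. It assumes only what the instructions state: the .sic file is plain text, and the setting appears as a single <OCR_TEXT_TYPE> tag or can be added as the last row. The function name `set_ocr_text_type` and the `OCR_FONT_CODES` table are illustrative names, not part of SimpleIndex.

```python
import re

# Font codes copied from the list above (4 is not listed there).
OCR_FONT_CODES = {
    "Normal": 0, "Typewriter": 1, "Dotmatrix": 2, "Index": 3,
    "OCR A": 5, "OCR B": 6, "MICR E13B": 7, "MICR CMC7": 8,
    "Gothic": 9, "To Be Detected": 10,
}

# Matches an existing tag with any (or no) number between the tags.
TAG_PATTERN = re.compile(r"<OCR_TEXT_TYPE>\s*\d*\s*</OCR_TEXT_TYPE>")


def set_ocr_text_type(sic_text, font_name):
    """Return sic_text with <OCR_TEXT_TYPE> set to the code for font_name."""
    code = OCR_FONT_CODES[font_name]  # raises KeyError for unlisted names
    new_tag = f"<OCR_TEXT_TYPE>{code}</OCR_TEXT_TYPE>"
    if TAG_PATTERN.search(sic_text):
        # The tag exists, so only the number between the tags changes.
        return TAG_PATTERN.sub(new_tag, sic_text, count=1)
    # The tag is missing, so add it as the last row of the file.
    separator = "" if sic_text.endswith("\n") or not sic_text else "\n"
    return sic_text + separator + new_tag + "\n"
```

To apply this to a real job file, read the file's text, pass it through the function, and write the result back. Keeping a backup copy first is prudent, since the file's encoding and any other formatting rules are not described here.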
SimpleIndex uses the .NET regular expressions library.
For more information see the Regular Expressions Wiki Page.
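As a rough illustration of the kind of pattern a regular expression can describe, the sketch below pulls a value out of OCR text. It uses Python's `re` module as a stand-in for the .NET library; the constructs shown (`\b`, `\d`, `{6}`) are written the same way in both. The "INV-" plus six digits format is a made-up example, not a SimpleIndex default.

```python
import re

# Hypothetical field format: "INV-" followed by exactly six digits.
INVOICE_PATTERN = re.compile(r"\bINV-\d{6}\b")


def find_invoice_number(ocr_text):
    """Return the first value matching the pattern, or None if none is found."""
    match = INVOICE_PATTERN.search(ocr_text)
    return match.group(0) if match else None
```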
Is there a way to just use part of a bar code or OCR value? For example, extract “50” from the value “124450”
To do this, create a barcode field (Field 1, for example) and a second field with type “Fixed”. In the template for the second field, enter %FIELD1[5,2]% to get “50” from “124450”.
%FIELD1% on its own returns the entire value of Field 1, the barcode field. Adding [5,2] tells SimpleIndex to start at the 5th character (the “5”) and take 2 characters from the value (“50”).
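The [start,length] rule can be sketched in Python to check a template before using it. This is only a model of the behavior described above: positions count from 1, and the second number is a character count. The function name `expand_template` and the handling of out-of-range positions (Python simply returns fewer characters) are assumptions; SimpleIndex may behave differently in those cases.

```python
import re

# %FIELDn% or %FIELDn[start,length]%, where start counts from 1.
FIELD_REF = re.compile(r"%FIELD(\d+)(?:\[(\d+),(\d+)\])?%")


def expand_template(template, fields):
    """Replace each field reference in template using values from fields."""
    def substitute(match):
        value = fields[int(match.group(1))]
        if match.group(2) is None:
            return value  # plain %FIELDn%: the entire value
        start = int(match.group(2))
        length = int(match.group(3))
        # Convert the 1-based start position to Python's 0-based slicing.
        return value[start - 1:start - 1 + length]
    return FIELD_REF.sub(substitute, template)


print(expand_template("%FIELD1[5,2]%", {1: "124450"}))  # prints 50
```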
Training was removed in version 7 because of the addition of the ABBYY FineReader OCR engine.
There are several things you can do to improve accuracy for OCR.
- Scan at 300 dpi in black & white for best results.
- Adjust the scan settings to remove background noise and improve the definition of characters.
- For Zone OCR, field recognition can often vary based on the surrounding white space and text in the zone. Try varying the size of the zone to achieve optimal results.
- For template matching, make sure all variations of the field format are included in the template list.
- For dictionary matching, add common variations and OCR mistakes to the “thesaurus” list.
- On the Zones & OCR tab (accessed from the Job Options) you can adjust the Max Errors setting to allow for more mistakes in the dictionary matching process.
- Use the Strip Spaces, Strip Characters, Replace Characters and Case Fixing options to standardize the field format prior to matching.
Please refer to the SimpleIndex Wiki for details on how to configure these options.
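To make the last few tips concrete, here is a small Python sketch of standardizing a value and then matching it against a dictionary with an error allowance. It is a model under stated assumptions, not SimpleIndex's implementation: "errors" are counted as Levenshtein edit distance, case fixing is modeled as uppercasing, and the function names are illustrative.

```python
def normalize(value, strip_chars="", replacements=None):
    """Standardize a field value before matching."""
    value = value.replace(" ", "")            # Strip Spaces
    for ch in strip_chars:                     # Strip Characters
        value = value.replace(ch, "")
    for old, new in (replacements or {}).items():  # Replace Characters
        value = value.replace(old, new)
    return value.upper()                       # Case Fixing (assumed: uppercase)


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_dictionary(value, dictionary, max_errors=1):
    """Return the closest entry within max_errors edits, or None."""
    best, best_distance = None, max_errors + 1
    for entry in dictionary:
        distance = edit_distance(value, entry)
        if distance < best_distance:
            best, best_distance = entry, distance
    return best
```

In this model, raising `max_errors` accepts more OCR mistakes but also raises the chance of matching the wrong entry, which is the trade-off to keep in mind when adjusting the Max Errors setting.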
- SimpleIndex.com – Zone OCR
- SimpleIndex.com – Dynamic OCR
- SimpleOCR.com – OCR Guide
- SimpleIndex Wiki – OCR
- SimpleIndex Wiki – OCR Options
- SimpleIndex Wiki – Zone OCR
- SimpleIndex Wiki – Full Page OCR
- SimpleIndex Wiki – Zones & OCR Settings
- SimpleIndex Wiki – OCR to Field
- SimpleIndex Wiki – OCR Text View
- SimpleIndex Wiki – Template & Dictionary Matching OCR
- SimpleIndex Wiki – OMR and OCR Document Separation
Some documents are difficult or impossible to automate with OCR, such as those with non-standard layouts, unconstrained handwriting or very poor scan quality. In applications like invoice processing, fully automating the data entry can require expensive software and weeks of consulting. Even after all that expense, many users miss the interface and data validations that their accounting software entry screens provide.
In cases like these, SimpleIndex can help improve data entry efficiency while archiving your scanned originals at the same time. Here’s how it works:
- Scan a batch of documents for data entry
- Place the SimpleIndex window side-by-side with your data entry window
- Enter the data normally, reading from the scanned image in SimpleIndex
- Press the hotkey combo to transfer the data to SimpleIndex
- Save the image and repeat with the next one
In this configuration, SimpleIndex captures an image of the data entry window, then uses OCR to read the data and index the image. Since the data entry screen has a consistent layout and clear, readable fonts, it can be reliably recognized with OCR.
There are several advantages to this approach:
- Configuration and training take hours, not weeks
- Scanned images are indexed with no extra work
- All the advantages of digital documents: security, searching, sharing, etc.
- Use all the data validation features of your software
- No flipping through paper documents
- Operator keeps eyes on the screen and hands on the keyboard
- Data entry can be done remotely
- Data entry performance improves and files are archived at the same time
These videos demonstrate several ways SimpleIndex® can automatically index different types of documents. If you are new to SimpleIndex, watching these videos is the easiest way to see what it can do. You can follow along using the sample files included in the SimpleIndex Trial.
- Zone OCR with template matching
- Document barcode recognition
- PDF OCR text parsing
- Sort and index MS Office documents
- Indexing with full-text OCR
- Running jobs from an icon
The sample files are copied to your Configuration Folder when you run the SimpleIndex Trial for the first time. If you can’t find the samples, copy them with the Global Settings Wizard in the File menu.
KB Articles for Optical Character Recognition
- Language Pack for Standard/Tesseract OCR
- Languages Supported in SimpleSoftware OCR Engines
- What is Document Imaging?
- Change the Dictionary Separator Value
- Change the OCR Font or Type
- Regular Expression (RegEx) - Syntax or Type
- Autonumber Increment Value
- I'm using full page OCR. The information is all appearing in the txt file but it is losing format about half way through. Data to the right is ending up at the end of the txt doc. Can this be fixed?
- Is there a way to just use part of a bar code or OCR value? For example, extract "50" from the value "124450"
- If I have a form which is filled manually by hand, can SimpleIndex read the data from it?