This demonstrates the PDF OCR text processing capabilities of SimpleIndex
by extracting the Document Number, Date, Document Type, Customer, and Total
from a number of Estimates and Invoices.
All of this information is read automatically from the existing text layer
of a computer-generated PDF, such as one created with a PDF printer driver.
Template and dictionary matching algorithms are used to locate and extract
the correct data values from the text.
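
As a rough illustration of the idea, the sketch below applies template matching (here approximated with regular expressions) and dictionary matching (a lookup against a list of known values) to text that has already been pulled from a PDF. It is not SimpleIndex's actual implementation or configuration: the patterns, the customer list, the sample text, and the function names are all assumptions made up for this example.

```python
import re

# Hypothetical templates: one regular expression per field, each capturing
# the value that follows a label. These patterns are assumptions for this
# sketch, not SimpleIndex's own template syntax.
TEMPLATES = {
    "Document Number": re.compile(
        r"\b(?:Invoice|Estimate)\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)",
        re.IGNORECASE,
    ),
    "Date": re.compile(r"\bDate\s*:?\s*(\d{1,2}/\d{1,2}/\d{2,4})", re.IGNORECASE),
    # \b keeps "Total" from matching inside "Subtotal".
    "Total": re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}

# Hypothetical dictionaries of known values to look for in the text.
DOCUMENT_TYPES = ["Estimate", "Invoice"]
CUSTOMERS = ["Acme Corp", "Globex Inc"]


def match_dictionary(text, entries):
    """Return the first dictionary entry found in the text, or None."""
    lowered = text.lower()
    for entry in entries:
        if entry.lower() in lowered:
            return entry
    return None


def extract_fields(text):
    """Extract the five index fields from the text of one document."""
    fields = {}
    for name, pattern in TEMPLATES.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    fields["Document Type"] = match_dictionary(text, DOCUMENT_TYPES)
    fields["Customer"] = match_dictionary(text, CUSTOMERS)
    return fields


# Made-up sample text standing in for a PDF's text layer.
SAMPLE_TEXT = """Invoice No: INV-1001
Date: 03/15/2021
Bill To: Acme Corp
Subtotal: $1,100.00
Total: $1,250.00
"""

if __name__ == "__main__":
    for field, value in extract_fields(SAMPLE_TEXT).items():
        print(f"{field}: {value}")
```

Templates suit values that follow a predictable label or layout, while a dictionary suits values drawn from a known list, such as customer names.
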
Because the existing text is used, OCR is not performed. This makes
processing much faster and avoids OCR recognition errors, since the text is
read exactly as it was written to the file. OCR can still be used to get
text from scanned PDF files that have no existing text layer.
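
The choice between the text layer and OCR can be sketched as a simple per-page check. The snippet below assumes the text of each page has already been read by some other means and only decides which pages would need OCR. The function names and the 20-character threshold are assumptions for illustration and do not describe how SimpleIndex makes this decision.

```python
def needs_ocr(page_text, min_chars=20):
    """Return True when a page's text layer is missing or too sparse to use.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    # Count only non-whitespace characters so blank pages count as empty.
    meaningful = "".join(page_text.split())
    return len(meaningful) < min_chars


def plan_pages(page_texts, min_chars=20):
    """Label each page 'text_layer' or 'ocr' based on its existing text."""
    return [
        "ocr" if needs_ocr(text, min_chars) else "text_layer"
        for text in page_texts
    ]


if __name__ == "__main__":
    pages = [
        "Invoice No: INV-1001 Date: 03/15/2021 Total: $1,250.00",  # has text
        "",  # scanned page with no text layer
    ]
    print(plan_pages(pages))
```
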