PDF Text Processing Demo

This sample job demonstrates the PDF text processing capabilities of SimpleIndex. It extracts the Document Number, Date, Document Type, Customer, and Total from a set of documents by processing the text layer of the PDF files, without OCR.

Computer-generated PDF files, such as those created using PDF printer drivers, already contain digitized text. SimpleIndex reads the text and performs Template and Dictionary Matching to locate and extract the correct data values from the text.
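As a rough illustration of the idea (not SimpleIndex's actual implementation or configuration), template matching can be thought of as pattern matching against the extracted text, and dictionary matching as a lookup of known values. The Python sketch below assumes the text layer has already been extracted into a string. The sample text, the field patterns, the dictionary entries, and the function name extract_fields are all hypothetical.

```python
import re

# Hypothetical text as it might come from a PDF text layer.
SAMPLE_TEXT = """ACME Supply Co.
Invoice
Document Number: INV-10482
Date: 03/14/2023
Total: $1,250.00
"""

# Hypothetical templates: one regular expression per field.
TEMPLATES = {
    "Document Number": re.compile(r"Document Number:\s*([A-Z]+-\d+)"),
    "Date": re.compile(r"Date:\s*(\d{2}/\d{2}/\d{4})"),
    "Total": re.compile(r"Total:\s*\$([\d,]+\.\d{2})"),
}

# Hypothetical dictionaries of known values for each field.
DICTIONARIES = {
    "Document Type": ["Invoice", "Purchase Order", "Packing Slip"],
    "Customer": ["ACME Supply Co.", "Globex Corporation"],
}


def extract_fields(text):
    """Return a dict of field values found in text; missing fields map to None."""
    fields = {}
    # Template matching: capture the value that follows each field label.
    for name, pattern in TEMPLATES.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    # Dictionary matching: first known value that appears in the text,
    # compared case-insensitively.
    lowered = text.lower()
    for name, known_values in DICTIONARIES.items():
        fields[name] = next(
            (value for value in known_values if value.lower() in lowered), None
        )
    return fields
```

Calling extract_fields(SAMPLE_TEXT) returns one entry per field. A field whose pattern or dictionary finds nothing is set to None, which a real job could use to flag the document for review.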

Because the existing text is used, OCR is not performed. Processing is therefore much faster, and the extracted text matches the PDF's text layer exactly, avoiding the recognition errors that can occur with zone OCR.

While this demo runs interactively, text processing jobs can run in unattended mode since the data does not need to be verified.

Full-Page OCR can also be used to get text from scanned PDF files that have no existing text. To improve performance, SimpleIndex detects whether a PDF file already contains text and performs OCR only on the documents that need it.
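The decision of when to fall back to OCR can be sketched in a few lines of Python. This is only an assumption about how such a check might work, not a description of SimpleIndex internals. The function names, the min_chars threshold, and the run_ocr callable are hypothetical, and the per-page text is assumed to have been extracted already.

```python
def needs_ocr(page_texts, min_chars=20):
    """Return True if the pages carry too little text to count as a text layer.

    min_chars is a hypothetical threshold; a scanned PDF usually yields
    empty or near-empty strings for every page.
    """
    total = sum(len(text.strip()) for text in page_texts)
    return total < min_chars


def get_text(page_texts, run_ocr):
    """Use the existing text when present; otherwise call the OCR fallback.

    run_ocr is a hypothetical zero-argument callable that returns OCR text.
    It is only invoked for documents that have no usable text layer.
    """
    if needs_ocr(page_texts):
        return run_ocr()
    return "\n".join(page_texts)
```

Passing OCR as a callable keeps the expensive step lazy, so documents that already contain text never trigger it.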