PDF

From Simple Wiki

The Portable Document Format (PDF), originally developed by Adobe, is one of the most widely used file formats for electronic documents.

SimpleIndex has a number of features and settings that impact the processing of PDF files.

PDF Features & Settings

The key wiki pages for PDF processing info:

Installed PDF software such as Acrobat Professional or Foxit can be used in place of the embedded SimpleView viewer to provide digital signing, editing, form filling, and other capabilities within your SimpleIndex processing workflow.

SimpleIndex produces UTF-8 encoded PDF files.

PDF Processing Overview

It is increasingly common for similar documents to arrive both on paper and by email. For example, many companies email invoices as PDF files, while just as many continue to send paper invoices. SimpleIndex lets you process both in the same workflow, so each document is handled by the fastest and most accurate method available for its type.

Understanding the different types of PDF files is key to setting up the most efficient processing workflow for your files. The three types of PDF are:

  • Native PDF files generated by desktop publishing applications and PDF printer drivers, containing a mix of text and images.
  • Scanned PDF files that consist of a single embedded image for each page.
  • Searchable PDF Image+Text files that contain a scanned image for each page with a hidden text layer that can be searched or selected for copy and paste operations.
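The three categories above can be modeled as a small classification routine. The following is a minimal sketch, not SimpleIndex's actual detection logic: `PageInfo` is a hypothetical per-page summary that you would fill in from whatever PDF library you use, and the rule "one full-page image plus only invisible text means searchable" is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class PdfKind(Enum):
    NATIVE = "native"
    SCANNED = "scanned"
    SEARCHABLE = "searchable image+text"


@dataclass
class PageInfo:
    """Hypothetical per-page summary; not part of any real library."""
    image_count: int          # embedded images on the page
    full_page_image: bool     # True if a single image covers the whole page
    visible_text_chars: int   # characters drawn normally
    hidden_text_chars: int    # characters drawn invisibly (typical of OCR layers)


def classify_page(page: PageInfo) -> PdfKind:
    """Map one page onto the three categories described above."""
    if page.full_page_image and page.visible_text_chars == 0:
        if page.hidden_text_chars > 0:
            return PdfKind.SEARCHABLE
        return PdfKind.SCANNED
    return PdfKind.NATIVE


def classify_document(pages: List[PageInfo]) -> PdfKind:
    """Classify a file conservatively: any scanned page makes the file 'scanned'."""
    kinds = {classify_page(p) for p in pages}
    if PdfKind.SCANNED in kinds:
        return PdfKind.SCANNED
    if PdfKind.SEARCHABLE in kinds:
        return PdfKind.SEARCHABLE
    return PdfKind.NATIVE
```

Classifying the whole file by its "worst" page is a deliberate simplification. A real workflow might instead decide page by page, so that only the pages without text are sent to OCR.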

Most document imaging applications convert PDF files to images in order to use OCR for data capture. This is a slow, CPU-intensive process in which characters can be misrecognized. Native PDF files are often rasterized to images in the same way, which discards the exact text they already contain.

SimpleIndex can detect which of the three PDF formats each file uses and performs OCR only on files that don't contain text. Once the text is read from the PDF, data can be extracted using Template and Dictionary matching, Regular Expressions, or line and column coordinates in the text.
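To make the idea concrete, here is a minimal Python sketch of two of the steps mentioned above: skipping OCR when text is already present, and pulling a field out of the text either with a regular expression or by line and column position. It is an illustration only and does not reflect SimpleIndex's internal implementation or configuration syntax. The function names, the sample invoice text, and the 1-based coordinate convention are all assumptions.

```python
import re
from typing import List, Optional


def needs_ocr(page_texts: List[str]) -> bool:
    """Return True when no page carries any non-whitespace text."""
    return not any(t.strip() for t in page_texts)


def extract_by_regex(text: str, pattern: str) -> Optional[str]:
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def extract_by_position(text: str, line: int, col: int, length: int) -> str:
    """Return `length` characters starting at a 1-based line and column."""
    lines = text.splitlines()
    if col < 1 or not 1 <= line <= len(lines):
        return ""
    start = col - 1
    return lines[line - 1][start:start + length].strip()


# Hypothetical text as it might be read from a native PDF invoice.
sample = "ACME Supplies\nInvoice No: INV-20417\nTotal Due:   1,284.50\n"

if not needs_ocr([sample]):
    invoice_no = extract_by_regex(sample, r"Invoice No:\s*(\S+)")  # "INV-20417"
    total_due = extract_by_position(sample, line=3, col=12, length=12)  # "1,284.50"
```

Regular expressions tolerate small layout shifts because they anchor on a label such as `Invoice No:`. Positional extraction is simpler but assumes the field always appears at the same line and column, which tends to hold only for text produced by a fixed template.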

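With Native PDF files, the text is read directly from the file rather than recognized, so the extracted data matches the document exactly and is free of OCR errors. With searchable PDF files, the existing text layer is reused, so the OCR does not need to be repeated. SimpleIndex can process hundreds of pages per minute this way, many times faster than OCR.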

SimpleIndex PDF Indexing Demo Video

This video was recorded with a previous version of SimpleIndex. Refer to the wiki documentation for the latest updates.

Related Knowledge Base Articles