PDF

From Simple Wiki

Revision as of 15:25, 17 January 2022

Adobe's Portable Document Format (PDF) is one of the most widely used file formats for electronic documents.

SimpleIndex has a number of features and settings that impact the processing of PDF files.

PDF Features & Settings

The key pages for these include:

  • MS Office and PDF Processing Features

Installed PDF software such as Acrobat Professional or Foxit can be used in place of the SimpleView embedded viewer to provide digital signing, editing, form filling, and other capabilities within your SimpleIndex processing workflow.

PDF Processing Overview

It is increasingly common for similar documents to be received both on paper and by email. For example, many companies email invoices as PDF files, and just as many continue to send paper invoices. SimpleIndex lets you process both in the same workflow, quickly and accurately.

Understanding the different types of PDF files is key to setting up the most efficient processing workflow for your files. The three types of PDF are:

  • Native PDF files generated by desktop publishing applications and PDF printer drivers, containing a mix of text and images.
  • Scanned PDF files that consist of a single embedded image for each page.
  • Searchable PDF Image+Text files that contain a scanned image for each page with a hidden text layer that can be searched or selected for copy and paste operations.
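
The distinction between the three types can be illustrated with a short Python sketch. It is not how SimpleIndex itself works internally; it only shows the decision logic, assuming some PDF library has already reported two facts per page. The `PageInfo` fields and the `classify` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class PdfKind(Enum):
    NATIVE = "native"
    SCANNED = "scanned"
    SEARCHABLE = "searchable"


@dataclass
class PageInfo:
    """Hypothetical per-page summary, as a PDF library might report it."""
    has_text: bool         # the page contains extractable text
    full_page_image: bool  # a single embedded image covers the whole page


def classify(pages: List[PageInfo]) -> PdfKind:
    """Guess the PDF type from per-page observations (assumed heuristic)."""
    has_text = any(p.has_text for p in pages)
    all_images = bool(pages) and all(p.full_page_image for p in pages)
    if not has_text:
        # No text anywhere: treat as a scanned file, including empty input.
        return PdfKind.SCANNED
    if all_images:
        # Every page is one image, yet text exists: hidden text layer.
        return PdfKind.SEARCHABLE
    return PdfKind.NATIVE


if __name__ == "__main__":
    print(classify([PageInfo(has_text=True, full_page_image=True)]))
```

Real files can mix page types (for example, a native cover page followed by scanned pages), so a production check would likely need to decide page by page rather than per file.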

Most document imaging applications will convert PDF files to images in order to use OCR for data capture. This is a slow and CPU-intensive process with the possibility for characters to be misrecognized. Native PDF files are often exported as images, losing all of the benefits of the Native PDF format.

SimpleIndex can detect which of the three PDF formats each file is in and performs OCR only on files that don't contain text. Once the text is read from the PDF, data can be extracted using Template and Dictionary matching, Regular Expressions, or by indicating line and column coordinates in the text.
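
Two of the extraction methods mentioned above, regular expressions and line and column coordinates, can be sketched in Python as follows. This is an illustration only and does not reflect SimpleIndex's configuration syntax; the sample text, the pattern, and the helper names `by_regex` and `by_position` are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical text, as it might be read from a Native or Searchable PDF.
SAMPLE_TEXT = (
    "ACME Supplies Ltd.\n"
    "Invoice No: INV-20417\n"
    "Date: 2022-01-17    Total: 1,284.50\n"
)

# Assumed pattern: the invoice number is the first token after the label.
INVOICE_PATTERN = r"Invoice No:\s*(\S+)"


def by_regex(text: str, pattern: str) -> Optional[str]:
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def by_position(text: str, line: int, column: int, length: int) -> Optional[str]:
    """Return `length` characters starting at a 1-based line and column."""
    lines = text.splitlines()
    if column < 1 or not 1 <= line <= len(lines):
        return None
    start = column - 1
    value = lines[line - 1][start:start + length].strip()
    return value or None


if __name__ == "__main__":
    print(by_regex(SAMPLE_TEXT, INVOICE_PATTERN))  # invoice number
    print(by_position(SAMPLE_TEXT, 3, 7, 10))      # date field on line 3
```

Positional extraction only works when the layout is fixed from one document to the next, while a regular expression tolerates fields that move around but depends on a stable label or value format.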

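With Native PDF files, the text is read directly from the file rather than recognized from an image, so the extracted data is free of OCR errors. With searchable PDF files, the existing text layer is reused, so you don't need to re-do the OCR. SimpleIndex can process hundreds of pages per minute this way, many times faster than OCR.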