| It is increasingly common for similar documents to arrive both on paper and by email. For example, many companies email invoices as [[PDF files]], while just as many continue to send paper invoices. SimpleIndex lets you process both in the same workflow, quickly and accurately.
| | |
| Understanding the different types of [[PDF]] files is key to setting up the most efficient processing workflow for your files. The three types of PDF are:
| |
| | |
| * Native PDF files generated by desktop publishing applications and PDF printer drivers, containing a mix of text and images.
| |
| | |
| * Scanned PDF files that consist of a single embedded image for each page.
| |
| | |
| * [[Searchable PDF]] (Image+Text) files that contain a scanned image for each page with a hidden text layer that can be searched or selected for copy and paste operations.
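The three types above can be told apart from a few per-page facts. The following is a minimal sketch of that decision logic, not SimpleIndex's actual implementation: `PageSummary`, its fields, and `classify_page` are hypothetical names, and the sketch assumes some PDF parser has already reported how much visible text, how much hidden text, and whether a full-page image each page contains.

```python
from dataclasses import dataclass
from enum import Enum


class PdfKind(Enum):
    NATIVE = "native"
    SCANNED = "scanned"
    SEARCHABLE = "searchable"


@dataclass
class PageSummary:
    """Hypothetical per-page facts that a PDF parser would supply."""
    visible_text_chars: int  # characters drawn visibly on the page
    hidden_text_chars: int   # characters in an invisible text layer
    full_page_image: bool    # True if a single image covers the page


def classify_page(page: PageSummary) -> PdfKind:
    """Assumed rule of thumb for mapping page facts to a PDF type."""
    if page.full_page_image and page.hidden_text_chars > 0:
        return PdfKind.SEARCHABLE  # scanned image plus hidden text layer
    if page.full_page_image and page.visible_text_chars == 0:
        return PdfKind.SCANNED     # image only, no text to read
    return PdfKind.NATIVE          # real text, possibly mixed with images


def needs_ocr(pages: list[PageSummary]) -> bool:
    """OCR is only required when at least one page has no text at all."""
    return any(classify_page(p) is PdfKind.SCANNED for p in pages)
```

Under these assumptions, only image-only pages are ever sent to OCR; native and searchable pages go straight to text extraction.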
| |
| | |
| Most document imaging applications convert PDF files to images in order to use [[OCR]] for [[data capture]]. This is a slow, CPU-intensive process that risks misrecognizing characters. Even Native PDF files are often converted to images this way, losing all of the benefits of the Native PDF format.
| |
| | |
| SimpleIndex detects which of the three PDF formats each file uses and performs [[OCR]] only on files that don't contain text. Once the text is read from the PDF, data can be extracted using [[Template]] and [[Dictionary matching]], [[Regular Expressions]], or line and column coordinates in the text.
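Two of the extraction methods mentioned above, regular expressions and line/column coordinates, can be illustrated in a few lines of Python. This is only a sketch of the general techniques applied to already-extracted page text; the function names, the invoice-number pattern, and the sample text are assumptions for illustration and do not reflect SimpleIndex's configuration syntax.

```python
import re

# Assumed pattern: "Invoice No:", "Invoice #", etc. followed by an identifier.
INVOICE_NO = re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE)


def extract_by_regex(text, pattern=INVOICE_NO):
    """Return the first captured group of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


def extract_by_position(text, line, col_start, col_end):
    """Return the text at a 1-based line and 1-based inclusive column range."""
    lines = text.splitlines()
    if not 1 <= line <= len(lines):
        return ""
    return lines[line - 1][col_start - 1:col_end].strip()


# Hypothetical page text as it might be read from a Native PDF.
sample = "ACME Corp\nInvoice No: INV-2041\nTotal:      149.50\n"

invoice_number = extract_by_regex(sample)        # "INV-2041"
total = extract_by_position(sample, 3, 8, 19)    # "149.50"
```

Regular expressions tolerate fields that move around on the page, while line and column coordinates suit fixed-layout documents where the value always appears in the same place.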
| |
| | |
| With Native PDF files the text is read directly from the file rather than recognized, so the extracted data is 100% accurate, and with searchable [[PDF]] files you don't need to redo the [[OCR]]. SimpleIndex can process hundreds of pages per minute this way, many times faster than running [[OCR]].
| |