PDF Processing: Difference between revisions

Latest revision as of 19:25, 17 January 2022

Redirect to:

PDF

@@ Line 1: / Line 1: @@
-It is increasingly common to receive similar documents both on paper and by email. For example, many companies email invoices as [[PDF files]], while just as many continue to send paper invoices. SimpleIndex lets you process both in the same workflow with the greatest speed and accuracy available.
+#REDIRECT [[PDF]]
-Understanding the different types of [[PDF]] files is key to setting up the most efficient processing workflow for your files. The three types of PDF are:
-* Native PDF files generated by desktop publishing applications and PDF printer drivers, containing a mix of text and images.
-* Scanned PDF files that consist of a single embedded image for each page.
-* [[Searchable PDF]] Image+Text files that contain a scanned image for each page with a hidden text layer that can be searched or selected for copy and paste operations.
-Most document imaging applications convert PDF files to images so that they can use [[OCR]] for [[data capture]]. This is a slow, CPU-intensive process that can misrecognize characters. Native PDF files are often exported as images as well, which discards the exact text they already contain and with it all of the benefits of the Native PDF format.
-SimpleIndex has the ability to detect which of the three PDF formats each file is in and perform [[OCR]] only on files that don't contain text. Once the text is read from the PDF, data can be extracted using [[Template]] and [[Dictionary matching]], [[Regular Expressions]], or by indicating line and column coordinates in the text.
-With Native PDF files the text is read directly from the file rather than recognized, so the extracted data is 100% accurate, and with [[Searchable PDF]] files you don't need to repeat the [[OCR]]. SimpleIndex can process hundreds of pages per minute this way, many times faster than [[OCR]].
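
The workflow described in the removed text (detect the PDF type, run OCR only when there is no text layer, then pull fields out of the text by regular expression or by line and column) can be sketched roughly as follows. This is an illustration only and not SimpleIndex's implementation: the `Page` record, the function names, and the rule used to tell the three types apart are all assumptions, and a real tool would fill in `Page` from an actual PDF parser.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    """Hypothetical per-page summary, as a PDF parser might report it."""
    text: str               # text layer found on the page ("" if none)
    full_page_image: bool   # True if one image covers the whole page


def classify_pdf(pages: List[Page]) -> str:
    """Return 'scanned', 'searchable', or 'native' (assumed heuristic)."""
    has_text = any(p.text.strip() for p in pages)
    all_scans = bool(pages) and all(p.full_page_image for p in pages)
    if not has_text:
        return "scanned"      # no text layer anywhere: OCR is required
    if all_scans:
        return "searchable"   # page images with a text layer behind them
    return "native"           # text written directly by the application


def needs_ocr(pages: List[Page]) -> bool:
    """OCR only the files that carry no text at all."""
    return classify_pdf(pages) == "scanned"


def field_by_regex(text: str, pattern: str) -> Optional[str]:
    """Return the first capture group of `pattern`, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def field_by_position(text: str, line: int, column: int, length: int) -> str:
    """Return `length` characters at a 1-based line and column."""
    lines = text.splitlines()
    if not 1 <= line <= len(lines):
        return ""
    start = column - 1
    return lines[line - 1][start:start + length].strip()


if __name__ == "__main__":
    sample = "ACME Corp\nInvoice No: INV-1042\nTotal:    199.00"
    pages = [Page(text=sample, full_page_image=False)]
    print(classify_pdf(pages), needs_ocr(pages))
    print(field_by_regex(sample, r"Invoice No:\s*(\S+)"))
    print(field_by_position(sample, line=3, column=11, length=6))
```

In this sketch a file with no text on any page is treated as scanned, so a native PDF that holds only graphics would also be sent to OCR, which matches the rule of performing OCR only on files that don't contain text.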