Enhanced Text Parsing & PDF Support
MS Office and PDF text parsing features are now included in the Basic version of SimpleIndex, making it much more affordable to enable automatic document sorting on the desktop. Additional Office and PDF features include:
- Convert any MS Office, HTML, XML and image files to PDF before processing
- Read and write password-protected PDF files
- Searchable PDF output (Image + Hidden Text)
- Interactive template builder and tester
- Easily select PDF or PDF/A output format
- Native PDF viewer and auto-repair of problematic PDFs
- Read data from PDF forms
- Populate blank PDF forms with index data
TaxStacker: Sort & Classify Federal Tax Documents
This is a great way for accountants and tax preparers to organize complex tax returns in a way that makes it easy to find specific documents. It can also be used to ensure all required schedules and supporting documents are present in the finished return.
Use our out-of-the-box TaxStacker configuration to automatically identify all the forms and schedules that make up a U.S. federal income tax return. These can then be sorted into separate PDF files or combined into a single file that has bookmarks to indicate each section.
Check and Repair All PDF Files
You can set SimpleIndex to check and repair every PDF file it processes.
Go to this location in the Windows Registry:
Computer\HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\SimpleIndex\Misc
Create a new String Value named “FixAllPDF” and set its value to 1.
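As an alternative to editing the registry by hand, the same setting can be described in a .reg file. The sketch below only builds the file text; the key path and value name come from the steps above, while the helper name `build_reg_file` is hypothetical. Importing the file needs administrator rights, and the encoding Regedit expects on your system is worth checking before you rely on it.

```python
# Sketch: build the text of a .reg file that creates the FixAllPDF string value.
# The key path and value name are taken from the article; build_reg_file is a
# hypothetical helper, not part of SimpleIndex.

KEY_PATH = r"HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\SimpleIndex\Misc"


def build_reg_file(key_path: str, name: str, value: str) -> str:
    """Return .reg file text that sets one string (REG_SZ) value."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        f'"{name}"="{value}"',
        "",
    ]
    return "\r\n".join(lines)


reg_text = build_reg_file(KEY_PATH, "FixAllPDF", "1")
print(reg_text)
```

Save the printed text with a .reg extension and review it before importing it.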
Keep Pages in Original Order when Bookmarking
To keep all pages in the order they were imported, even when they belong to different bookmarks, do the following:
1. Open the configuration in Notepad.
2. Search for <BOOKMARK_PAGE_ORDER>
3. Change this line from “false” to “true”: <BOOKMARK_PAGE_ORDER>true</BOOKMARK_PAGE_ORDER>
4. Save and close.
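If you need to apply this change to many configurations, the edit can be scripted. This is a minimal sketch that assumes the configuration is plain text containing the <BOOKMARK_PAGE_ORDER> element exactly as shown in step 3; the function name and the surrounding <JOB> element in the sample are hypothetical.

```python
import re


def set_bookmark_page_order(config_text: str, enabled: bool) -> str:
    """Rewrite the <BOOKMARK_PAGE_ORDER> element to 'true' or 'false'."""
    value = "true" if enabled else "false"
    pattern = re.compile(r"(<BOOKMARK_PAGE_ORDER>)\s*\w*\s*(</BOOKMARK_PAGE_ORDER>)")
    new_text, count = pattern.subn(rf"\g<1>{value}\g<2>", config_text)
    if count == 0:
        # Refuse to guess where the element belongs if it is missing.
        raise ValueError("BOOKMARK_PAGE_ORDER element not found")
    return new_text


# Hypothetical sample; a real job configuration contains many more settings.
sample = "<JOB>\n<BOOKMARK_PAGE_ORDER>false</BOOKMARK_PAGE_ORDER>\n</JOB>"
updated = set_bookmark_page_order(sample, True)
print(updated)
```

Back up the configuration file before overwriting it with the updated text.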
Do Not Combine Pages to 1 Bookmark
To keep pages under separate bookmarks, instead of combining them into a single bookmark when the same bookmark value is found on several non-adjacent images in the batch, do the following:
1. Open the Job Configuration file in Notepad.
2. Search for this value: <BOOKMARK_PDF1>
3. Enter this directly above the line that has <BOOKMARK_PDF1> if it's not already there: <BOOKMARK_UNIQUE_LEVELS>-1</BOOKMARK_UNIQUE_LEVELS>
4. The default value is -1, which means that pages are not combined into one bookmark unless they are consecutive. A value of 0 means that the first bookmark level is combined into one bookmark value and the rest are not. A value of 1 means that the first and second bookmark levels are combined and the rest are not, and so on.
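The steps above can also be scripted. The following sketch assumes the configuration is line-oriented text and that <BOOKMARK_PDF1> appears on its own line; `ensure_unique_levels` and the sample content are hypothetical.

```python
def ensure_unique_levels(config_text: str, level: int = -1) -> str:
    """Insert <BOOKMARK_UNIQUE_LEVELS> above the <BOOKMARK_PDF1> line if missing."""
    if "<BOOKMARK_UNIQUE_LEVELS>" in config_text:
        return config_text  # already present, leave the existing value alone
    new_line = f"<BOOKMARK_UNIQUE_LEVELS>{level}</BOOKMARK_UNIQUE_LEVELS>"
    out = []
    inserted = False
    for line in config_text.splitlines():
        if not inserted and "<BOOKMARK_PDF1>" in line:
            # Reuse the indentation of the <BOOKMARK_PDF1> line.
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + new_line)
            inserted = True
        out.append(line)
    if not inserted:
        raise ValueError("<BOOKMARK_PDF1> line not found")
    return "\n".join(out)


# Hypothetical sample; the value inside <BOOKMARK_PDF1> is made up.
sample = "<JOB>\n  <BOOKMARK_PDF1>Form</BOOKMARK_PDF1>\n</JOB>"
result = ensure_unique_levels(sample)
print(result)
```

Running the function a second time leaves the text unchanged, which mirrors the "if it's not already there" condition in step 3.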
I have a scanner/copier that creates PDF and TIFF files and saves them to my file server. Can I use SimpleIndex to create a searchable CD/DVD from these files?
This feature is included in SimpleIndex at no additional cost and is called the Media Wizard.
The Media Wizard is located in the “Send” menu. It lets you package your images, indexes, a database and a free SimpleSearch viewer (for just that CD or DVD) for burning. It also provides an easy way to fit the maximum amount of information on your chosen media.
To set up the Media Wizard, point it to your image folder and database and select the media you want to use. It then saves a folder containing all the files you need, sized to fit that media, in the location you designate. You then burn these files using the burning application of your choice.
How do you configure OCR to read index information from MS Office or PDF documents?
MS Office and PDF files generated by software or PDF printer drivers already contain the text you need to recognize. Scanned documents need OCR to read text from an image of the page. With Office and PDF files, SimpleIndex can simply read the text, which is much faster and more accurate than image OCR.
To recognize index fields from the document text, first create OCR fields on the Index tab as you would normally. Next, on the Zones & OCR options tab, check the “Use Full Page OCR for this Field” option for each OCR field. This tells SimpleIndex to process the existing file text.
If the index value is a unique pattern of digits or list of possible values, use Template or Dictionary matching to locate the value within the text. Please see the manual for details on Template and Dictionary matching.
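To illustrate the idea, here is a rough sketch of both approaches using Python regular expressions and a plain value list. This is a stand-in, not SimpleIndex's own template syntax (see the manual for that); the function names, the sample text and the account number pattern are all assumptions.

```python
import re
from typing import Optional, Sequence


def match_template(text: str, pattern: str) -> Optional[str]:
    """Return the first substring that fits the pattern, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


def match_dictionary(text: str, values: Sequence[str]) -> Optional[str]:
    """Return the first listed value that occurs in the text (case-insensitive)."""
    lowered = text.lower()
    for value in values:
        if value.lower() in lowered:
            return value
    return None


# Made-up page text standing in for the text layer of an Office or PDF file.
page_text = "Invoice from Acme Supply\nAccount No: 1244-50917\nTotal due: 310.00"

account = match_template(page_text, r"\b\d{4}-\d{5}\b")  # 4 digits, dash, 5 digits
vendor = match_dictionary(page_text, ["Globex", "Acme Supply"])
print(account, vendor)
```

A pattern works best when the value has a shape nothing else on the page shares, while a list works best when the set of possible values is known in advance.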
If the value appears in a specific location in each file, coordinates can be used to locate it. When processing text, the X, Y, Width and Height settings correspond to line and column numbers within the file text. This is explained in greater depth in the manual.
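As a sketch of what a line-and-column zone means, the function below cuts a rectangle out of plain text. It assumes zero-based line and column numbers, which may not match how SimpleIndex counts them; check the manual for the actual convention. The function name and sample text are hypothetical.

```python
from typing import List


def read_text_zone(text: str, x: int, y: int, width: int, height: int) -> List[str]:
    """Return the characters inside a rectangular zone of plain text.

    Assumption: y is the zero-based first line and x the zero-based first column.
    """
    lines = text.splitlines()[y : y + height]
    return [line[x : x + width] for line in lines]


# Made-up text with the value of interest starting at column 19 of the first line.
page_text = "ACME SUPPLY".ljust(19) + "INVOICE 10045\n" + "Date: 2024-01-15"

zone = read_text_zone(page_text, x=19, y=0, width=13, height=1)
print(zone)
```

This approach only works when every file places the value at the same position, so pattern or list matching is usually more robust for variable layouts.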
SimpleIndex will assume that any TXT file with the same name as a file being processed is the OCR text for that file, so this method can work with any type of file.
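A minimal sketch of that lookup follows. It assumes "the same name" means the same base name with a .txt extension in the same folder (for example report.docx and report.txt); the exact naming rule SimpleIndex applies is not stated here, and `find_sidecar_text` is a hypothetical helper.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_sidecar_text(document: Path) -> Optional[Path]:
    """Return the .txt file that shares the document's base name, if it exists."""
    candidate = document.with_suffix(".txt")
    return candidate if candidate.is_file() else None


# Demonstration in a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "report.docx"
    doc.write_bytes(b"")
    (Path(tmp) / "report.txt").write_text("Account No: 1244-50917", encoding="utf-8")

    sidecar = find_sidecar_text(doc)
    sidecar_name = sidecar.name if sidecar else None
    missing = find_sidecar_text(Path(tmp) / "other.pdf")

print(sidecar_name, missing)
```

Generating such a text file with your own tool is one way to feed text from otherwise unsupported file types into the matching features.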
Find out more about Optical Character Recognition on the SimpleOCR Guide.
Imprint, Endorse, or Bates Stamp Images Electronically
Many legal applications require documents to have a sequential number, called a Bates stamp, printed in a specific location on each page. Usually this requires the purchase of a much more expensive scanner that has a built-in printer, called an imprinter or endorser, to print the number on the pages as they are scanned. However, if documents are being submitted electronically, the Bates stamp does not have to be physically printed on the page. SimpleIndex's Electronic Imprinting feature lets you apply the Bates stamp to the images after they are scanned, saving you thousands on specialized scanning hardware.
New in SimpleIndex v9.0: you can apply image-based, semi-transparent watermarks to PDF files to add logos, trademarks or backgrounds.
Electronic imprinting is also useful for anyone looking to apply a page numbering scheme, scan date, copyright notice or any other text to document images as they are scanned.
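To make the numbering scheme concrete, here is a small sketch that generates sequential Bates-style labels, one per page. The prefix, starting number and digit count are illustrative choices, not SimpleIndex settings, and the function name is hypothetical.

```python
from typing import List


def bates_numbers(prefix: str, start: int, count: int, digits: int = 6) -> List[str]:
    """Return `count` sequential labels, zero-padded to a fixed number of digits."""
    return [f"{prefix}{number:0{digits}d}" for number in range(start, start + count)]


# One label per page of a three-page document.
labels = bates_numbers("ABC", 1, 3)
print(labels)
```

Zero-padding keeps the labels the same width, so they sort correctly as text and occupy the same space on every page.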
Patent ID and Title Extraction
To avoid manual data entry, pattern matching is used to automatically read the Patent ID Number and Title from any US patent application straight out of the box.
This job configuration is available for free for SimpleIndex users from the link below.
SimpleIndex Patents Demo Job Configuration
KB Articles for Patent ID and Title
- Languages Supported in SimpleSoftware OCR Engines
- What is Document Imaging?
- Change the Dictionary Separator Value
- Change the OCR Font or Type
- Regular Expression (RegEx) - Syntax or Type
- Autonumber Increment Value
- I'm using full page OCR. The information is all appearing in the txt file but it is losing format about half way through. Data to the right is ending up at the end of the txt doc. Can this be fixed?
- Is there a way to just use part of a bar code or OCR value? For example, extract "50" from the value "124450"
- If I have a form which is filled manually by hand, can SimpleIndex read the data from it?
- How do you train the OCR engine for better accuracy?
MSDS Material Safety Data Sheets Indexing
KB Articles for MSDS
- Change the Dictionary Separator Value
- Regular Expression (RegEx) - Syntax or Type
- Check and Repair All PDF Files
- Keep Pages in Original Order when Bookmarking
- Do Not Combine Pages to 1 Bookmark
- Can I split a PDF based on bookmark values?
- Is it possible to search for and retrieve documents with Windows desktop search?
- Can SimpleIndex read bar codes from existing PDF files?
- Is there a way to just use part of a bar code or OCR value? For example, extract "50" from the value "124450"
- How do you configure OCR to read index information from MS Office or PDF documents?
SimpleInvoice Invoice Processing Solution
SimpleInvoice is a preconfigured solution that uses the OCR and dictionary matching functionality of the SimpleIndex scanning and indexing software to automatically scan, name, and organize incoming invoices into your chosen folder structure of searchable PDF files.
SimpleInvoice requires minimal configuration to get started. It comes with everything you need to index most common invoice styles. The customer and vendor lists, as well as your particular Purchase Order and Invoice number styles, can be customized for your company.
Use SimpleInvoice to:
- Automatically receive and enter invoices in your accounting software, especially QuickBooks
- Create full-text searchable invoice files
- Create an organized filing system for archiving invoices
- Quickly find specific invoices based on vendor, date, invoice number and other index fields
Please Contact Us to find out more about SimpleInvoice!
Template Autofill
This feature, added in SimpleIndex 8.1, lets you specify the OCR pattern of each vendor's invoice number as a column in your Vendor database. When processing invoices, only the template specific to that vendor is loaded and all other templates are ignored. This greatly improves the accuracy of invoice number capture, since each vendor uses a different label and numbering scheme for its invoices.
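The idea can be sketched as a lookup from vendor to pattern. The example below uses Python regular expressions rather than SimpleIndex template syntax, and the vendor names, patterns and function name are invented for illustration.

```python
import re
from typing import Dict, Optional

# Hypothetical stand-in for the vendor database column that stores each pattern.
VENDOR_TEMPLATES: Dict[str, str] = {
    "Acme Supply": r"Invoice No\.\s*(INV-\d{6})",
    "Globex": r"Ref\s+(\d{5})",
}


def capture_invoice_number(
    vendor: str, text: str, templates: Dict[str, str]
) -> Optional[str]:
    """Apply only the named vendor's pattern; every other pattern is ignored."""
    pattern = templates.get(vendor)
    if pattern is None:
        return None
    found = re.search(pattern, text)
    return found.group(1) if found else None


# This text would also satisfy the Globex pattern, but it is never tried.
invoice_text = "Acme Supply\nInvoice No. INV-004211\nRef 99999"
number = capture_invoice_number("Acme Supply", invoice_text, VENDOR_TEMPLATES)
print(number)
```

Restricting the search to one vendor's pattern avoids false matches from text that happens to fit another vendor's numbering scheme.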
FAQ Related to Invoice Processing
- Regular Expression (RegEx) - Syntax or Type
- Will your SimpleQB allow me to scan in old invoices or bank statements directly into QuickBooks?
- Is there a way to just use part of a bar code or OCR value? For example, extract "50" from the value "124450"
- Can SimpleQB be used to scan in receipts and invoices which are then matched to the files kept in the QuickBooks System?
- How do you train the OCR engine for better accuracy?
PDF Invoice OCR Demo
This demonstrates the PDF OCR text processing capabilities of SimpleIndex by extracting the Document Number, Date, Document Type, Customer and Total from a number of Estimates and Invoices.
All of this information is read automatically using the existing text layer of a computer generated PDF, such as those created using PDF printer drivers. Template and dictionary matching algorithms are used to locate and extract the correct data values from the text.
Since the existing text is being used, OCR is not performed. This makes processing much faster and avoids OCR recognition errors. OCR can still be used to get text from scanned PDF files that have no existing text.
FAQ Related to PDF Invoice
Coming Soon
Compare Leading Solutions
SimpleIndex™
Kodak Capture Pro™
Kofax Express™
PaperVision™ Capture Desktop
Note: This video depicts PaperVision Capture Desktop, a now-discontinued product that has since been replaced by PaperFlow, an updated version with similar functionality.
Office Gemini DiamondVision™
Testing Methods
The benchmark times were recorded using all available software shortcuts, and by performing data entry and user interactions as fast as possible. The same scanner and computer hardware were used for each test. Much care was taken to ensure that each application yielded the most accurate OCR results possible given the sample documents. Unfortunately, none of our competitors could accurately capture the account number on all 10 pages. The extra time to correct these errors accounts for 15-30% of the difference in processing times. The difference in accuracy is due in large part to SimpleIndex's pattern matching OCR feature, which the other programs lack.