Processing and text extraction of Microsoft Office, Adobe PDF files, HTML files and other electronic documents.
Change the Dictionary Separator Value
This setting changes the dictionary separator used in thesaurus matching from the default pipe character (|) to any character(s) that you want. This is useful when the values in your list or dictionary might themselves include the pipe character (“|”, typed with Shift+Backslash).
This setting is also used as the delimiter when parsing multiple index field values from bar codes (e.g. field1|field2|field3).
Instructions for changing the dictionary separator value:
- Right-click the Job Configuration file that you would like to change and select Open With > Notepad.
- Search the XML settings text open in Notepad for this term: <OCR_DICT_SEPARATOR>
- Change the value in between the tags from “|” to any other single character that you want.
- For TAB separation use %TAB%
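The same edit can be scripted when several configuration files need it. The following is a minimal sketch, not a SimpleIndex tool: it treats the configuration as plain text, and the `set_setting` helper and the `<JOB>` wrapper in the sample are hypothetical names used only for illustration. It assumes the setting appears once, as `<OCR_DICT_SEPARATOR>|</OCR_DICT_SEPARATOR>`.

```python
import re
from xml.sax.saxutils import escape


def set_setting(xml_text: str, tag: str, value: str) -> str:
    """Replace the text between <tag> and </tag>; raise if the tag is absent."""
    pattern = re.compile(r"(<{0}>)(.*?)(</{0}>)".format(re.escape(tag)), re.DOTALL)
    # Characters such as & and < must be escaped to keep the XML well-formed.
    safe_value = escape(value)
    # A function replacement keeps any backslashes in the value literal.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + safe_value + m.group(3), xml_text, count=1
    )
    if count == 0:
        raise ValueError(f"<{tag}> not found in configuration text")
    return new_text


# Hypothetical fragment; a real Job Configuration file contains many more settings.
sample = "<JOB>\n  <OCR_DICT_SEPARATOR>|</OCR_DICT_SEPARATOR>\n</JOB>"
print(set_setting(sample, "OCR_DICT_SEPARATOR", "%TAB%"))
```

To apply it to a real file, read the file into a string, pass it through the helper, and write the result back. Keep a backup copy of the original configuration first.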

Check and Repair All PDF Files
You can configure SimpleIndex to assume that every PDF file needs to be checked and repaired.
Go to this location in the Windows Registry:
Computer\HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\SimpleIndex\Misc
Create a new String Value named “FixAllPDF” and set its value to 1.
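If the same registry value has to be set on several workstations, a .reg file may be easier than editing each registry by hand. The sketch below only generates the text of such a file. The `reg_file_text` helper is a hypothetical name, and the key is the path given above without the “Computer\” prefix that the Registry Editor address bar shows.

```python
def reg_file_text(key: str, name: str, data: str) -> str:
    """Return .reg file text that sets one string (REG_SZ) value under `key`."""

    def esc(s: str) -> str:
        # Inside quoted names and data, .reg syntax escapes backslash and quote.
        return s.replace("\\", "\\\\").replace('"', '\\"')

    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{key}]\n"
        f'"{esc(name)}"="{esc(data)}"\n'
    )


KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\SimpleIndex\Misc"
text = reg_file_text(KEY, "FixAllPDF", "1")
print(text)
```

Save the printed text to a file with a .reg extension and import it with the Registry Editor. Writing under HKEY_LOCAL_MACHINE normally requires administrator rights.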
Keep Pages in Original Order when Bookmarking
To keep all the pages in the same order in which they were imported, even though they belong to different bookmarks, do the following:
1. Open the configuration in Notepad.
2. Search for <BOOKMARK_PAGE_ORDER>
3. Change this line from “false” to “true”: <BOOKMARK_PAGE_ORDER>true</BOOKMARK_PAGE_ORDER>
4. Save and close.
Do Not Combine Pages to 1 Bookmark
If you want to keep pages in bookmarks separate, instead of combining them into a single bookmark when the same bookmark value is found in several interspersed images in the batch, do the following:
1. Open the Job Configuration file in Notepad.
2. Search for this value: <BOOKMARK_PDF1>
3. Enter this directly above the line that has <BOOKMARK_PDF1> if it's not already there: <BOOKMARK_UNIQUE_LEVELS>-1</BOOKMARK_UNIQUE_LEVELS>
4. Set the value as needed. -1 is the default and means that no pages are combined into one bookmark unless they fall in order. 0 means that the first bookmark level is combined into one bookmark value and the rest are not. 1 means that the first and second bookmark levels are combined and the rest are not, and so on.
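Steps 2 and 3 can also be scripted. This is a rough sketch that works on the configuration as plain text. The `ensure_unique_levels` helper and the `<JOB>` wrapper in the sample are hypothetical, and it assumes <BOOKMARK_PDF1> sits on its own line, as the steps above imply.

```python
def ensure_unique_levels(xml_text: str, levels: int = -1) -> str:
    """Add <BOOKMARK_UNIQUE_LEVELS> above the <BOOKMARK_PDF1> line if missing."""
    if "<BOOKMARK_UNIQUE_LEVELS>" in xml_text:
        return xml_text  # already present, so leave the text unchanged
    lines = xml_text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if "<BOOKMARK_PDF1>" in line:
            # Reuse the indentation and line ending of the line we insert above.
            indent = line[: len(line) - len(line.lstrip())]
            newline = "\r\n" if line.endswith("\r\n") else "\n"
            lines.insert(
                i,
                f"{indent}<BOOKMARK_UNIQUE_LEVELS>{levels}</BOOKMARK_UNIQUE_LEVELS>{newline}",
            )
            return "".join(lines)
    raise ValueError("<BOOKMARK_PDF1> not found in configuration text")


# Hypothetical fragment; the value inside BOOKMARK_PDF1 is a placeholder.
sample = "<JOB>\n  <BOOKMARK_PDF1>Field1</BOOKMARK_PDF1>\n</JOB>\n"
print(ensure_unique_levels(sample))
```

Running the helper on text that already contains the setting returns it unchanged, so it is safe to apply more than once.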
Can I split a PDF based on bookmark values?
SimpleIndex can create PDF files with bookmarks based on the index data captured in your batch.
Going the other way (splitting an existing PDF file based on the bookmark value) is not a built-in feature of SimpleIndex. However, there are inexpensive command-line utilities that you can integrate with SimpleIndex to accomplish this.
For example, the CoolUtils PDFSplitter and A-PDF Split both offer this function starting around $35.
The command line to split the PDF can be integrated into the Pre-Process setting in SimpleIndex, found under the Advanced Settings section of the Configuration Wizard. An example pre-process using PDFSplitter to split based on the second level bookmark values would be:
PDFSplitter.exe "c:\Images\BookmarkFile.pdf" "%CONFIGFILEFOLDER%\Input" -em bookmarks -b 2
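If the same pre-process is needed for several jobs or bookmark levels, the command line can be assembled rather than retyped. The sketch below only formats the string shown above. The `split_command` helper is a hypothetical name, the flags are copied from the example and not checked against the PDFSplitter documentation, and the paths are assumed not to end in a backslash.

```python
def split_command(pdf_path: str, output_folder: str, bookmark_level: int) -> str:
    """Build a PDFSplitter pre-process command line like the example above."""
    # Straight double quotes keep a path that contains spaces as one argument.
    return (
        f'PDFSplitter.exe "{pdf_path}" "{output_folder}" '
        f"-em bookmarks -b {bookmark_level}"
    )


print(split_command(r"c:\Images\BookmarkFile.pdf", r"%CONFIGFILEFOLDER%\Input", 2))
```

Note that the quotes must be plain straight quotes. Curly quotes pasted from a web page or word processor are not treated as quoting characters by the Windows command line.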
Is it possible to search for and retrieve documents with Windows desktop search?
Windows Search works great with SimpleIndex because all index data can be saved to the folder and file names as well as the file properties, and OCR text can be saved to hidden layers in PDF files. Windows Search will read all of these elements when building its index and will return any matching files when you search.
Using Windows Search on a file server allows for instantaneous searching across terabytes of documents and text for all of the users on your network.
IFilters allow Windows Search to search within file contents.
Here are three popular PDF IFilters that will enable text searching for PDF files:
- Foxit PDF IFilter (commercial)
- TET PDF IFilter (free/commercial)
- Adobe PDF IFilter (32-bit / 64-bit) (free)
If you have issues with PDF text searching in Windows 10, this article has detailed instructions for resolving PDF IFilter issues:
https://fixedit.itxpress.biz/2018/07/05/searching-pdfs-in-windows-10/
How do you configure OCR to read index information from MS Office or PDF documents?
MS Office and PDF files generated by software or PDF printer drivers already contain the text you need to recognize. Scanned documents need OCR to read text from an image of the page. With Office and PDF files, SimpleIndex can simply read the text, which is much faster and more accurate than image OCR.
To recognize index fields from the document text, first create OCR fields on the Index tab as you would normally. Next, on the Zones & OCR options tab, check the “Use Full Page OCR for this Field” option for each OCR field. This tells SimpleIndex to process the existing file text.
If the index value is a unique pattern of digits or list of possible values, use Template or Dictionary matching to locate the value within the text. Please see the manual for details on Template and Dictionary matching.
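To illustrate the idea behind the two matching styles, here is a small sketch. It uses Python regular expressions as a stand-in and is not SimpleIndex's Template syntax or its matching engine; the manual remains the reference for the real settings. The function names and the sample page text are made up for the example.

```python
import re
from typing import Optional, Sequence


def match_template(text: str, pattern: str) -> Optional[str]:
    """Return the first substring of `text` that matches the regex `pattern`."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


def match_dictionary(text: str, values: Sequence[str]) -> Optional[str]:
    """Return the listed value that appears earliest in `text`, ignoring case."""
    lowered = text.lower()
    hits = [(lowered.find(v.lower()), v) for v in values if v.lower() in lowered]
    return min(hits)[1] if hits else None


page_text = "Invoice No: 20417731\nVendor: Acme Supply Co.\nTotal due: 118.40"

# Pattern-style match: an eight-digit number standing on its own.
print(match_template(page_text, r"\b\d{8}\b"))  # 20417731

# List-style match: whichever known vendor name occurs in the text.
print(match_dictionary(page_text, ["Globex", "Acme Supply", "Initech"]))  # Acme Supply
```

A pattern works when the value has a recognizable shape, while a list works when the set of possible values is known in advance.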
If the value appears in a specific location in each file, coordinates can be used to locate it. When processing text, the X, Y, Width and Height settings correspond to line and column numbers within the file text. This is explained in greater depth in the manual.
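As a rough illustration of zone coordinates applied to text, the sketch below treats Y and Height as line numbers and X and Width as column numbers. It is only a conceptual model: the 1-based numbering and the `extract_zone` helper are assumptions, so check the manual for how SimpleIndex actually counts lines and columns.

```python
def extract_zone(text: str, x: int, y: int, width: int, height: int) -> str:
    """Return the block of text starting at column x, line y (both 1-based)."""
    if x < 1 or y < 1 or width < 0 or height < 0:
        raise ValueError("x and y start at 1; width and height cannot be negative")
    rows = text.splitlines()[y - 1 : y - 1 + height]
    # Lines shorter than the zone simply contribute fewer characters.
    return "\n".join(row[x - 1 : x - 1 + width] for row in rows)


file_text = "ACME SUPPLY CO.\nInvoice: 20417731\nDate:    2024-03-15"

# Line 2, starting at column 10, eight characters wide, one line high.
print(extract_zone(file_text, x=10, y=2, width=8, height=1))  # 20417731
```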
SimpleIndex will assume that any TXT file with the same name as a file being processed is the OCR text for that file, so this method can work with any type of file.
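The lookup described here can be pictured as follows. This sketch assumes that “same name” means the same base name with a .txt extension in the same folder (for example scan0001.tif and scan0001.txt) and that the text is UTF-8; both are assumptions for illustration, and `sidecar_text` is a hypothetical helper, not part of SimpleIndex.

```python
from pathlib import Path
from typing import Optional


def sidecar_text(document: Path) -> Optional[str]:
    """Return the contents of the matching .txt file, or None if there is none."""
    candidate = document.with_suffix(".txt")
    # A .txt document is not its own sidecar, and the file may not exist.
    if candidate == document or not candidate.is_file():
        return None
    return candidate.read_text(encoding="utf-8", errors="replace")
```

Preparing such a TXT file with another tool is one way to supply text for file types that have no readable text of their own.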
Find out more about Optical Character Recognition on the SimpleOCR Guide.
Can SimpleIndex create searchable PDF Image+Text files with hidden text?
If you enable full-page OCR and output to PDF, the full-page OCR text will be inserted as invisible text on each page.
With the addition of the FineReader Engine in version 7, SimpleIndex now creates PDF files with fully searchable text formatted to flow with the image of the document.