Processing and text extraction of Microsoft Office, Adobe PDF files, HTML files and other electronic documents.
Do Not Combine Pages to 1 Bookmark
To keep pages in separate bookmarks, instead of combining them into a single bookmark, when the same bookmark value is found in several interspersed images in the batch, do the following:
1. Open the Job Configuration file in Notepad.
2. Search for this value: <BOOKMARK_PDF1>
3. Enter this directly above the line that has <BOOKMARK_PDF1> if it's not already there: <BOOKMARK_UNIQUE_LEVELS>-1</BOOKMARK_UNIQUE_LEVELS>
4. Set the value as needed. The default, -1, means that no pages are combined into one bookmark unless they fall in order. 0 means that the first bookmark level is combined into one bookmark value and the rest are not. 1 means that the first and second bookmark levels are combined and the rest are not, and so on for higher values.
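For anyone who would rather script the edit than make it by hand in Notepad, the steps above can be sketched in Python. This is only a sketch: it assumes the Job Configuration file is plain text with each tag on its own line, and the function name `ensure_unique_levels` and the sample `<JOB>` wrapper are hypothetical, not part of SimpleIndex.

```python
def ensure_unique_levels(config_text: str, value: int = -1) -> str:
    """Insert <BOOKMARK_UNIQUE_LEVELS> directly above the <BOOKMARK_PDF1> line.

    Returns the text unchanged if the setting is already present.
    """
    if "<BOOKMARK_UNIQUE_LEVELS>" in config_text:
        return config_text  # setting already there; leave the file as-is

    setting = f"<BOOKMARK_UNIQUE_LEVELS>{value}</BOOKMARK_UNIQUE_LEVELS>"
    lines = config_text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if "<BOOKMARK_PDF1>" in line:
            # Reuse the indentation and line ending of the line we sit above.
            indent = line[: len(line) - len(line.lstrip())]
            newline = "\r\n" if line.endswith("\r\n") else "\n"
            lines.insert(i, f"{indent}{setting}{newline}")
            return "".join(lines)
    raise ValueError("no <BOOKMARK_PDF1> line found in the configuration text")


# Hypothetical configuration fragment, for illustration only.
sample = "<JOB>\n  <BOOKMARK_PDF1>Field1</BOOKMARK_PDF1>\n</JOB>\n"
print(ensure_unique_levels(sample))
```

Back up the configuration file before writing the result to disk, and pass a different `value` (0, 1, and so on) to change how many bookmark levels are combined.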
Can I split a PDF based on bookmark values?
SimpleIndex can create PDF files with bookmarks based on the index data captured in your batch.
Going the other way, splitting an existing PDF file based on its bookmark values, is not a built-in feature of SimpleIndex. However, there are inexpensive command line utilities that you can integrate with SimpleIndex to accomplish this.
For example, the CoolUtils PDFSplitter and A-PDF Split both offer this function starting around $35.
The command line to split the PDF can be integrated into the Pre-Process setting in SimpleIndex, found under the Advanced Settings section of the Configuration Wizard. An example pre-process using PDFSplitter to split based on the second level bookmark values would be:
PDFSplitter.exe "c:\Images\BookmarkFile.pdf" "%CONFIGFILEFOLDER%\Input" -em bookmarks -b 2
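To make clear what splitting on bookmark values means, here is a minimal Python sketch of the underlying idea: each bookmark at the chosen level starts a new output file that runs until the page before the next bookmark. It is not how PDFSplitter or A-PDF Split is implemented. The function name `bookmark_page_ranges` and the sample data are hypothetical, and the sketch assumes 1-based page numbers with each bookmark starting on a different page.

```python
def bookmark_page_ranges(bookmarks, page_count):
    """Turn (title, start_page) bookmarks into (title, first_page, last_page) ranges.

    Pages are 1-based. Each range ends on the page before the next bookmark
    starts; the last range ends on the final page of the document.
    """
    ordered = sorted(bookmarks, key=lambda b: b[1])  # sort by start page
    ranges = []
    for idx, (title, start) in enumerate(ordered):
        if idx + 1 < len(ordered):
            end = ordered[idx + 1][1] - 1
        else:
            end = page_count
        ranges.append((title, start, end))
    return ranges


# Hypothetical second-level bookmarks in an 8-page file.
for title, first, last in bookmark_page_ranges(
    [("Invoice 1001", 1), ("Invoice 1002", 4), ("Invoice 1003", 6)], 8
):
    print(f"{title}: pages {first}-{last}")
```

Each resulting range corresponds to one output PDF that the splitting utility would write to the input folder for SimpleIndex to pick up.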
Is it possible to search for and retrieve documents with Windows desktop search?
Windows Search works great with SimpleIndex because all index data can be saved to the folder and file names as well as the file properties, and OCR text can be saved to hidden layers in PDF files. Windows Search will read all of these elements when building its index and will return any matching files when you search.
Using Windows Search on a file server allows for instantaneous searching across terabytes of documents and text for all of the users on your network.
IFilters allow Windows Search to search within file contents.
Here are three popular PDF IFilters that will enable text searching for PDF files:
- Foxit PDF IFilter (commercial)
- TET PDF IFilter (free/commercial)
- Adobe PDF IFilter (32-bit / 64-bit) (free)
If you have issues with PDF text searching in Windows 10, this article has detailed instructions for resolving PDF IFilter issues:
https://fixedit.itxpress.biz/2018/07/05/searching-pdfs-in-windows-10/
Can SimpleIndex create searchable PDF Image+Text files with hidden text?
Yes, it can. You can configure this setting in the Job Settings Wizard by going to the OCR step and checking “Enable full-page OCR”. There are many settings in the OCR step that you can use to customize the output and recognition of images.
SimpleIndex has two different OCR engines (Standard and Professional) that can be used to produce PDF Image + Text files or Searchable PDFs.