Office PDF Document Indexing
SimpleIndex uses the existing text of Microsoft Office documents (Word, Excel, PowerPoint, etc.) and PDF files to extract data using RegEx patterns and database keyword matching. Scanned PDF files are converted to text with OCR. Automatically assign metadata and upload to any document management system.
To have SimpleIndex automatically go to the next page with a blank field and highlight that field when you click the Save Index button or press Enter, do the following.
Go into the Configuration XML and set it from False to True
You can set SimpleIndex to assume that it needs to check every PDF file and fix it.
Go to this location in the Windows Registry:
Create a New String Value called “FixAllPDF” and set the value to 1
To keep all pages in the order in which they were imported, even though they belong to different bookmarks, do the following.
1. Open the configuration in Notepad.
2. Search for <BOOKMARK_PAGE_ORDER>
3. Change this line from “false” to “true”: <BOOKMARK_PAGE_ORDER>true</BOOKMARK_PAGE_ORDER>
4. Save and close.
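For many configuration files, the same edit can be scripted rather than made by hand in Notepad. The following is a minimal sketch, assuming the configuration is plain text containing the <BOOKMARK_PAGE_ORDER> element exactly as shown in step 3. The function name and the sample text are hypothetical and are not part of SimpleIndex.

```python
import re

# Matches the element from step 3 when its value is "false".
# Assumption: the element sits on one line with no attributes.
_BOOKMARK_FALSE = re.compile(
    r"(<BOOKMARK_PAGE_ORDER>)\s*false\s*(</BOOKMARK_PAGE_ORDER>)",
    re.IGNORECASE,
)


def enable_bookmark_page_order(config_text):
    """Return config_text with BOOKMARK_PAGE_ORDER switched from false to true."""
    return _BOOKMARK_FALSE.sub(r"\1true\2", config_text)


# Hypothetical sample; a real configuration contains many other settings.
sample = "<JOB>\n<BOOKMARK_PAGE_ORDER>false</BOOKMARK_PAGE_ORDER>\n</JOB>"
print(enable_bookmark_page_order(sample))
```

To apply it to real files, read each configuration file as text, pass the text through the function, and write the result back. Keep a backup copy of the original file first.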
Windows Search works great with SimpleIndex because all index data can be saved to the folder and file names as well as the file properties, and OCR text can be saved to hidden layers in PDF files. Windows Search will read all of these elements when building its index and will return any matching files when you search. Using Windows Search on a file server allows for instantaneous searching across terabytes of documents and text for all of the users on your network.
IFilters allow Windows Search to search within file contents. Here are three popular PDF IFilters that will enable text searching for PDF files:
Foxit PDF IFilter (commercial)
TET PDF IFilter (free/commercial)
Adobe PDF IFilter (32-bit / 64-bit) (free)
If you have issues with PDF text searching in Windows 10, this article has detailed instructions for resolving PDF IFilter issues: https://fixedit.itxpress.biz/2018/07/05/searching-pdfs-in-windows-10/
Is there a way to just use part of a bar code or OCR value? For example, extract “50” from the value “124450”
For this example, create a barcode field (Field 1, for example) and a second field with type “Fixed”. In the template for the second field, enter %FIELD1[5,2]% to get “50” from “124450”.
%FIELD1% would get the entire value for Field #1, the barcode field. By adding the [5,2] you tell SimpleIndex to start at the 5th character (5) and take 2 characters from the value (50).
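The [start,length] rule can be checked outside SimpleIndex with a short sketch. The code below is an illustration only, not SimpleIndex's own parser. The token grammar is an assumption based on the two forms shown above (%FIELD1% and %FIELD1[5,2]%), and the function name expand_template is hypothetical.

```python
import re

# Matches %FIELDn% or %FIELDn[start,length]%.
# Assumption: these are the only two forms, as in the examples above.
_TOKEN = re.compile(r"%FIELD(\d+)(?:\[(\d+),(\d+)\])?%")


def expand_template(template, fields):
    """Expand field tokens using a dict that maps field number to its value."""
    def substitute(match):
        value = fields[int(match.group(1))]
        if match.group(2) is None:
            return value  # no [start,length]: use the entire value
        start = int(match.group(2))   # 1-based position of the first character
        length = int(match.group(3))  # number of characters to take
        return value[start - 1:start - 1 + length]

    return _TOKEN.sub(substitute, template)


print(expand_template("%FIELD1[5,2]%", {1: "124450"}))  # prints 50
```

Counting from 1, the fifth character of “124450” is “5”, and taking two characters from that position gives “50”.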
You can tell SimpleIndex which file types to process and which to ignore. Click “Job Options”. On the “Batch” tab you will find a field labeled “Input file types or mask”. These are the file types that SimpleIndex will input files from. The default types are:
TIF,PDF,JPG,GIF,BMP,DOC,XLS,PPT,DOCX,XLSX,PPTX,VSD,DWG,AVI,MP3
To process all files, enter *
SimpleIndex will ignore any file whose extension does not appear on the list.
In SimpleIndex 6 or above you can enter file masks to filter input files. Some examples are:
abc*.pdf (PDF files starting with “abc”)
ab??ef.* (All files starting with “ab”, followed by any 2 characters and “ef”)
It is possible to have some file types open automatically in their default application. This can be done by inserting a pipe “|” into the list. Any file types after the pipe will be opened in their default application. For example: TIF,PDF,JPG|WAV,M
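To preview which files a given setting would select, the matching rules described above can be approximated in a few lines. This is a sketch under stated assumptions, not SimpleIndex's actual matching code: a bare entry such as PDF is treated as the mask *.pdf, matching is case-insensitive, and the function names and the sample setting string are hypothetical.

```python
from fnmatch import fnmatchcase


def parse_file_types(setting):
    """Split the setting at the pipe into (processed, opened_externally) lists."""
    def split(part):
        return [item.strip() for item in part.split(",") if item.strip()]

    processed_part, _, external_part = setting.partition("|")
    return split(processed_part), split(external_part)


def matches(filename, entries):
    """Return True if filename matches any extension or mask in entries."""
    name = filename.lower()
    for entry in entries:
        entry = entry.lower()
        # Assumption: an entry with a wildcard or a dot is a mask;
        # anything else is a bare extension.
        is_mask = any(ch in entry for ch in "*?.")
        pattern = entry if is_mask else "*." + entry
        if fnmatchcase(name, pattern):
            return True
    return False


processed, external = parse_file_types("TIF,PDF,JPG|WAV")
print(matches("scan001.TIF", processed))     # True
print(matches("notes.txt", processed))       # False
print(matches("abXYef.doc", ["ab??ef.*"]))   # True
```

Lowercasing both sides and using fnmatchcase keeps the result the same on every operating system, which suits Windows file names where case does not matter.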
MS Office and PDF files generated by software or PDF printer drivers already contain the text you need to recognize. Scanned documents need OCR to read text from an image of the page. With Office and PDF files, SimpleIndex can just read the text, which is much faster and more accurate than image OCR.
To recognize index fields from the document text, first create OCR fields on the Index tab as you would normally. Next, on the Zones & OCR options tab, check the “Use Full Page OCR for this Field” option for each OCR field. This tells SimpleIndex to process the existing file text.
If the index value is a unique pattern of digits or one of a list of possible values, use Template or Dictionary matching to locate the value within the text. Please see the manual for details on Template and Dictionary matching.
If the value appears in a specific location in each file, coordinates can be used to locate it. When processing text, the X, Y, Width and Height settings correspond to