Use Full Page Text

SimpleIndex can use the existing text in any MS Office, PDF, or text-based format such as HTML or XML, and apply template- and dictionary-based pattern matching to it to index documents automatically.

To have a field take its value from this text, set the Text Source setting in the Index Field Wizard to Use Full Page Text.

Use the Skip OCR if Text Exists option to perform full page OCR only on pages that don't already have text. This is useful for formats such as PDF, where a page may be either an image or text.

Use the Zone Coordinates to indicate the row and column of the text. The full path to the input file is added as the first row, which is indexed at 0. Entering 0 in the Y coordinate therefore includes the input file path, while entering 1 starts the template or dictionary match at the first line of text.
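
The row numbering described above can be pictured with a short Python sketch. This is only an illustration of the indexing, not SimpleIndex's actual code. The function names, the sample path, and the sample text are all hypothetical.

```python
def build_rows(input_path, page_text):
    """Model the rows as described: the file path is row 0,
    followed by each line of the page text starting at row 1."""
    return [input_path] + page_text.splitlines()


def rows_from(rows, y):
    """Return the rows that a match starting at Y coordinate `y` would see."""
    return rows[y:]


# Hypothetical input file and text, for illustration only.
sample_path = r"C:\Scans\invoice_001.pdf"
sample_text = "Invoice No: 12345\nDate: 2020-01-31"

rows = build_rows(sample_path, sample_text)

with_path = rows_from(rows, 0)     # Y = 0: file path is included
without_path = rows_from(rows, 1)  # Y = 1: starts at first line of text

print(with_path[0])
print(without_path[0])
```

With Y set to 0 the first row seen is the file path, which is useful when part of the path or file name is itself an index value. With Y set to 1 the match begins on the first line of the document text.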

Switch to the OCR Text View to view the source text used for each file. To find row and column values, we recommend copying and pasting the text into a proper text editor like Notepad++ that displays the row and column values. We could build one into SimpleIndex, but there are a lot of things we could do and only so much time!
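
If you would rather script the lookup than read positions off an editor's status bar, a minimal Python sketch along these lines may help. It is not part of SimpleIndex, and the function name and sample text are hypothetical. It assumes the pasted text does not include the file path line, so the first line of text is reported as row 1, and it assumes columns are counted from 0. Adjust `first_row` and `first_col` if your job shows otherwise.

```python
def find_positions(text, term, first_row=1, first_col=0):
    """Return a list of (row, column) pairs for every occurrence of `term`.

    first_row and first_col are assumptions about how rows and columns
    are numbered; change them to match what you observe in your job.
    """
    positions = []
    for row, line in enumerate(text.splitlines(), start=first_row):
        col = line.find(term)
        while col != -1:
            positions.append((row, col + first_col))
            col = line.find(term, col + 1)
    return positions


# Hypothetical text pasted from the OCR Text View.
pasted = "ACME Corp\nInvoice No: 12345\nTotal: 99.00"

label_positions = find_positions(pasted, "Invoice No:")
value_positions = find_positions(pasted, "12345")

print(label_positions)
print(value_positions)
```

Check the reported values against a known field in one of your own files before relying on them for zone settings.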

PDF text layers are made up of separate text objects instead of flowing text like an MS Office document. As a result, text at the top of the page may appear unexpectedly at the bottom. You will see the same thing if you copy and paste the text from Acrobat and other applications. Unfortunately this can't be helped, and you must plan your jobs accordingly.