Large Documents (>500 Pages) Slow to Process – Workaround
When working with image PDF files that contain a large number of pages (typically more than 500, though the threshold varies by file and by the PC running the job), SimpleIndex may run into performance issues as it attempts to hold all of those pages in memory and perform the requested operations. Full-text OCR in particular can tax a system in these circumstances.
A workaround in this scenario is to convert the large PDF into a folder of smaller PDF files that can be managed more easily. To minimize the impact on production and avoid taxing the user(s) with extra steps, you can use a third-party splitting tool that can be called from the command line. One such option that has worked well is PDFSplitter from CoolUtils.
One way to automate this process is to use PDFSplitter’s command-line capability in conjunction with SimpleIndex’s Pre-processing function. For simplicity, let’s consider a 600-page PDF whose filename was generated at scan time from indexes provided on a coversheet or keyed by an operator. The goal now is to take that large file and perform a full-text conversion on it.
Our SimpleIndex job (say, Full Page OCR.sic) launches and, before getting to work, calls PDFSplitter from the Pre-processing step with a command such as:
PDFSplitter.exe "C:\Images\Smith - John - Medical History.pdf" C:\Images\Pages\ -cp 100
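Because the source filename contains spaces, it has to reach PDFSplitter as a single quoted argument. The sketch below is one way to assemble and inspect the command before placing it in the Pre-processing step. The function name build_split_command is hypothetical, and the -cp switch is simply copied from the command above rather than checked against PDFSplitter's documentation.

```python
import subprocess


def build_split_command(source_pdf, output_dir, pages_per_file, exe="PDFSplitter.exe"):
    """Return the argument list for the split command shown in this article.

    The "-cp" switch is taken from the article's example; it is not
    verified here against the PDFSplitter documentation.
    """
    return [exe, source_pdf, output_dir, "-cp", str(pages_per_file)]


args = build_split_command(
    r"C:\Images\Smith - John - Medical History.pdf",
    "C:\\Images\\Pages\\",  # a raw string cannot end in a backslash
    100,
)

# list2cmdline applies Windows quoting rules, so the path with spaces
# is wrapped in double quotes and the other arguments are left bare.
command_line = subprocess.list2cmdline(args)
print(command_line)

# To actually run the split on a PC where PDFSplitter is installed:
# subprocess.run(args, check=True)
```

Passing the arguments as a list to subprocess.run avoids hand-written quoting altogether; the printed string is only useful for pasting into a job configuration or a batch file.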
PDFSplitter will run and break that document every 100 pages, creating 6 PDFs in the folder C:\Images\Pages. It keeps the original filename, simply adding “001-100” and so on to the name. After PDFSplitter finishes, the Full Page OCR job begins processing and, because the original filename is still part of the split files’ naming scheme, it can produce one full-text PDF in the final output folder.
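The arithmetic behind the split, and the way the original name can be recovered from each piece, can be illustrated with a short sketch. It assumes the page-range suffix is zero-padded to three digits and sits just before the .pdf extension, possibly after a space or other separator. The exact naming format PDFSplitter uses is not confirmed here, and the helper names chunk_suffixes and original_stem are hypothetical.

```python
import re


def chunk_suffixes(total_pages, chunk_size):
    """List the page-range suffixes produced by splitting every chunk_size pages."""
    suffixes = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)  # last chunk may be short
        suffixes.append(f"{start:03d}-{end:03d}")
    return suffixes


# Assumed pattern: optional separator, then NNN-NNN, then the .pdf extension.
_RANGE_SUFFIX = re.compile(r"[\s_.-]*\d{3,}-\d{3,}\.pdf$", re.IGNORECASE)


def original_stem(split_filename):
    """Strip the assumed page-range suffix and extension from a split file's name.

    Names that do not match the assumed pattern are returned unchanged.
    """
    return _RANGE_SUFFIX.sub("", split_filename)


suffixes = chunk_suffixes(600, 100)
print(len(suffixes), suffixes)
print(original_stem("Smith - John - Medical History 001-100.pdf"))
```

For the 600-page example this yields six ranges, from 001-100 through 501-600, and every split file maps back to the same stem, which is what allows the pieces to be treated as one document.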