AI and Large Language Models (LLMs) are the hot new trend when it comes to extracting data from documents, but don’t believe the hype! Most document-based workflows don’t need it, and are much more affordable without it.
Let’s face it. Document scanning, PDF processing, and data capture applications are old and boring. Optical Character Recognition (OCR) technology has been around for over 50 years. The most popular data processing platforms like Kofax (now called “Tungsten Automation”) and ABBYY FlexiCapture were developed over 25 years ago. There have been many incremental improvements over the years, but the core technology is the same.
LLMs are the first radically new approach to data capture since the invention of OCR. Naturally, many software providers in this space have been eager to sell these solutions, since they have not had much new to sell in quite some time. Billions have been invested in LLM-based AI solutions, and more investment money now rides on the success of AI than on any other innovation in history.
With so much at stake, data capture companies that sell AI solutions have enormous incentive to lie to you about every aspect of LLMs:
- What solutions they are practical for
- How accurate and reliable the results are
- How easy they are to configure
- What they will do with your company’s data
- How much they will cost in the long term
If you are evaluating document processing or data capture solutions, this guide will help you critically evaluate the claims of the AI sales and marketing departments that have unleashed a tsunami of misinformation about LLM capabilities and costs.
This article was written by a human to avoid irony.
When Are LLMs Practical for Document Processing?

LLM costs will be covered in more detail later, but they lie at the heart of the practicality problem. OCR is much less expensive on a per-page basis. So much so, in fact, that LLMs are impractical for any document capture process that does not require the specific capabilities only an LLM can provide. (You can find a more detailed cost comparison for document processing between LLMs and traditional OCR here.)
What are those capabilities? LLMs excel when document data has no formatting or structure that allows for a template, pattern matching, or rule-based method of reliably identifying data fields. Examples include letters, notes, articles, books, legal documents, and others where the desired data is contained within paragraphs of text instead of form fields.
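For this kind of unstructured extraction, the workflow usually amounts to building a prompt that names the fields you want and parsing a structured reply. A minimal Python sketch follows; the field names, prompt wording, and simulated reply are illustrative assumptions, and a real call to your LLM provider would replace the hard-coded response:

```python
import json

def build_extraction_prompt(document_text: str) -> str:
    """Ask the model to return only a JSON object with the fields we need."""
    return (
        "Extract the following fields from the document below and return "
        "ONLY a JSON object with keys 'party_a', 'party_b', and "
        "'effective_date' (ISO 8601). Use null for any missing field.\n\n"
        + document_text
    )

def parse_extraction_response(raw: str) -> dict:
    """Parse the model's JSON reply, tolerating missing keys."""
    data = json.loads(raw)
    return {k: data.get(k) for k in ("party_a", "party_b", "effective_date")}

# Simulated reply -- a real workflow would send build_extraction_prompt(...)
# to an LLM provider and parse whatever comes back.
reply = '{"party_a": "Acme Corp", "party_b": "Widget LLC", "effective_date": "2024-03-01"}'
fields = parse_extraction_response(reply)
print(fields["effective_date"])  # 2024-03-01
```

The key design point is asking for JSON only, so the reply can be machine-parsed rather than scraped out of free text.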
Another practical use for LLMs is to create document summaries or classify them into categories. Once again, you need to have a good use case for it because they are expensive to generate. This can be useful for doing research where you have a large collection of various documents without reliable metadata and need to be able to identify which ones may be of interest without having to scan the contents of millions of pages.
The final use case is small-volume applications, where the setup cost of an OCR solution would exceed the cost of simply running the documents through an LLM and using prompts to get the data you want. The exact volume depends on the complexity of the documents and the data being captured, but the cutoff is usually around a few hundred pages for simple documents or a few thousand for more complex ones.
You can learn more about how OCR can use AI here, and about the AI capabilities SimpleIndex has here.
How Accurate Is LLM-Based Data Capture?

If you have heard anything about Large Language Models then you have probably heard that they “hallucinate.” This means that they often give completely incorrect answers to queries that a human would easily understand. This is because they use statistical models to answer questions, not logic, pattern matching, or inference.
When it comes to data capture applications, hallucinations mean that these solutions will sometimes report incorrect results, and there is no good way for a human to provide a manual override that ensures these mistakes don’t continue in the future. If you do have the ability to override the AI results using zone-coordinate or pattern-matching templates, then you probably didn’t need an LLM in the first place! You’re still entering all of the same parameters to verify the AI data extraction that you would need to implement an OCR-based solution, which can do the same thing for a fraction of the cost.
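For reference, the kind of pattern-matching template that makes OCR-based extraction cheap is often just a regular expression over the OCR text. A minimal sketch, with an assumed invoice-number format:

```python
import re

# Once OCR has produced text, a regular expression locates the field
# deterministically, with no per-page model cost. The label variants and
# value format here are illustrative.
INVOICE_NO = re.compile(r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)",
                        re.IGNORECASE)

ocr_text = "ACME Supplies\nInvoice No: INV-2024-0173\nDate: 03/01/2024"
match = INVOICE_NO.search(ocr_text)
print(match.group(1))  # INV-2024-0173
```

The same result on every run, and a human can correct the pattern once and trust the fix going forward.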
LLMs also don’t return a confidence score, so incorrect results are often hard to distinguish from correct ones. This can confound data validation efforts and raise red flags with compliance auditors.
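By contrast, OCR confidence scores make a simple review workflow possible. A minimal sketch, assuming an illustrative field structure and a 90% threshold:

```python
# Traditional OCR engines return per-character or per-field confidence
# scores, which let a workflow route doubtful results to human review.
# The field structure and threshold below are illustrative assumptions.
REVIEW_THRESHOLD = 0.90

def route_fields(ocr_fields):
    """Split OCR output into auto-accepted fields and ones needing review."""
    accepted, needs_review = {}, {}
    for name, (value, confidence) in ocr_fields.items():
        if confidence >= REVIEW_THRESHOLD:
            accepted[name] = value
        else:
            needs_review[name] = value
    return accepted, needs_review

fields = {"invoice_no": ("INV-0173", 0.98), "total": ("1,249.00", 0.72)}
accepted, needs_review = route_fields(fields)
print(sorted(needs_review))  # ['total']
```

Without a confidence signal, every LLM result has to be treated as equally trustworthy, or every result has to be reviewed.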
There are many reports that claim to benchmark LLM-based versus OCR-based solutions. (You can find some such benchmarks here and an invoice OCR benchmark here.) With these tests it is important to compare the benchmark data to your own. For example, many of the invoice processing benchmarks are based on sample data that includes very poor quality or handwritten invoices, which make OCR look much worse than LLMs. But how many of your vendors actually send you handwritten invoices? If, like many companies, the answer is “none,” then it makes no sense to compare solutions based on their performance with these.

Another thing to consider is whether there are industry-specific data points to capture that would not be part of the standard training data. For example, freight invoices can have many different coded charges for freight, handling, tariffs, fuel surcharges, and other items that are not part of the standard “line item” model for most invoices. LLMs can do a better job with these than OCR-based solutions that have been “trained” on millions of general AP invoices, but they won’t always be able to map values that use a variety of different keywords to the same data field. This is especially the case when dealing with international invoices in many languages. The more languages you have to deal with, the better traditional OCR does at supporting them, since advanced AI-based invoice solutions typically only support English and a few other languages. If you prompt an LLM to provide the “handling fee” for an invoice printed in German, will it know to translate that label and identify it correctly?
How Easy Are LLM Data Capture Solutions to Configure?
In theory, configuring a document workflow using LLMs should be easy. You send it document images or OCR text along with English language prompts that request specific metadata, and it returns a structured data file with the metadata you asked for. How could it be easier?
This gets trickier when you have variability in how documents are labeled and you aren’t using a model that was trained specifically on these types of documents. Let’s say you want to capture the “Customer Name,” but some documents say “customer” while others say “member,” “recipient,” “bill to,” or any number of other labels. Now you have to figure out how to write a prompt telling the LLM that these are all the same data field. Engineering this prompt is on par with the difficulty of setting up an OCR solution to extract the same data. And the more detailed the prompt, the higher the cost per image.
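One common workaround, whether it ends up inside an LLM prompt or in a rule-based pipeline, is to enumerate the label synonyms explicitly. A minimal Python sketch, with an assumed synonym list:

```python
# Illustrative synonym list: every label variant that should map to the
# same canonical field. The same list can be pasted into an LLM prompt
# or used directly for deterministic normalization, as here.
CUSTOMER_LABELS = {"customer", "member", "recipient", "bill to", "bill-to", "client"}

def normalize_label(label):
    """Map any known customer-name label variant to the canonical field name."""
    return "customer_name" if label.strip().lower() in CUSTOMER_LABELS else None

print(normalize_label("Bill To"))   # customer_name
print(normalize_label("Ship Via"))  # None
```

Either way, someone still has to discover and maintain that list, which is exactly the configuration work the LLM was supposed to eliminate.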
Prompt engineering isn’t magic. You often need to give very specific instructions to an LLM in order to get the exact results you need when precision and accuracy are important. When a simple prompt doesn’t do the job, figuring out the exact way to phrase your question can be just as confusing as trying to figure out Regular Expressions.

What Will They Do With Your Data?
You need to be very careful with the data that you give to an LLM. Many times this data can be used for training, in which case the right prompt could cause your company’s data to appear in someone else’s response! This kind of AI data leakage can have serious repercussions when dealing with confidential HR, financial, or medical data.
Why do you want to give all of your trade secrets to Sam Altman, Jeff Bezos, Mark Zuckerberg, or Elon Musk? Do you trust these people to be stewards of your data and to never mine it for their own profit?
Traditional OCR solutions don’t even need to run in the cloud! A simple desktop PC with no Internet connection can process hundreds of thousands of pages per day using OCR, with no risk of exposing this data to hackers or the data-driven oligarchy. (You can find in-depth research on the topic of Cloud vs. Sunshine (On-Premise) OCR here.)
The Long-Term Costs of AI

It is not always easy to compare the cost of document processing by traditional OCR and LLMs because of the wide variety of pricing options. Different OCR solutions offer many options, including ICR (Intelligent Character Recognition, for handwritten text), cloud support, server options, and many others.
There is good research on the topic of the total cost of ownership of OCR software; you can read about it here.
ABBYY FlexiCapture is more expensive per page, but be sure you are comparing apples to apples. FlexiCapture is a complete enterprise OCR solution that includes interfaces for scanning, data verification, implementation of business rules, and integration with various back-end systems. With the exception of SimpleIndex, the other options are API-based solutions that require extensive coding to use. Other enterprise OCR applications implement these models and provide a full user interface; otherwise, you have to code the integration yourself.
As an alternative option, we used the SimpleIndex OCR Server 1M offer ($1,400 for 1M pages, which works out to $0.0014 per page). As you can see, most of the LLM solutions are significantly more expensive than SimpleIndex, and they still require you to build a user interface to make them useful. Plus, you can always implement any of these APIs with SimpleIndex and take advantage of its low-cost user interface with just a few hours of coding!
With current pricing, LLM-based data capture is often significantly more expensive than OCR. How much so depends on the types of documents and the prompts needed to extract data from them, but it can add up to 10x or even 100x the cost of OCR!
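A back-of-the-envelope calculation shows how quickly the gap opens up. The OCR figure below uses the SimpleIndex OCR Server price quoted in this article ($1,400 per 1M pages); the LLM token count and per-token rate are assumptions for illustration only:

```python
# Back-of-the-envelope cost comparison; LLM figures are assumed, not quoted.
ocr_cost_per_page = 1400 / 1_000_000             # $0.0014 per page

tokens_per_page = 1500                           # assumed: page text + prompt
llm_rate_per_1k_tokens = 0.01                    # assumed blended $/1K tokens
llm_cost_per_page = tokens_per_page / 1000 * llm_rate_per_1k_tokens

print(f"OCR ${ocr_cost_per_page:.4f}/page vs LLM ${llm_cost_per_page:.4f}/page "
      f"({llm_cost_per_page / ocr_cost_per_page:.0f}x)")
```

Even with these modest assumptions the LLM comes out roughly an order of magnitude more expensive per page, and longer prompts or multi-pass extraction push the ratio higher.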
You also need to consider that AI companies are heavily subsidizing LLMs in order to attract users. At current pricing, ChatGPT loses about $0.01-$0.05 on every transaction (source: asked ChatGPT). This is more than the retail price for OCR! When investors start demanding returns, the price of LLM processing will skyrocket. The current model is unsustainable, dependent wholly on investor funding, with no foreseeable path to profitability. You can learn more about the AI pricing bubble in this research.
For decades the tech industry has used VC money to offer new technologies for free, or sell them at a massive loss, to build a user base and drive out competition. Once the new market is monopolized, the prices start increasing. We’ve seen it with Amazon, Uber, AirBnB, streaming services, search and social media marketing, and every other innovation brought to us by Silicon Valley since the 90s. It’s so predictable at this point that it’s hard to believe anyone actually thinks this won’t be the case with LLMs!
Conclusion
As you might expect from an article published on SimpleIndex.com, it turns out that SimpleIndex is the easiest and most affordable solution for any OCR or document capture process that is within its capabilities! If you need an Enterprise Data Capture Platform, or have documents that require the specific capabilities of an LLM, then those solutions are available and we can help you find the right one.

