So you want to digitize your documents? We're here to make that as simple and not terribly boring as possible!
This page has both a beginner's guide to document scanning concepts and an advanced guide to automating batch document scanning with bar codes and OCR.
The guide is written to give you real information instead of marketing, but you can follow the links to read about the relevant features of SimpleIndex and other document management solutions on ScanStore.
If you are not yet convinced that you should scan your documents, here are some of the biggest reasons:
A document scanner is a scanner with an Automatic Document Feeder (ADF), designed to take stacks of paper documents and scan them automatically. A wide variety of document scanners now exist that range in price from under $200 on up to hundreds of thousands for industrial models. What do you get for your extra money? Speed!
The faster the scanner, the less time it takes to scan huge volumes of paper. Time is always the biggest cost in large scanning projects. Time is the most valuable thing in life, next to money, which time also is. So you want a fast scanner.
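To put rough numbers on that, here is a minimal sketch of the arithmetic. The function name, the page count and the speeds are made-up illustrations, and the `duty_factor` parameter is our own assumption to account for time spent reloading the hopper and clearing jams.

```python
def scanning_hours(total_pages: int, pages_per_minute: float, duty_factor: float = 1.0) -> float:
    """Estimate hours of scanner time for a job.

    duty_factor is the (assumed) fraction of time the scanner is actually
    feeding paper, e.g. 0.5 if half the time goes to reloading and jams.
    """
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    if not 0 < duty_factor <= 1:
        raise ValueError("duty_factor must be in (0, 1]")
    minutes = total_pages / (pages_per_minute * duty_factor)
    return minutes / 60


if __name__ == "__main__":
    # Hypothetical job: 60,000 pages at three different rated speeds.
    for ppm in (20, 60, 120):
        hours = scanning_hours(60_000, ppm, duty_factor=0.5)
        print(f"{ppm:>3} ppm: about {hours:.1f} hours")
```

Multiply the hours by what you pay the person standing at the scanner and the price difference between models starts to look a lot smaller.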
Other than speed, you need to consider whether you have special requirements like portability, color detection, imprinting and other specialty features. The quality of the scanner's feeder, driver interface and scanned image also varies from brand to brand, with more expensive models having more reliable document feeders with larger hoppers and more consistent image quality. Virtual ReScan or VRS is included with many scanners and can make the image quality much more reliable while simplifying the settings interface. However, many scanners without VRS have equally good image enhancement software built in.
ScanStore has a handy scanners guide that has more information on scanner features and how to select the best scanner for your requirements.
The real first step is planning. We'll cover that in the next section, but for now here's what the actual day-to-day process of scanning a big bunch of documents is like.
First you have to get the documents ready. That means pulling any staples and paperclips, taping down loose edges, post-its, small documents and anything else that might get stuck in the document feeder.
In some cases you will need to insert barcode separator sheets to indicate the start of each new document.
You take these very neatly stacked piles of paper and feed them into the scanner. The neater the stacks, the less you have to open the thing up and pull out little bits of paper and staples, which generally makes for a more pleasant and swear-free work environment.
You'll have to use some kind of software to drive the scanner and save the images. Depending on the program it could require a little interaction or a lot to start scanning to the right place with the correct scanner settings. Most free programs will require you to use a "Save As" style dialog to scan and save files one at a time. This is OK for a few documents, but if you have hundreds or thousands you'll want something more streamlined for batch scanning.
While you're scanning you should watch the feeder to try and stop jams before they happen, while also watching the images on the screen to make sure the images aren't too light or dark to be legible. Though improved scanner quality, image enhancement tools like Virtual ReScan and color scanning have greatly lessened this concern, the person working the scanner should know what the correct settings are for different types of documents, how to set them and how to adjust them to make very light or very dark images legible.
The next step is for the scanned images to be processed. This means enhancing the image by straightening it, adjusting the color, cropping borders and removing hole punch marks. There are a variety of ways to improve the quality of scanned images. This not only makes them more readable to you, but also makes the next step more efficient...
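As a toy illustration of one of these cleanup steps, the sketch below trims blank margins from a page. It assumes the image is a plain list of rows of grayscale values (0 is black, 255 is white), and the `white_threshold` cutoff is an arbitrary choice of ours. Real scanning software works on actual image files and is far more sophisticated; this only shows the idea.

```python
def crop_to_content(image, white_threshold=250):
    """Trim rows and columns that contain nothing darker than white_threshold.

    image is a list of equal-length rows of grayscale values (0-255).
    Returns a new list of rows; an all-blank page returns an empty list.
    """
    if not image:
        return []

    # Rows and columns that contain at least one "ink" pixel.
    rows = [i for i, row in enumerate(image)
            if any(pixel < white_threshold for pixel in row)]
    if not rows:
        return []
    cols = [j for j in range(len(image[0]))
            if any(row[j] < white_threshold for row in image)]

    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


if __name__ == "__main__":
    page = [
        [255, 255, 255, 255],
        [255,   0,  10, 255],
        [255,  20, 255, 255],
        [255, 255, 255, 255],
    ]
    for row in crop_to_content(page):
        print(row)
```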
...and that step is reading data from the documents. Either from the text or from those barcodes you put between them in the first step. Or, if you're clever or lucky, there were already bar codes on the documents when you got them. In any case, the better the quality of your scan the fewer exceptions you will have to deal with manually when the software can't read it correctly.
Remember those exceptions from the previous sentence? Now you have to handle those. Depending on the quality of the originals, the scanner and the recognition software, you could have a lot of exceptions to deal with or very few. In any case these will need to be reviewed by a human and have the missing data typed in.
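In software terms, exception handling boils down to sorting records into "complete" and "needs a human". Here is a minimal sketch, assuming each scanned document is represented as a dictionary of index fields. The field names in `REQUIRED_FIELDS` are hypothetical examples, not a fixed schema.

```python
# Hypothetical set of index fields every document must have before export.
REQUIRED_FIELDS = ("name", "date", "reference")


def split_exceptions(records, required=REQUIRED_FIELDS):
    """Separate records with all required fields from those missing some.

    Returns (complete, exceptions), where exceptions is a list of
    (record, missing_field_names) pairs for a person to review.
    """
    complete, exceptions = [], []
    for record in records:
        missing = [field for field in required
                   if not str(record.get(field) or "").strip()]
        if missing:
            exceptions.append((record, missing))
        else:
            complete.append(record)
    return complete, exceptions


if __name__ == "__main__":
    batch = [
        {"file": "0001.tif", "name": "Acme Corp", "date": "2023-04-17", "reference": "1042"},
        {"file": "0002.tif", "name": "Globex", "date": "", "reference": "1043"},
    ]
    ok, review = split_exceptions(batch)
    print(f"{len(ok)} ready to export, {len(review)} to review")
    for record, missing in review:
        print(record["file"], "is missing", ", ".join(missing))
```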
Once all the exceptions have been dealt with, the images are exported to the document repository. This can be a network share, cloud service, SharePoint server, document management system, custom database or a variety of business applications that support attaching digital files.
When your images are saved in one of the aforementioned document repositories, they need to have relevant keywords and data associated with them so they can be organized and found later when you need them. The most basic way to do this is using folders and filenames on your hard drive. More advanced document management solutions will let you assign specific labels to each document such as name, date, reference numbers and any other information you might want to use to find each file. They can also include integrated viewers, storage systems, security and records management functions.
So before you begin choosing a scanning solution, you need to think about what type of document repository you need and what information you will use to label and organize those files in that repository. Some things to consider when selecting a document repository are:
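For the folders-and-filenames approach, the idea is simply to build each file's path out of its index data. The sketch below is one possible scheme, assuming three index fields (name, date, reference) and an ISO-style date. The folder layout and the function names are our own illustration.

```python
import re
from pathlib import PurePosixPath


def safe_part(value: str) -> str:
    """Replace characters that are not allowed in file or folder names."""
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", value.strip())
    cleaned = cleaned.rstrip(". ")  # trailing dots/spaces cause trouble on Windows
    return cleaned or "_"


def build_path(root: str, name: str, date: str, reference: str, ext: str = ".pdf") -> PurePosixPath:
    """Build root/name/year/date_reference.ext from the index fields.

    Assumes date is formatted YYYY-MM-DD so the first four characters are the year.
    """
    year = date[:4]
    filename = f"{safe_part(date)}_{safe_part(reference)}{ext}"
    return PurePosixPath(root) / safe_part(name) / safe_part(year) / filename


if __name__ == "__main__":
    print(build_path("archive", "Acme Corp", "2023-04-17", "INV/1042"))
```

With a scheme like this you can find a document just by browsing folders, but you can only "search" on the fields you baked into the path.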
When deciding what data you want to use to find your documents, consider these questions:
If you already have the data you can associate it with scanned images automatically without having to retype it. If the data is on the document as text or a barcode, it can be read from the image automatically with the right software. This process is discussed in detail in the next section.
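Here is a minimal sketch of the "you already have the data" case: a single key value read from the document (say, from a barcode) is used to look up the rest of the index fields in data you exported from another system. The CSV contents and column names are invented for the example.

```python
import csv
import io

# Hypothetical export from an existing accounting system.
EXISTING_DATA = """invoice_no,customer,invoice_date
1042,Acme Corp,2023-04-17
1043,Globex,2023-04-18
"""


def load_lookup(csv_text: str, key: str) -> dict:
    """Read CSV text into a dictionary keyed on one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def index_from_lookup(barcode_value: str, lookup: dict):
    """Return the full index record for a scanned key, or None if unknown."""
    return lookup.get(barcode_value.strip())


if __name__ == "__main__":
    lookup = load_lookup(EXISTING_DATA, key="invoice_no")
    print(index_from_lookup("1042", lookup))
    print(index_from_lookup("9999", lookup))  # not in the data: a manual exception
```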
There are 3 manual labor components to document scanning. These are the biggest cost of any scanning project, so automating these processes is the key to keeping the overall cost low. They are:
The first two steps are physical. They are only made more efficient by good ergonomics and a faster scanner. The third step is done in software, either by typing or by reading the necessary data from the document itself. This data can be read with bar codes, or by reading the text from the image with Optical Character Recognition (OCR).
Unless you skipped the last paragraph, you know that OCR stands for Optical Character Recognition. OCR software is able to take digital images of text and turn them into machine-readable text that can be searched or edited.
There are just a handful of OCR "engines" out there that implement the complex algorithms required to accurately read text, and every scanning application uses one of these engines. The low-cost and open source engines found in most desktop scanning applications give significantly less accurate results than the top-tier engines like ABBYY FineReader and Nuance OmniPage, especially on lower quality scans. So it is important to consider which OCR engine is being used in the software you select.
When scanning many copies of the same document type with identical layouts, you can use Zone OCR to read the text in a specific place on each page. This feature is supported by most scanning applications, but can be unreliable when other lines or text in the document are very close to the data you are trying to read, since the page can shift or skew in the scanner, causing the zone to misalign with the image.
More advanced OCR applications can locate data on the page even when it doesn't always appear in the same place. They can do this using pattern matching or by finding field labels. This type of software can also correct for document shift in zone OCR applications.
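To see why a shifted page trips up zone OCR, here is a small sketch. It assumes the OCR engine has already returned each word with a pixel bounding box. The `Word` and `Zone` structures are hypothetical stand-ins, not any particular engine's output format.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    left: int
    top: int
    right: int
    bottom: int


class Zone(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def read_zone(words, zone):
    """Return the text of all words whose center point falls inside the zone."""
    hits = []
    for word in words:
        center_x = (word.left + word.right) / 2
        center_y = (word.top + word.bottom) / 2
        if zone.left <= center_x <= zone.right and zone.top <= center_y <= zone.bottom:
            hits.append(word)
    hits.sort(key=lambda w: (w.top, w.left))  # reading order: top to bottom, left to right
    return " ".join(w.text for w in hits)


def shift_down(words, pixels):
    """Simulate a page that fed through the scanner slightly lower than usual."""
    return [w._replace(top=w.top + pixels, bottom=w.bottom + pixels) for w in words]


if __name__ == "__main__":
    page = [Word("PAID", 400, 20, 460, 40), Word("1042", 400, 50, 460, 70)]
    invoice_zone = Zone(380, 45, 480, 75)
    print(read_zone(page, invoice_zone))                  # well-aligned page
    print(read_zone(shift_down(page, 30), invoice_zone))  # shifted page picks up the wrong line
```

On the shifted page, the line above the invoice number slides into the zone and the number slides out, which is exactly the failure described above.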
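Both approaches can be sketched with regular expressions run over the full-page OCR text. The invoice number format (`INV-` followed by six digits) and the sample text are invented for the example, and real products will use their own, more robust methods.

```python
import re

# Hypothetical format: "INV-" followed by exactly six digits.
INVOICE_PATTERN = re.compile(r"\bINV-\d{6}\b")


def find_by_pattern(text: str):
    """Find the first value that looks like an invoice number, wherever it is."""
    match = INVOICE_PATTERN.search(text)
    return match.group(0) if match else None


def find_by_label(text: str, label: str):
    """Find the first non-space token that follows a field label such as 'Date:'."""
    match = re.search(rf"{re.escape(label)}\s*[:#]?\s*(\S+)", text, flags=re.IGNORECASE)
    return match.group(1) if match else None


if __name__ == "__main__":
    ocr_text = "ACME Corp\nInvoice Number: INV-004217\nDate: 2023-04-17"
    print(find_by_pattern(ocr_text))
    print(find_by_label(ocr_text, "Date"))
```

Neither function cares where on the page the value was printed, only what it looks like or what it sits next to.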
Barcode recognition is much faster and more accurate than OCR. If you are creating the documents you will ultimately need to scan, there is no reason not to put the key information you need to index them in a barcode. There are free barcode fonts available that will work with any document editor that lets you pick fonts.
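As an example, Code 39 is one symbology commonly offered as a font. (That choice is our assumption here; the same idea applies to other fonts with different rules.) Code 39 needs an asterisk before and after the data as start and stop characters, and only supports digits, uppercase letters and a few symbols. A minimal sketch for preparing the text you would then format in the barcode font:

```python
# Characters that standard Code 39 can encode (the asterisk is reserved for start/stop).
CODE39_CHARS = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-.$/+% ")


def code39_text(value: str) -> str:
    """Return the string to type in a Code 39 font: *VALUE* in uppercase.

    Note: some fonts want spaces replaced with another character; check the
    documentation of the font you use. This sketch leaves spaces as they are.
    """
    value = value.upper()
    unsupported = sorted(set(value) - CODE39_CHARS)
    if unsupported:
        raise ValueError(f"Code 39 cannot encode: {''.join(unsupported)}")
    return f"*{value}*"


if __name__ == "__main__":
    print(code39_text("inv-1042"))
```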
You can also insert bar code separator sheets between documents to indicate a document break and provide index data. When scanning many documents at once, it helps not to have to stop the scanner between each new document. If you have multi-page documents where the number of pages is different in each one, separator sheets are recommended. Separator sheets can be printed in bulk when scanning large batches of existing documents, or one at a time for individual scans (useful for network scanners).
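The splitting logic behind separator sheets can be sketched in a few lines. This assumes the barcode reader has already run, so each scanned page arrives as a pair of a page identifier and the barcode value found on it (or `None` for an ordinary page), and that only separator sheets carry barcodes. Both are simplifying assumptions for the example.

```python
def split_documents(pages):
    """Group a stream of scanned pages into documents using separator sheets.

    pages is an iterable of (page_id, barcode_value_or_None). A page with a
    barcode is treated as a separator: it starts a new document, supplies the
    index value, and is itself dropped from the output.
    """
    documents = []
    current = None
    for page_id, barcode in pages:
        if barcode is not None:
            current = {"index": barcode, "pages": []}
            documents.append(current)
            continue
        if current is None:
            # Pages scanned before any separator: keep them, but with no index data.
            current = {"index": None, "pages": []}
            documents.append(current)
        current["pages"].append(page_id)
    return documents


if __name__ == "__main__":
    batch = [
        ("p1", "DOC-A"), ("p2", None), ("p3", None),
        ("p4", "DOC-B"), ("p5", None),
    ]
    for doc in split_documents(batch):
        print(doc["index"], doc["pages"])
```

Because the page count of each document never appears in the code, this works no matter how many pages follow each separator.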
SimpleCoversheet is our free tool for creating barcode coversheets.
OCR cannot read hand printed text. For that you need ICR software and even then the handprint needs to be constrained with boxes or combs to be read accurately.
If you have a lot of data that needs to be read from a document, a forms processing application is more appropriate for the task than one designed for document scanning.
As mentioned previously, the quality of the OCR engine can make a big difference in how accurately a document can be read. This is especially true if a document has been copied, faxed or poorly scanned. A good rule of thumb is that if you have to read it twice or squint to figure out what it says, the software probably won't be able to read it either.
SimpleIndex is able to read barcodes, perform zone OCR and even uses pattern matching to locate data in different locations on each document. It has far more features than other scanning applications in its price range, and is far less expensive than other software with these features. It also uses the ABBYY FineReader OCR engine for the most accurate OCR results, as well as an advanced multi-engine voting technology for reading even the most degraded barcodes.
Its streamlined interface simplifies the scanning workflow, reducing many tasks to just a single mouse click. And finally it can integrate with a wide variety of document management systems, SharePoint, cloud storage solutions and third party applications. Even if your software includes a scanning interface, it is often much more efficient to use SimpleIndex instead, due to its powerful automation and simple interface.