Optical Character Recognition (OCR) Primer – Part 5 of 5
In the last of our five-part series on Optical Character Recognition, we take a look at forms recognition, extracting data using OCR, how to automate common business processes with OCR, and how to calculate the ROI of an OCR process.
OCR & Form Recognition
Form recognition is an area very relevant to the document imaging industry. In fact, it is one way for a company to save money by automating processes that are now done manually. It is estimated that industry spends as much as $20 Billion annually on field coding, which means taking information that people have written or typed on form documents, invoices, etc., and keying this information into a database. Certainly, some of this work could be automated – the question is, how much?
There are multiple problems involved in “form recognition”. The most straightforward case is the “fixed form”, where the form always has exactly the same appearance. Even in a fixed form environment, where the form type can be detected with almost absolute certainty, there can be problems. Assume the task at hand is to identify the precise form type and then extract certain fields. It’s possible that the fields are handwritten or, even if they are machine printed, that OCR accuracy is less than 100%. So what needs to be learned is not just the form characteristics, but also the constraints on the different fields to be extracted, e.g., that a date field must contain a valid date. For handwritten documents, ICR (intelligent character recognition) is less than reliable, so redundancy can also be a key factor in reliability. If multiple fields on the form give the same, or database-related, pieces of information, they can be combined to yield much higher recognition accuracy.
Not all forms are fixed. Examples include bank transaction statements that resemble business letters and differ based on the issuing bank. Other examples are “Dept. of X” files on an individual, where X could be Housing, Corrections, Employment, Education, etc. These documents may differ by State of issue and, within each State, by County. Such forms vary in structure, and the field information may be embedded anywhere in the document.
Most form recognition problems where companies could see serious ROI from a fully automated or semi-automated recognition system are beyond the capabilities of current off-the-shelf OCR, form recognition, and data extraction systems. That does not mean, however, that a solution cannot be engineered to a company’s specifications, based on its unique set of forms to be processed, data to be extracted, possible data redundancy, and other factors. Any system that involves 3 or more full-time data entry personnel, from menial data entry to more complex data entry and analysis, is a candidate for automation (or at least semi-automation). Consulting a company with the right expertise in the area of form recognition (e.g., CVISION) can make all the difference.
Data Extraction with OCR
The data extraction problem is very closely coupled with form recognition. Usually, when a company needs data extraction it is in the context of form recognition. This means that one cannot extract meaningful data in the absence of recognizing the form type. This is distinct from the general OCR problem. The general OCR problem is to extract as much meaningful text from an image document as possible. There is no assumption about prior knowledge with respect to this document, other than perhaps what language the document is in. So OCR is ideal for full-text search where a database index needs to be constructed to allow for arbitrary text-based queries. But general OCR is not ideal for field coding, when certain fields need to be precisely coded into the record of a database and they must be entered correctly or there may be no way to find this document later.
For a reliable automated or semi-automated data extraction / field coding system to work, characteristics of the application need to be known ahead of time. These are aspects the system needs to train on to achieve effective recognition rates. For example, if the system being constructed or maintained is a University database, then the fields necessary for each student record must be known a priori. In addition, whatever constraints are available for each record field must either be explicitly entered, e.g., in an XML-based file, or learned by the system during training. So the data extraction system, if looking to code a social security field per student, should know that the field is numeric, consisting of exactly 9 digits (with possible embedded hyphens).
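The social security constraint described above can be expressed as a small validation step applied to whatever text the recognizer returns. The following is only a minimal sketch: the function name normalize_ssn is hypothetical, and it assumes the recognizer hands back a plain string for the field.

```python
import re

# Constraint from the text: exactly 9 digits once embedded hyphens are removed.
NINE_DIGITS = re.compile(r"[0-9]{9}")


def normalize_ssn(raw):
    """Return the 9-digit string if `raw` satisfies the constraint, else None.

    `normalize_ssn` is a hypothetical helper name used for illustration.
    """
    digits = raw.strip().replace("-", "")
    if NINE_DIGITS.fullmatch(digits):
        return digits
    # The field violates its constraint, so it should not be coded automatically.
    return None


print(normalize_ssn("123-45-6789"))  # passes the constraint
print(normalize_ssn("12345678O"))    # letter O misread as a digit, rejected
```

A value that fails the check is not necessarily wrong on the paper form; it only means the recognized text cannot be trusted and should be reviewed rather than written to the database.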
Forms, Data Constraints and Redundancies: Check Deposit
There are many factors to solving the data extraction problem correctly. Among them are form-based constraints, data constraints, and data redundancies. For example, these three factors are all very useful in accurate coding of check deposit information. When you drop your checks off for deposit, there are usually some checks and a deposit slip. The deposit slip is usually handwritten, though company related information like account number may already be printed on the deposit slip. The checks themselves are either handwritten or machine printed. Routing and bank branch information are encoded on the bottom of each check using special numeric symbols that are easily recognizable.
One of the issues in solving this problem reliably is that the check deposit process usually contains handwritten data, and handwritten data recognition is still considered largely an unsolved problem. However, there are some data redundancies and form & data constraints that make the problem largely solvable. In particular, on the deposit slip, which is basically a form, there are boxes for each numeric character. This prevents the user from writing unconstrained cursive for the dollar amount. It also handles the difficult segmentation problem, as each numeric character has already been isolated. In addition, recognition accuracy for handwritten numeric characters is significantly higher than for unconstrained handwritten characters. Furthermore, the check deposit slip asks for each check amount, even though it is already on the check, and the check deposit total is requested twice. So there is redundancy with respect to each check amount, redundancy with respect to the total deposit, and additional redundancy in that the sum of all the checks (and other deposits) must ADD up to the total deposit amount. The semi-automated check deposit system in place at many large banks takes advantage of all these constraints and redundancies and, as a result, processes the average check for considerably less cost than 10 years ago, i.e., pre-automation.
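The three redundancies just described can be turned into explicit cross-checks. The sketch below is illustrative only and is not the method used by any particular bank: the function name reconcile_deposit and its arguments are hypothetical, amounts are assumed to arrive as decimal strings from the recognizer, and cash or other non-check deposits are ignored for simplicity.

```python
from decimal import Decimal


def reconcile_deposit(check_amounts, slip_amounts, total_first, total_second):
    """Return a list of discrepancies; an empty list means all redundancies agree.

    All names here are hypothetical. Amounts are decimal strings such as "25.50".
    """
    checks = [Decimal(a) for a in check_amounts]   # amounts read from the checks
    listed = [Decimal(a) for a in slip_amounts]    # amounts read from the deposit slip
    t1, t2 = Decimal(total_first), Decimal(total_second)
    problems = []

    # Redundancy 1: each check amount also appears on the deposit slip.
    if len(checks) != len(listed):
        problems.append("number of checks differs from number of slip entries")
    else:
        for i, (c, s) in enumerate(zip(checks, listed), start=1):
            if c != s:
                problems.append(f"check {i}: check reads {c}, slip reads {s}")

    # Redundancy 2: the deposit total is entered twice.
    if t1 != t2:
        problems.append(f"totals disagree: {t1} vs {t2}")

    # Redundancy 3: the check amounts must add up to the deposit total.
    if sum(checks, Decimal("0")) != t1:
        problems.append("sum of checks does not equal the deposit total")

    return problems


print(reconcile_deposit(["100.00", "25.50"], ["100.00", "26.50"], "125.50", "125.50"))
```

Decimal is used instead of floating point so that dollar amounts compare exactly.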
I Know that I Don’t Know that I Know …..
What is very important in the design and implementation of any field coding / form learning / data extraction system is to know what you know. And what you don’t. The reason for this is simple: If an automated system performs correctly, the Company saves money and sees ROI (return on investment). If the automated system makes mistakes that go UNDETECTED, even occasionally, it could cost the Company a lot more in correcting the situation than the automation saved.
Going back to the semi-automated check deposit example: if the system correctly recognizes all the dollar amounts, on both checks and deposit slip, 90% of the time, is this a win for the Bank or not? The answer is totally dependent on whether the system knows what it knows. Meaning, if the system knows when a numeric value MAY be incorrect because all the numeric information, which is heavily redundant, is not in sync then any such case can be shown to a human operator without penalty so that the automation is a win for the bank. If the system, however, does not have the controls in place to verify the correctness of the extracted data then this system is probably not commercially viable, unless each transaction is shown to a human for the purpose of verification.
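The argument above can be made concrete with a simple expected-cost model. This is a sketch under stated assumptions, not a measured result: the function name, the cost figures, and the detection rates below are all hypothetical, and the model assumes that correctly recognized transactions are never flagged for review.

```python
def expected_cost_per_transaction(accuracy, detection_rate,
                                  review_cost, undetected_error_cost):
    """Expected cost of one automated transaction (hypothetical model).

    accuracy: fraction of transactions recognized correctly (0.90 in the text).
    detection_rate: fraction of the erroneous transactions the system flags.
    review_cost: cost of showing one flagged transaction to a human operator.
    undetected_error_cost: cost of correcting one error that slipped through.
    """
    error_rate = 1.0 - accuracy
    flagged = error_rate * detection_rate             # errors caught and reviewed
    undetected = error_rate * (1.0 - detection_rate)  # errors that go unnoticed
    return flagged * review_cost + undetected * undetected_error_cost


# Illustrative, made-up costs: a review costs 1 unit, an undetected error 50 units.
knows = expected_cost_per_transaction(0.90, 1.0, 1.0, 50.0)
blind = expected_cost_per_transaction(0.90, 0.0, 1.0, 50.0)
print(f"system that flags every error: {knows:.2f} per transaction")
print(f"system that flags no errors:   {blind:.2f} per transaction")
```

With the same 90% recognition rate, the cost is driven almost entirely by the detection rate, which is the point of the section: the recognition rate alone does not decide whether the automation pays off.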
Business Process Automation and how it relates to OCR
Business Process Automation (BPA), also known as office automation, is the field concerned with identifying applications in a business that can be automated, then designing and implementing a solution. Unlike other areas in business, it is usually easy to make a business case for business process automation, as any successful installation of such a system will reduce a Company’s manpower costs. Thus, with BPA it should not be difficult to show a return on investment (ROI) in some reasonable time.
So if the challenge in office automation is not justifying the installation of an automated system from an ROI perspective, where do the difficulties lie?
There are at least 3 issues that typically get in the way of a company deploying an application-specific automated system (ASAS):
i. engineering-specific obstacles to overcome;
ii. integration into existing workflow;
iii. modification of existing processes.
Let’s review each of these issues briefly.
i. Engineering-specific obstacles: Office Automation is generally not trivial and out-of-the-box solutions usually do not work. It is easy to get frustrated when your IT department can’t get the problem solved. The fact is that most automation solutions are complex, push the envelope with respect to current technology, and need to be engineered by experts.
Q: But if the automation venture has a strong upside but involves some risk, how do you minimize your Company’s exposure and come out on top?
A: Have the specifications for the working, automated system very precisely defined before approving the project. This way there is no ambiguity as to how the system will operate. If the system operates to specifications, then the Company’s IT department can be sure the system will end up going online and paying for itself. In addition, the Company should leave most of the risk with the automation engineering company, so that there is very little financial exposure until the system is up and running. On the other hand, the automation company (e.g., CVISION) will take the job only if it’s sure it can get the job done to spec. In addition, there needs to be a significant upside for the automation company once the system is operational, in return for taking the risk in design, development, and implementation of the automation system.
ii. Integration into existing workflows: Of the 3 issues listed above, this one is probably the easiest to overcome. In most companies, an automated system would replace some system already in place. As a result, it must be integrated into an existing workflow. This is usually straightforward engineering, with no research component and little risk, if any. Thus, if the automated system will show a clear ROI to the Company once operational, the integration component should not get in the way of making the project viable.
iii. Modification of existing processes: Sometimes, in order for an automation application to be successful, certain processes within the application need to be redesigned. In our check deposit example, this was true for many Banks. For example, Chase had to redesign their check deposit forms to support automation by requiring that the deposit total amount be entered twice. Similarly, many automation projects require some aspect of redesign in order to satisfy some of the specifications for the project to be feasible. This part of an automation project is a little tricky since often more than just one group of the Company may need to get involved once issues like form redesign come up.
The relevant decision factor here is, and should be, return on investment. If someone at the Company is convinced that the automation project will yield the Company significant ROI long term, and the specs have been carefully put together, then process re-engineering should not get in the way. In particular, no re-engineering (other than on a prototype basis) should be done until the automation group produces a running system that is shown to satisfy the system specifications. Once this prototype has been developed, such that the risk factors have been eliminated (or greatly reduced), then it is time to re-engineer the process workflow as needed for the system to go into production.
What do we mean by OCR-based ROI? It is the ability of a Company to show a clear return on investment from converting its paper documents to searchable electronic files. This metric is not always easy to compute. Unlike automation, it does not necessarily involve reducing the workforce, but may instead make the work time of existing employees more productive. Having a Google-like ability to find documents at a law firm, bank, financial institution, or insurance company should greatly reduce a Company’s cost of locating documents and make the related workers more productive. In the long term, this should involve a clear ROI for the Company, but might be difficult to quantify precisely.
If introducing fully text-searchable documents will reduce the size of the Company’s file room staff or mailroom staff, then the ROI case becomes rather straightforward. Of course, if conversion to searchable documents, say from TIFF to PDF image + hidden text format, is combined with compression, then it becomes easier to show tangible ROI. If the document file size after converting to compressed, searchable, web-optimized PDF is 10x smaller than before then, searchability notwithstanding, one can make a clear ROI argument based on reduced storage requirements, reduced document-related transmission bandwidth requirements, and reduced web-hosting fees.
Towards the Paperless Office
Are companies moving towards the goal of a paperless office?
The answer seems to be both YES and NO.
If the issue being considered is: Are companies putting systems in place such that, eventually, all corporate documents will be fully searchable? The answer seems to be YES. Electronic Content Management (ECM), making sure all company information is available and searchable electronically, is an effective way for companies to manage their documents and information. And companies seem to be clearly moving in this direction.
The related paperless office question is: Are companies getting rid of office paper? The answer here appears to be unequivocally NO. All indications are that offices are printing record amounts of paper. The amount of paper being used at the corporate level shows no indication at all of slowing down.
Why is it, if companies are moving in the direction of Electronic Content Management, that paper use is up? There are several reasons given to explain the elusive paperless office. One reason for increased paper use is that a lot more corporate searching and research is going on at the Internet level, but this is only for finding documents, not for reading them. If one finds a relevant research report, white paper, manual, etc. available online they are very likely to print it in its entirety, rather than reading it off the screen. Then if this document is sent to a co-worker, they are also liable to print it. In the pre-Google days, there was a greater likelihood the physical paper copy would be passed around the office rather than copied several times.
Another reason why office paper use is on the rise, even with ECM, is that many documents are now kept only electronically in the office, meaning that no physical paper copy, say of a contract, has ever been filed. On the upside, the corporate file room has been eliminated, downsized or moved offsite as relevant documents are all available electronically. In addition, the time to find relevant documents has been drastically reduced. The flip side of this is that document generation has become relatively easy, such that these corporate files are generated from electronic copy “on demand”. This streamlines many document-related processes, but also creates an increase in company print utilization as paper documents are generated dynamically.
We hope you have enjoyed our series on OCR technology. As always, the experts at WebDocs stand ready to help you with any of your document management and automation needs. Just contact us for a friendly, free, no obligation consultation.