Our Blog

14

Apr 2014

Reducing Inefficiencies: Using ICR and OCR to Streamline Business Processes

Posted by / in Blog / No comments yet

According to Collegecrunch.org, the average salary for a data entry clerk is $12.86/hr or almost $27,000 a year. These are people whose sole function is to input data into your system, verify it, and appropriately process it. How many people do you have doing this kind of work in your company? How much are you paying them? Could their time and talents be put to better use?

For many businesses, data entry and processing is a huge part of their business process. It is how they get information provided from outside sources into their data management system so that they can fill orders, process payments, address issues, and make informed business decisions. This puts data entry at the heart of all business enterprises and creates a critical need for more efficient and accurate data entry processes. One way to enhance the efficiency of data entry is to automate data input.

This can be accomplished using two technologies, Optical Character Recognition (OCR) and Intelligent Character Recognition (ICR), either separately or in concert with one another. Both are types of software that take the information on a digitally rendered document and translate it into a language that the computer can understand. In essence, OCR and ICR are computer programs that act as translators for written human communication. The primary difference between OCR and ICR is what each is capable of reading. Optical Character Recognition programs are only capable of reading characters created using a printer or typewriter. Intelligent Character Recognition is capable of reading handwriting. There are limitations, of course: for instance, neither ICR nor OCR will read characters rendered in a script font or in cursive handwriting, and both would have a difficult time reading a free-form letter. On standard forms and correspondence, however, these technologies are highly effective.

Here’s how it works. You take a document that you would like to input into your data management system (like WebDocs or a line-of-business application such as your accounting package) and capture an image of it using a scanner. You then send that image to your ICR/OCR program. The ICR/OCR software overlays that image with a grid (think graph paper) and maps out the whole sheet, then compares the image that appears in each square of the grid against a database of characters that it was programmed with. If the software is unable to match your characters, it generates an error and lets an operator know that the document will have to be processed manually. The software takes the data that it is able to read and imports it into your data management system, where it is filed digitally.
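To make the grid-and-compare idea concrete, here is a minimal sketch in Python. It assumes each grid square has already been reduced to a tiny 3x3 black-and-white bitmap, and the three-letter character database, the `threshold` value, and all function names are hypothetical. Real ICR/OCR engines are far more sophisticated; this only illustrates the matching step and the fall-back to manual processing.

```python
# Hypothetical 3x3 "character database"; "1" is ink, "0" is blank paper.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def similarity(cell, template):
    """Fraction of pixels that agree between a scanned cell and a template."""
    pairs = [
        (a, b)
        for cell_row, template_row in zip(cell, template)
        for a, b in zip(cell_row, template_row)
    ]
    return sum(a == b for a, b in pairs) / len(pairs)


def recognize_cell(cell, templates=TEMPLATES, threshold=0.8):
    """Return the best-matching character, or None if nothing matches well."""
    best_char, best_score = None, 0.0
    for char, template in templates.items():
        score = similarity(cell, template)
        if score > best_score:
            best_char, best_score = char, score
    return best_char if best_score >= threshold else None


def recognize_cells(cells):
    """Read a row of grid cells and flag it for an operator on any miss."""
    chars = [recognize_cell(cell) for cell in cells]
    needs_manual_processing = any(char is None for char in chars)
    return chars, needs_manual_processing
```

A slightly smudged "T" such as `("111", "010", "011")` still clears the threshold, while a cell that resembles none of the templates comes back as `None` and marks the document for manual processing.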

So how can you use this technology to streamline your data entry processes? It is pretty simple. By employing ICR/OCR software you can feed incoming communications into the program, set it so that one of your data entry professionals reviews the computer’s work for accuracy, and then allow the program to send a copy to your data management system for digital filing and processing. For images that the computer is able to translate, the only time your data entry personnel have to touch that piece of data is when they approve it. In a traditional system, data entry personnel would have to, at bare minimum, enter the data, verify that it was entered correctly, and then file the information. If you have 100 forms, that equals a minimum of 300 tasks for that batch of forms, but with an ICR/OCR program that is able to read 80 of those forms, you have reduced the number of tasks to 140 (80 approvals plus three tasks for each of the 20 unreadable forms). If each task takes an average of thirty minutes, with a data entry clerk making $12.86/hr, in a traditional system you are looking at a processing cost of $1,929 per hundred forms, solely in man hours. Using an ICR/OCR system you would spend only about $900 on processing man hours, a savings of about $1,029 per one hundred forms. Of course, this is a very simplified model; your actual savings would vary based on the rate you pay your data entry clerks, the number of tasks required for each piece of data, their average time per task, and the number of forms your system can interpret, but you can see from this exercise that the potential savings can be enormous.
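If you would like to plug in your own numbers, the simplified model above can be sketched in a few lines of Python. The function and parameter names are just illustrative, and the defaults are the figures used in this example (three tasks per manual form, one approval per readable form, thirty minutes per task, $12.86/hr).

```python
def processing_cost(forms, readable_forms, tasks_per_form=3,
                    hours_per_task=0.5, hourly_rate=12.86):
    """Compare man-hour cost with and without ICR/OCR for one batch of forms."""
    # Traditional system: every form needs every task (enter, verify, file).
    manual_tasks = forms * tasks_per_form
    # ICR/OCR system: readable forms need one approval; the rest stay manual.
    assisted_tasks = readable_forms + (forms - readable_forms) * tasks_per_form

    manual_cost = round(manual_tasks * hours_per_task * hourly_rate, 2)
    assisted_cost = round(assisted_tasks * hours_per_task * hourly_rate, 2)
    return {
        "manual_tasks": manual_tasks,
        "assisted_tasks": assisted_tasks,
        "manual_cost": manual_cost,
        "assisted_cost": assisted_cost,
        "savings": round(manual_cost - assisted_cost, 2),
    }


result = processing_cost(forms=100, readable_forms=80)
print(result)
```

With the defaults, 100 forms of which 80 are readable gives 300 versus 140 tasks and a cost of $1,929.00 versus $900.20, which is where the rounded figures above come from.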

But if the software doesn’t do all the work, what’s the point? Allowing a computer to do the bulk of the manual processing takes less time than inputting the data manually, freeing your employees to do more important things. It also increases the rate at which data is processed, and you can see from my previous illustration how the increased speed in processing would result in fewer man hours and thus a significant cost savings. It has the added advantage of decreasing the processing time for payment receipt; for instance, if you are processing mail from your payment center and the average time to process it is 3 days, you can easily cut that processing time by half or more. Another advantage is that OCR and ICR don’t just take images that you scan; you can also use ICR/OCR to automatically input the information coming in through your fax and email systems. This is accomplished by assigning an email address and/or fax number to the ICR/OCR program, so that faxes and emails are routed directly to the software, which then performs the work of translation. This will reduce paper usage and further increase the speed of data processing and cost savings.

There are some important things to know here. First, ICR/OCR programs are NOT sentient robots! This means that you are still going to need your data entry staff, although their job functions will change. Instead of spending hours typing in data, they will spend time reviewing what the system has read and making corrections as necessary. Also, there will be times when the software is unable to read a document, meaning that some documents will still require manual processing, just a lot fewer of them. With the smaller workload you are looking at significant cost savings and having your data entry staff freed up to perform other tasks. Overall, you are looking at a significant return on investment in both time and cost on any ICR/OCR integrations that you invest in. For more information on how you can integrate ICR/OCR software into your business processes, contact us.

08

Apr 2014

There’s no future in on-premises IT — it’s time to move to the cloud

I’ve been lying to myself: I thought IT would survive the next shift in technology as all infrastructure moves to the cloud. But I no longer believe IT will survive that cloud shift — certainly not IT as we know it. Sure, there will always be on-premises dinosaurs like myself who prefer to install Exchange manually. But the shift to the cloud is coming.

The debate between operating expense (aka opex, the cloud’s approach) and capital expense (aka capex, the on-premises approach) is waged daily at companies. Although there are trade-offs no matter what a firm chooses, it’s clear that opex is increasingly favored. In the age-old rent-versus-buy debate, the cloud is making rental very compelling, especially as managed cloud environments begin to implement tools that provide automated spin-up/spin-down to avoid excessive consumption of resources (and higher costs).

I recently read “The Big Switch” by Nicholas Carr, the same guy who wrote the controversial 2004 title “Does IT Matter?” that questioned the future relevance of IT. His personal opinion aside, what struck me about “The Big Switch” is Carr’s explanation of how the use of electricity shifted from on-premises systems (companies had electricity departments, complete with electrical architects and managers, sort of like IT departments today) to third-party electrical grids that businesses simply tapped (like today’s cloud services). Electricity went from an item on which a business focused half its time, attention, and labor to a simple utility it plugged into and paid for. Like it or not, the same is happening in IT.

Some of us remember the days when IT held great prominence in any business and, as a result, saw a lot of money sent its way. But 20 years later, businesses sign up for Office 365, and an office manager adds user accounts to a simple Web interface. What was once handled by IT is becoming a commodity service provided by something very much like a public utility.

I recently logged into a Windows Azure portal, picked a template that included Windows Server 2012 and SharePoint, and spun it up in minutes while I went and got a cup of coffee. It was finished well before I returned.

I could have done the same thing on-premises too, but not without an infrastructure and a good deal of time to get the software in place to make it happen. By contrast, with Azure I didn’t have to do much of anything but choose what I wanted. It doesn’t matter whether you use Azure or another vendor’s service (I opt for Azure because it works with the Microsoft technologies I already know); the pattern is the same: log in, make a few choices, and you’re done. That’s not the future; that’s the present.

If you started in this profession in the late 1990s as I did, you might tell yourself, “I’m going to be that on-site expert that will always be in need.” You might actually make it through to retirement with a few PowerShell classes, some automation experience, and a few sessions at a conference about System Center.

But if you’re looking for any longevity in your IT career, you need to move past that kind of minimal coping. The shift to a full cloud-based infrastructure-as-a-service (IaaS) or platform-as-a-service (PaaS) model is imminent. Yes, in the next five years there’ll be hybrid and convergence solutions teed up across the board to provide a transition from on-premises to cloud, but ultimately there’ll be little on-premises IT left for admins.

I recently asked a room of IT pros how many of them had worked with on-premises virtualization for their servers. They all raised their hands proudly, as if they were “modern” and not living in the last decade. Then I asked how many have used a cloud-based service like Azure for testing or production servers. Only two out of 30 people raised their hands. That shows the risk most IT admins face today, whether they know it or not. For IT, 2000 to 2010 was about virtualization, but that’s done. It’s no longer modern, simply the new legacy. The current decade may be focused on convergence and hybrid models. After that, it’s all cloud from what I can see.

To avoid irrelevance, you must change and grow. You can dig your heels in and cling to the past, as I have done often enough, but that will not help you once cloud-based tools overcome the increasingly addressed concerns over security, availability, performance, and flexibility.

You won’t be the first to make such an adjustment. Thomas Edison, who invented the earliest system for electricity distribution in 1880, initially used direct current (DC). He didn’t want to adopt alternating current (AC), though it could use higher voltages with transformers to step it down for distribution to homes and businesses. Competitors like Westinghouse were all about AC, but Edison doubled down and waged a propaganda war against the technology, trying to ban AC and even electrocuting animals publicly to demonstrate its danger. Ultimately, as we know, he failed, and AC became our norm.

Once you see technology going in a direction that’s unstoppable, don’t get in the way or try to stem the tide. Instead, get on board. The cloud train is here.

02

Apr 2014

All About PDF – Part 3 of 3

To wrap up our three-part series on the PDF format, I want to talk about the PDF/A format.  PDF/A was developed specifically to address the long-term archivability and accessibility needs of document images that have long-term retention requirements.

What is PDF/A?

PDF/A is a family of ISO standards for formats of Adobe PDF intended to be suitable for long-term preservation of page-oriented documents. The standard defines:

“… a file format based on PDF, known as PDF/A, which provides a mechanism for representing electronic documents in a manner that preserves their visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files.”

The Standard does not define an archiving strategy or the goals of an archiving system. It identifies a file format for electronic documents that ensures the documents can be reproduced the exact same way in years to come. A key element of this reproducibility is the requirement for PDF/A documents to be 100% self-contained. All of the information necessary for displaying the document in the same manner every time is embedded in the file including, but not limited to, all content (text, raster images and vector graphics), fonts, and colour information. A PDF/A document is not permitted to rely on information from external sources, such as fonts that are not embedded or externally linked content.

PDF/A-1 (the first PDF/A standard) is a file format, a proper subset of Adobe PDF 1.4 (Acrobat 5), defined by international standard ISO 19005-1:2005. It was created to facilitate the long-term storage of digital documents. As a proper subset of PDF 1.4, PDF/A-1 files should be readable by any PDF reader that conforms to PDF 1.4 or higher.

PDF/A-2 extends the PDF/A-1 standard, is based on PDF 1.7 (Acrobat 8), and is an ISO Standard (ISO 19005-2).

Introduction


Adobe PDF (Portable Document Format) is a universal file format for document exchange that preserves all the fonts, formatting, colours, and graphics of any source document (whether it originated on paper or from the Web or other electronic sources). Preservation is faithful regardless of the application and platform used to create or view the material. Adobe PDF files can be shared, viewed, navigated, and printed on a broad range of operating systems by anyone using free Adobe Acrobat Reader™ software.


Traditional archiving methods (such as paper and microfilm or microfiche) guarantee reproducibility but are outdated for modern technology. Large documents cannot be quickly sent around the globe and it is difficult to search archived documents for specific content.  TIFF guarantees reproducibility in the long-term and has an established structure. TIFF is also easy to transmit in a worldwide business environment but is not easily searchable. PDF can be a more attractive archiving format than TIFF for a variety of reasons: PDF stores structured objects (e.g. text, vector graphics, raster images), allowing for an efficient full-text search in an entire archive; and metadata like title, author, creation date, modification date, subject, keywords, etc. can be embedded in a PDF file. PDF files can therefore store textual data (such as word processed documents and spreadsheets) and/or scanned documents. PDF files can be automatically classified based on the metadata, without requiring human intervention.

Scanned paper documents are stored in an image (rather than text) format. TIFF is a file format commonly used for storing digital versions of paper documents because it is a standard format for most scanners and software applications. However, the advent of Adobe Portable Document Format (PDF) has added new dimensions and powerful capabilities to electronic documents because Adobe PDF is more extensible than other image-based formats. Scanned images can be stored in PDF files along with textual information.

The inventor of the PDF Standard, Adobe Systems, publishes new versions of PDF frequently. Each new version has enriched the format with countless new features and has updated some of the older features. It was therefore necessary to define a stable derivative of the PDF format, based on Adobe’s proprietary PDF specification, that could be internationally accepted as a Standard for long-term electronic archiving. The result: PDF/A.

The PDF/A Standard


ISO 19005-1 defines “a file format based on PDF, known as PDF/A-1, which provides a mechanism for representing electronic documents in a manner that preserves their visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files.” (from ISO 19005-1). The Standard does not define an archiving strategy or the goals of an archiving system. It identifies a “profile” for electronic documents that ensures the documents can be reproduced in years to come.


A key element of this reproducibility is the requirement for PDF/A documents to be 100% self-contained. All of the information necessary for displaying the document in the same manner every time is embedded in the file. This includes all visible content such as text, raster images, vector graphics, fonts, colour information and much more. A PDF/A document, however, is not permitted to rely on any information from direct or indirect external sources, for example links to external image files or fonts that are not embedded.

PDF vs PDF/A


PDF in its native form cannot guarantee long-term reproducibility, or even the “WYSIWYG” (what you see is what you get) principle. Certain restrictions and amendments therefore had to be incorporated into the Standard. To be accepted, PDF/A (sometimes referred to as PDFA) needed to be based on an existing version of the PDF Reference and not on anticipated functionality in a future version. For the first version of the PDF/A standard, the ISO chose the Adobe PDF Reference 1.4, which Adobe implemented in Acrobat 5, as the basis for the Standard. The ISO Standard states that PDF/A-1 “shall adhere to all requirements of PDF Reference as modified by this part of ISO 19005”. The Standard itself identifies only differences with respect to the PDF Reference. In order to fully understand PDF/A, you also have to understand the PDF Reference 1.4.


In PDF/A-1, certain functionality allowed in PDF 1.4 has been specifically excluded, for example transparency and sound and movie actions. Conversely, some elements that the PDF Reference 1.4 describes as optional, such as embedded fonts, are mandatory in PDF/A.
 

In short, PDF/A-1 is based on the Adobe PDF Reference 1.4 (Acrobat 5), with specific features being either mandatory, recommended, restricted, or prohibited.

PDF/A-1 files must include:

  •  Embedded fonts
  •  Device-independent colour
  •  XMP metadata

PDF/A-1 files may not include:

  •  Encryption
  •  LZW Compression
  •  Embedded files
  •  External content references
  •  PDF Transparency
  •  Multi-media
  •  JavaScript
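The two lists above can be read as a simple checklist. As a rough sketch (this is not a real PDF/A validator, and the feature labels and function name are made up for illustration), a pre-flight check over a set of features already detected in a file might look like this in Python:

```python
# Hypothetical labels for the features listed above; a real validator would
# have to detect these by parsing the PDF file itself.
REQUIRED = {"embedded_fonts", "device_independent_colour", "xmp_metadata"}
PROHIBITED = {
    "encryption",
    "lzw_compression",
    "embedded_files",
    "external_content_references",
    "transparency",
    "multimedia",
    "javascript",
}


def check_pdfa1_features(features):
    """Report which required features are missing and which banned ones appear."""
    features = set(features)
    return {
        "missing": sorted(REQUIRED - features),
        "prohibited": sorted(PROHIBITED & features),
    }


report = check_pdfa1_features({"embedded_fonts", "xmp_metadata", "javascript"})
print(report)
```

An empty `missing` list and an empty `prohibited` list would mean the file passes this (very simplified) checklist.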


PDF/A has been established as a set of standards with several parts:

The PDF/A-1 : A-1a, A-1b Standards (ISO 19005-1)

PDF/A-1 (Part 1) has been approved. PDF/A-1 is further subdivided into two levels of compliance: PDF/A-1a and PDF/A-1b.

PDF/A-1a (Level A Conformance) denotes full compliance with the currently approved PDF/A Standard ISO 19005-1: Part 1. In addition to exact visual reproduction, it also includes mapping text to Unicode and structuring of the document content.

PDF/A-1b (Level B Conformance) is a “minimal compliance” level for PDF/A. PDF/A-1b requirements are meant to ensure that the rendered visual appearance of the file is reproducible over the long-term. It requires exact visual reproduction only.

PDF/A-1a and PDF/A-1b differ primarily with respect to text extraction.  The difference between PDF/A-1a and -1b has no impact for scanned documents, provided the files have not been enhanced by means of OCR for searching (PDF/A Searchable Files).

The PDF/A-2 : A-2a, A-2b, and A-2u Standards (ISO 19005-2)


PDF/A-2 is a further development of PDF/A that addresses some of the new features added with versions 1.5, 1.6 and 1.7 of the PDF Reference. PDF/A-2 is backwards compatible, i.e. all valid PDF/A-1 documents should also be compliant with PDF/A-2. However, PDF/A-2 compliant files will not necessarily be PDF/A-1 compliant. PDF/A-2 is based on PDF 1.7 (as defined in ISO 32000-1), which supports a range of improvements in document technology such as JPEG2000 compression, transparency effects and layers, the embedding of OpenType fonts, and provisions for digital signatures in accordance with the PDF Advanced Electronic Signatures standard.  PDF/A-2 also allows archiving of sets of documents as individual documents in one file.

The PDF/A-2 standard defines three levels of conformance:

PDF/A-2a (Level A conformance) satisfies all requirements in the ISO 19005-2 specification.
PDF/A-2b (Level B conformance) is a lower level of conformance, “encompassing the requirements of this part of ISO 19005 regarding the visual appearance of electronic documents, but not their structural or semantic properties.”
PDF/A-2u (Level U conformance) is an intermediate level of conformance introduced with PDF/A-2; it represents Level B conformance with the additional requirement that all text in the document have Unicode equivalents.
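One way to picture how the conformance levels build on one another is as nested sets of requirements. The Python sketch below is a deliberate simplification based only on the descriptions in this post (visual appearance for Level B, plus Unicode text for Level U, plus document structure for Level A); the property labels and function name are hypothetical, and the real standards contain many more detailed rules.

```python
# Levels available in each part of the standard, strictest first.
LEVELS_BY_PART = {1: ("a", "b"), 2: ("a", "u", "b"), 3: ("a", "u", "b")}

# Simplified, cumulative requirements per level (illustrative labels only).
REQUIREMENTS = {
    "b": {"visual_appearance"},
    "u": {"visual_appearance", "unicode_text"},
    "a": {"visual_appearance", "unicode_text", "logical_structure"},
}


def highest_level(part, properties):
    """Return the strictest conformance label the given properties satisfy."""
    properties = set(properties)
    for level in LEVELS_BY_PART[part]:
        if REQUIREMENTS[level] <= properties:
            return f"PDF/A-{part}{level}"
    return None


print(highest_level(2, {"visual_appearance", "unicode_text"}))  # PDF/A-2u
print(highest_level(1, {"visual_appearance", "unicode_text"}))  # PDF/A-1b
```

The second call falls back to Level B because PDF/A-1 has no Level U.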

The PDF/A-2 support for JPEG 2000 is restricted in ways that increase compatibility with PDF/X, e.g., constraining the number of colour channels (to 1, 3, or 4).


The PDF/A-3 : A-3a, A-3b, and A-3u Standards (ISO 19005-3)

PDF/A-3 adds one feature to PDF/A-2: it permits embedding files in any other format within a PDF/A file, not just other PDF/A files (as permitted in PDF/A-2). This has important implications for archiving; see the note below.

As in PDF/A-2, the PDF/A-3 standard defines three levels of conformance:

PDF/A-3a (Level A conformance) satisfies all requirements in the ISO 19005-3 specification.
PDF/A-3b (Level B conformance) is a lower level of conformance, satisfying requirements intended to be those minimally necessary to ensure that the rendered visual appearance of a conforming file is preservable over the long term. The specification notes that “Level B conforming files might not have sufficiently rich internal information to allow for the preservation of the document’s logical structure and content text stream in natural reading order, which is provided by Level A conformance.”
PDF/A-3u (Level U conformance) is an intermediate level of conformance; it corresponds to Level B conformance with the additional requirement that all text in the document have Unicode equivalents.

Important Note – PDF/A-3 as an archival format
    In a PDF/A-3 file, any embedded files should be considered ‘non-archival’. In other words, the embedded file is considered as only of short-term or temporary use.  Only the primary PDF content with its visible page display should be considered as ‘archived’ for the long term.

 

PDF/A requires a complete solution


PDF/A is only part of a complete archiving solution. PDF/A alone does not guarantee long-term archiving and it does not guarantee that information will be displayed as desired. PDF/A also does not claim that a PDF/A-based archive is always the best solution. However, if you decide to use PDF, then PDF/A defines a set of requirements that make long-term archiving possible.

Other aspects that must be taken into account when implementing a PDF/A-compliant archive include, for example, corporate standards and procedures, reliable data sources, reliable fonts, quality management and special individual requirements. The migration of current paper- or TIFF-based archives to PDF/A-compliant archives is not an insignificant task and must be well planned.

Both Microsoft (Office 2007) and OpenOffice (from release 2.4) have added PDF/A export to their office software.

We hope you have enjoyed our series on PDF formats, their history and variants, and that you have a better understanding of the pros and cons of each format in your document management strategy.  If you still need help deciding which format is best for your organization, let the WebDocs team help guide you with friendly, free advice based on our years and years of document management experience!  We’re always glad to help.

25

Mar 2014

All About PDF – Part 2 of 3

Last week we looked at the history and major variants of the Adobe PDF format.  This week I want to delve into the searchable PDF format which is the most often requested variant of the four major types we discussed last week.

Introduction

Scanned paper documents are stored in an image (rather than text) format.  TIFF is a file format commonly used for storing digital versions of paper documents because it is a standard format for most scanners and software applications. However, the advent of Portable Document Format (PDF) has added new dimensions and powerful capabilities to electronic documents because PDF is more extensible than other image-based formats.

PDF (Portable Document Format) is a universal file format for document exchange that preserves all the fonts, formatting, colours, and graphics of any source document (whether it’s on paper or from the Web or other electronic sources). Preservation is faithful regardless of the application and platform used to create or view the material. PDF files can be shared, viewed, navigated, and printed on a broad range of operating systems by anyone using free Adobe Acrobat Reader™ or other software.

With scanning software, volumes of legacy paper documents may be converted to PDF so you can search, annotate, publish, and archive all of your information in a digital environment. 

However, there are different types of PDF for use when scanning paper-based documents:

PDF Image Only
PDF Searchable

PDF Image Only

PDF Image Only is the simplest scanning format, for documents that don’t require searchable text.  PDF Image Only takes a bitmapped image of a document (like a TIFF file) and applies a PDF wrapper to that raster image. Because PDF Image Only files do not contain OCR text, their content is not searchable. But the file can be integrated with other Adobe PDF documents and read by anyone on any platform with Adobe Acrobat Reader software. In addition, you can add keywords to the file, so you can search for the document later.

PDF Image Only is ideal for transactional documents, such as invoices and forms. For example, you can use Image Only to scan invoices into an imaging archive. Digital versions of invoices must be absolutely faithful to the originals, yet they are rarely retrieved once they have been entered into the system. When an invoice does need to be retrieved, it can easily be found with an index search for the invoice number or customer name.

PDF Searchable 

PDF Searchable Image is a PDF Image Only document with the addition of a text layer beneath the image. This approach retains the look of the original page while enabling text searchability.

A document created in PDF Searchable Image offers the best of both worlds—an exact replica of the original document that is also fully searchable. PDF Searchable Image files contain two layers: a bitmapped (image) layer and a hidden text layer. The bitmapped layer maintains the visual representation of the original document. The text layer contains the Optical Character Recognition (OCR) version so you can search for any word on any page. PDF Searchable Image comes in two variants: Exact and Compact. These two are similar in many ways, but they have a few key differences.

PDF Searchable Image Exact

The Exact version of PDF Searchable Image—also known as PDF Image+Text—is great for preserving your most richly coloured, intricately designed documents. This PDF flavour stores image information on one layer and maintains a text version of the document on another hidden layer, so you can easily search your documents.  The Exact option preserves colour as 8-bit to 24-bit files, so you can distinguish between shades of the same colour and between multiple colours on a page. The trade-off is a larger file size. So if you plan to post your files to your intranet or e-mail them to co-workers around the globe, PDF Searchable Image Exact may not be the best option. However, if you are archiving your corporate data for later use, your need for accurate, searchable files may outweigh concerns about file size. In that case, PDF Searchable Image Exact may be preferable.

PDF Searchable Image Exact is the format normally used for searchable PDF scanning and is often referred to simply as PDF Searchable.

PDF Searchable Image Compact

PDF Searchable Image Compact uses a colour-segmentation process to create small file sizes from certain types of colour documents. The Compact format is advantageous when the document you need to scan has some regions that are colour images and some regions that are monochrome (for example, text in any two colours). When you choose the Compact option, the software should automatically segment the page into two types of regions. Image (colour) regions are stored within the PDF file as JPEG data. Text (monochrome) regions are stored within the file as G4 or Zip compressed data.  Depending on how large the text regions are in the original document, this storage process can substantially reduce file size. For example, yellow text on a blue background that would otherwise be saved as 8-bit to 24-bit colour can now be saved as 1-bit colour.
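To get a feel for why bit depth matters so much, here is a back-of-the-envelope calculation in Python. It assumes a letter-size page scanned at 300 dpi (both numbers are only illustrative) and looks at raw, uncompressed pixel data, so real files will be smaller once JPEG, G4 or Zip compression is applied.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page at a given bit depth."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


colour = raw_page_bytes(8.5, 11, dpi=300, bits_per_pixel=24)
bitonal = raw_page_bytes(8.5, 11, dpi=300, bits_per_pixel=1)
print(colour, bitonal, colour // bitonal)  # 25245000 1051875 24
```

Before any compression, a region stored as 1-bit data takes one twenty-fourth of the space of the same region stored as 24-bit colour.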

By producing smaller files, PDF Searchable Image Compact makes it easier for you to share your electronic documents, output them to printers, and post them on your Web site. The Compact option works best for documents that have either a few colours or colours that are distinct from one another. For example, corporate letterhead is a good candidate for PDF Searchable Image Compact because logos with limited colour that would otherwise have to be saved as large, 8-bit images can be saved as 1-bit images.

Text Accuracy 

The OCR process required to create PDF Searchable Image typically provides text accuracy of 97 to 99 percent. One to three wrong characters for every 100 may seem like a lot of errors, but this is not a problem for the applications that this approach is designed for. Since the user sees a scanned image representation of the original paper page, OCR errors will not be visible to the eye. The errors are only an issue when searching or copying text, which accesses the hidden text layer.  If a higher accuracy level is desired, the document will have to be manually proofread and corrected.
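For a rough idea of what those percentages mean in practice, the short Python sketch below estimates how many characters on a page will be wrong and how likely a search term is to sit error-free in the hidden text layer. It assumes errors are spread evenly and independently across characters, which is a simplification, and the function names are just for illustration.

```python
def expected_errors(char_count, accuracy):
    """Expected number of misread characters on a page."""
    return char_count * (1 - accuracy)


def word_match_probability(word_length, accuracy):
    """Chance that every character of a word was read correctly,
    assuming character errors are independent of one another."""
    return accuracy ** word_length


print(round(expected_errors(2000, 0.97)))           # about 60 characters
print(round(word_match_probability(8, 0.97), 2))    # about 0.78
```

In other words, on a 2,000-character page at 97 percent accuracy, an eight-letter search term has roughly a 78 percent chance of being found exactly as typed.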

Next week we will wrap up our series on PDF discussing PDF/A, the PDF variant that was specifically developed for archiving and long term accessibility needs.

18

Mar 2014

All About PDF – Part 1 of 3

Whenever we start a new document imaging project, one of the questions that inevitably comes up is “What format should we scan our documents into?”  The PDF format is the one that usually generates the most questions and requires the most explanation.  In this three-part series, we will take a look at the history of the PDF format and try to explain some of the various sub-formats that exist.

The Portable Document Format (PDF) is the file format created by Adobe Systems in 1993 for document exchange. PDF is a fixed-layout format used for representing two-dimensional documents in a manner independent of the application software, hardware, and operating system. Each PDF file encapsulates a complete description of a 2-D document (and, with Acrobat 3-D, embedded 3-D documents) that includes the text, fonts, images, and 2-D vector graphics that compose the documents.

PDF is a universal file format for document exchange that preserves all the fonts, formatting, colours, and graphics of any source document (whether it’s on paper or from the Web or other electronic sources). Preservation is faithful regardless of the application and platform used to create or view the material. PDF files can be shared, viewed, navigated, and printed on a broad range of operating systems by anyone using free Adobe Acrobat Reader™ software.

PDF files look and print exactly as intended across a wide variety of platforms. Free Acrobat Reader software is easy to download from the Adobe Web site. More than 200 million copies have been downloaded or preloaded onto personal computers. This means that if you want to distribute documents to a broad audience, the ubiquity of Acrobat Reader software ensures that all your PDF files can be read by anyone across a broad range of computing environments.

Navigating PDF documents, even long ones, is easy. Unlike standard TIFF files, PDF files can hold navigation information—such as hyperlinks for tables of contents, indexes, and URLs—all within one self-contained file. It’s easy to search for any word or number in a document and see the desired information in context on the page.

The PDF file format has changed several times, as new versions of Adobe Acrobat were released. There have been eight versions of PDF with corresponding Acrobat releases:

(1993) – PDF 1.0 / Acrobat 1.0
(1994) – PDF 1.1 / Acrobat 2.0
(1996) – PDF 1.2 / Acrobat 3.0
(1999) – PDF 1.3 / Acrobat 4.0
(2001) – PDF 1.4 / Acrobat 5.0
(2003) – PDF 1.5 / Acrobat 6.0
(2005) – PDF 1.6 / Acrobat 7.0
(2006) – PDF 1.7 / Acrobat 8.0
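A PDF file normally begins with a header of the form %PDF-x.y, so the version can often be read straight from the first few bytes. The sketch below is illustrative only: the function name is hypothetical, the lookup table simply copies the list above, and it ignores the fact that later PDF versions allow the header value to be overridden inside the file.

```python
import re
from typing import Optional

# Version-to-release pairs copied from the list above.
ACROBAT_BY_PDF_VERSION = {
    "1.0": "Acrobat 1.0",
    "1.1": "Acrobat 2.0",
    "1.2": "Acrobat 3.0",
    "1.3": "Acrobat 4.0",
    "1.4": "Acrobat 5.0",
    "1.5": "Acrobat 6.0",
    "1.6": "Acrobat 7.0",
    "1.7": "Acrobat 8.0",
}


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version from a leading '%PDF-x.y' header, or None."""
    match = re.match(rb"%PDF-(\d\.\d)", data)
    return match.group(1).decode("ascii") if match else None


# Made-up sample bytes standing in for the start of a real file.
sample = b"%PDF-1.4\n%rest of file"
version = pdf_header_version(sample)
print(version, "->", ACROBAT_BY_PDF_VERSION.get(version, "unknown release"))
```

To try it on a real document, read the first few bytes of the file in binary mode and pass them to pdf_header_version.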

The Four Types of Adobe PDF documents

With scanning software, volumes of legacy paper documents may be converted to Adobe PDF so you can search, annotate, publish, and archive all of your information in a digital environment. 

However, there are four different types of Adobe PDF for use with paper-based documents:

PDF Image Only
PDF Searchable Image Exact
PDF Searchable Image Compact
PDF Formatted Text and Graphics

Each provides distinct advantages that enable you to customise your electronic files to meet your information needs. By examining the type of document you’re converting and how you intend to use the electronic file, you can choose the most suitable PDF option.

The following table summarises the four types and the best use for each:

PDF Type: Notes

PDF Image Only:
• Transactional documents
• Documents that don’t require searchable text
• The simplest scanning

PDF Searchable Image Exact (also known as PDF Image+Text):
• Documents that need to retain scanned images for legal accuracy
• Documents that you need to be able to search quickly
• Full-colour documents
• The OCR’d text is held ‘behind’ the document to allow full-text searching

PDF Searchable Image Compact:
• Documents that need to retain their original look but must be reduced to the smallest possible file size for network distribution or Web posting
• Possibly ‘lossy’ compression

PDF Formatted Text and Graphics (also known as PDF Normal):
• Documents that need to have the highest possible on-screen viewing and printing quality
• Documents that must be reduced to the smallest possible file size for network distribution or Web posting
• A conversion from a scanned image to a text + graphics document
• May be labour intensive to achieve an accurate document
• May not be legally acceptable
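
The table can be read as a small decision procedure. The Python sketch below is one possible simplification of it, not an official rule set: the function name and its three yes/no questions are assumptions chosen for illustration, and real projects will usually weigh more factors (legal requirements, colour content, budget for proofreading).

```python
def choose_pdf_type(needs_search: bool,
                    keep_scanned_image: bool,
                    smallest_file: bool) -> str:
    """Suggest a PDF type from three simplified yes/no questions."""
    if not needs_search:
        # No text layer needed: the simplest scanning option.
        return "PDF Image Only"
    if keep_scanned_image:
        # The scan must stay visible; trade fidelity against file size.
        if smallest_file:
            return "PDF Searchable Image Compact"
        return "PDF Searchable Image Exact"
    # The scan is replaced by OCR'd text and graphics.
    return "PDF Formatted Text and Graphics"


# Example: an archive that must keep the scan and be searchable.
print(choose_pdf_type(needs_search=True,
                      keep_scanned_image=True,
                      smallest_file=False))
```

For the archive example, the sketch suggests PDF Searchable Image Exact, which matches the "legal accuracy" row of the table.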

PDF Image Only

PDF Image Only takes a bitmapped image of a document (like a TIFF file) and applies a PDF wrapper to that raster image. Because PDF Image Only files do not contain OCR text, their content is not searchable. However, the file can be integrated with other PDF documents and read by anyone on any platform with Adobe Acrobat Reader software. In addition, you can add keywords to the file, so you can search for the document later.

PDF Image Only is ideal for transactional documents, such as invoices and forms. For example, you can use Image Only to scan invoices into an imaging archive. Digital versions of invoices must be absolutely faithful to the originals, yet they are rarely retrieved once they have been entered into the system. When an invoice does need to be retrieved, it can easily be found with an index search for the invoice number or customer name.

PDF Searchable Image Exact and Compact

A document created in PDF Searchable Image offers the best of both worlds—an exact replica of the original document that is also fully searchable. PDF Searchable Image files contain two layers: a bitmapped layer and a hidden text layer. The bitmapped layer maintains the visual representation of the original document. The text layer contains the OCR version so you can search for any word on any page. PDF Searchable Image comes in two variants: Exact and Compact. These two are similar in many ways, but they have a few key differences.
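
The two-layer idea can be modelled in a few lines of Python. This is a conceptual sketch only, not how a PDF is actually stored on disk: the SearchablePage class, its field names, and the sample text are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SearchablePage:
    image: bytes       # bitmapped layer: what the reader sees
    hidden_text: str   # OCR layer: what search and copy operate on


def pages_containing(pages: List[SearchablePage], term: str) -> List[int]:
    """Return 1-based page numbers whose hidden text contains the term."""
    needle = term.lower()
    return [number
            for number, page in enumerate(pages, start=1)
            if needle in page.hidden_text.lower()]


document = [
    SearchablePage(image=b"<scan of page 1>", hidden_text="Invoice 1042 for Acme Ltd"),
    # Simulated OCR mistake: a capital I misread as a lowercase l.
    SearchablePage(image=b"<scan of page 2>", hidden_text="lnvoice 1043 for Acme Ltd"),
]

print(pages_containing(document, "invoice"))  # finds page 1 only
print(pages_containing(document, "acme"))     # finds both pages
```

The second page shows why OCR accuracy matters for search even though the visible image is unaffected: the misread word is invisible to the reader but breaks the match.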

Exact

The Exact version of PDF Searchable Image—also known as PDF Image+Text—is great for preserving your most richly coloured, intricately designed documents. This PDF flavour stores image information on one layer and maintains a text version of the document on another hidden layer, so you can easily search your documents.  The Exact option preserves colour as 8-bit to 24-bit files, so you can distinguish between shades of the same colour and between multiple colours on a page. The trade-off is a larger file size. PDF Searchable Image Exact may not be the best option where small file sizes are required. However, if you are archiving corporate information, your need for accurate, searchable files may outweigh concerns about file size. In that case, PDF Searchable Image Exact may be preferable.
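
To see why colour depth drives file size, the following Python sketch computes the uncompressed size of a single scanned page at different bit depths. The page size (8.5 x 11 inches) and resolution (300 dpi) are assumptions for illustration, and the figures ignore compression and row padding, so real PDF files will be smaller.

```python
def raw_raster_bytes(width_in: float, height_in: float,
                     dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of a scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed scan settings: 8.5 x 11 inch page at 300 dpi.
for bits in (1, 8, 24):
    size = raw_raster_bytes(8.5, 11, 300, bits)
    print(f"{bits:>2}-bit: {size / 1_000_000:.1f} MB uncompressed")
```

Before compression, a 24-bit page needs 24 times the storage of a 1-bit page, which is the trade-off behind choosing between the Exact and Compact options.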

Compact

PDF Searchable Image Compact uses a new colour-segmentation process to create small file sizes from certain types of colour documents. The Compact format is advantageous when the document you need to scan has some regions that are colour images and some regions that are monochrome (for example, text in any two colours).  With the Compact option, a page is segmented into two types of regions. Image (colour) regions are stored within the PDF file as JPEG data. Text (monochrome) regions are stored within the file as G4 or Zip compressed data.  Depending on how large the text regions are in the original document, this storage process can substantially reduce file size.  By producing smaller files, PDF Searchable Image Compact makes it easier for you to share your electronic documents, output them to printers, and post them on your Web site. 
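
The segmentation step described above can be sketched as a simple classification. This is a hypothetical illustration, not the actual algorithm used by any scanning product: the Region class, the region names, and the "two colours or fewer counts as monochrome" rule are assumptions drawn from the example in the text.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    distinct_colours: int


def storage_for(region: Region) -> str:
    """Pick a storage method for one region of a page."""
    # Assumed rule: two colours or fewer is treated as monochrome text.
    if region.distinct_colours <= 2:
        return "G4 or Zip"
    return "JPEG"


# A made-up letterhead page split into three regions.
page = [
    Region("logo", distinct_colours=40),
    Region("body text", distinct_colours=2),
    Region("signature photo", distinct_colours=5000),
]

for region in page:
    print(f"{region.name}: stored as {storage_for(region)}")
```

The larger the share of the page that falls into the monochrome branch, the more the file shrinks, which is why text-heavy documents with a small logo suit the Compact option.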

PDF Formatted Text and Graphics

PDF Formatted Text and Graphics, also known as PDF Normal, is the most widely seen type of PDF file. It is the usual PDF output produced from a text processing or authoring environment, such as Microsoft Word. It contains the full text of the page with appropriate coding to define fonts, font sizes, and so on. Many applications come with a ‘Save As PDF’ or ‘Print To PDF’ function, which allows users to convert their documents to PDF Normal.

This format does not use bitmapped images but has true computer-generated text and graphics, using only one layer. This makes PDF Formatted Text and Graphics the most compact of the four PDF types.  Additionally, documents created in this type are fully searchable, and they look as good and print as well as files generated from software applications. Rather than viewing a scanned image, you see computer-generated text and graphics that scale and retain their crispness on-screen and in print.

If used for scanned documents, creating PDF Normal becomes significantly more complicated and expensive. The scanned image is not stored but is converted using OCR to text and graphics. This requires proofreading and correction.  The trade-off for these compact, high-quality files is that it takes more time and effort to ensure they are 100% error-free.  Because the image text is replaced with formatted text, PDF Formatted Text and Graphics files are easy to read. And because this type of PDF produces small file sizes, you can easily post these files to the Web or e-mail them to clients and colleagues.  PDF Formatted Text and Graphics is also ideal for out-of-print documents, such as rare books. Once scanned, the material can be converted to computer-generated text and graphics and then fine-tuned to produce a version of the original with identical or, in many cases, greatly improved quality.

PDF/A

PDF/A is a variation of PDF for document archiving and is known as International Standard ISO 19005. PDF/A is based on PDF release 1.4 (Acrobat 5.0).  We will discuss PDF/A in more detail in Part 3 of this series.

Next week we will examine the searchable PDF format in more detail, as well as the different variants available in that format.  Until then, Happy Scanning!