Searchable PDF

 


PDF Portable Document Format
PDF/A Archiving Format
TIFF Tagged Image File Format
TWAIN Scanning

Introduction

Scanned paper documents are stored in an image (rather than text) format.

TIFF is a file format commonly used for storing digital versions of paper documents because it is a standard format for most scanners and software applications. However, the advent of Portable Document Format (PDF) has added new dimensions and powerful capabilities to electronic documents because PDF is more extensible than other image-based formats.

PDF (Portable Document Format) is a universal file format for document exchange that preserves all the fonts, formatting, colours, and graphics of any source document (whether it’s on paper or from the Web or other electronic sources). Preservation is faithful regardless of the application and platform used to create or view the material. PDF files can be shared, viewed, navigated, and printed on a broad range of operating systems by anyone using free Adobe Acrobat Reader™ or other software.

With scanning software, volumes of legacy paper documents may be converted to PDF so you can search, annotate, publish, and archive all of your information in a digital environment. 

However, there are different types of PDF to choose from when scanning paper-based documents:

PDF Image Only
PDF Searchable

PDF Image Only

PDF Image Only is the simplest scanning option, suited to documents that don’t require searchable text.

PDF Image Only takes a bitmapped image of a document (like a TIFF file) and applies a PDF wrapper to that raster image. Because PDF Image Only files do not contain OCR text, their content is not searchable. However, the file can be combined with other Adobe PDF documents and read by anyone on any platform with Adobe Acrobat Reader software. In addition, you can add keywords to the file so that you can search for the document later.

PDF Image Only is ideal for transactional documents, such as invoices and forms. For example, you can use Image Only to scan invoices into an imaging archive. Digital versions of invoices must be absolutely faithful to the originals, yet they are rarely retrieved once they have been entered into the system. When an invoice does need to be retrieved, it can easily be found with an index search for the invoice number or customer name.

PDF Searchable 

PDF Searchable Image is a PDF Image Only document with the addition of a text layer beneath the image. This approach retains the look of the original page while enabling text searchability.

A document created in PDF Searchable Image offers the best of both worlds—an exact replica of the original document that is also fully searchable. PDF Searchable Image files contain two layers: a bitmapped (image) layer and a hidden text layer. The bitmapped layer maintains the visual representation of the original document. The text layer contains the Optical Character Recognition (OCR) version so you can search for any word on any page. PDF Searchable Image comes in two variants: Exact and Compact. These two are similar in many ways, but they have a few key differences.
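
The two-layer structure can be illustrated with a minimal sketch of a page content stream. It relies on the standard PDF text rendering mode 3 (the Tr operator), which draws text that is neither filled nor stroked, so it stays invisible but can still be searched and copied. The function names, the resource names Im0 and F1, and the word positions are assumptions made for this illustration; they do not describe how any particular scanning product writes its files, and a complete PDF also needs the page, image, and font objects that this fragment omits. The text is emitted before the image to mirror the "text beneath the image" description; because the text is invisible, the order does not change how the page looks.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def searchable_page_stream(width, height, words, image_name="Im0", font_name="F1"):
    """Build a content stream with invisible OCR text plus the scanned image.

    words is a list of (x, y, font_size, text) tuples in page units (points).
    image_name and font_name are hypothetical resource names for this sketch.
    """
    ops = ["BT", "3 Tr"]  # begin text; rendering mode 3 = invisible text
    for x, y, size, text in words:
        ops.append(f"/{font_name} {size} Tf")        # select font and size
        ops.append(f"1 0 0 1 {x} {y} Tm")            # absolute position of the word
        ops.append(f"({escape_pdf_string(text)}) Tj")  # show (invisible) text
    ops.append("ET")
    # Draw the scanned bitmap, scaled from the unit square to the full page.
    ops += ["q", f"{width} 0 0 {height} 0 0 cm", f"/{image_name} Do", "Q"]
    return "\n".join(ops)


if __name__ == "__main__":
    # One OCR word placed on a US Letter page (612 x 792 points).
    print(searchable_page_stream(612, 792, [(72, 700, 12, "Invoice (copy)")]))
```

In a real file, each word's position and size would come from the OCR engine so that a search hit highlights the matching area of the scanned image.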

PDF Searchable Image Exact

The Exact version of PDF Searchable Image, also known as PDF Image+Text, is great for preserving your most richly coloured, intricately designed documents. This PDF flavour stores image information on one layer and maintains a text version of the document on another hidden layer, so you can easily search your documents.
The Exact option preserves colour as 8-bit to 24-bit files, so you can distinguish between shades of the same colour and between multiple colours on a page. The trade-off is a larger file size. So if you plan to post your files to your intranet or e-mail them to co-workers around the globe, PDF Searchable Image Exact may not be the best option. However, if you are archiving your corporate data for later use, your need for accurate, searchable files may outweigh concerns about file size. In that case, PDF Searchable Image Exact may be preferable.
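
To get a feel for the file-size trade-off, the sketch below computes the uncompressed size of one scanned page at different bit depths. The page size (US Letter) and the 300 dpi resolution are assumptions chosen for illustration, and the function names are hypothetical. Real PDF files are compressed, so actual sizes will be smaller, but the ratio between bit depths shows why colour pages are so much heavier than 1-bit pages.

```python
def page_pixels(width_in: float, height_in: float, dpi: int):
    """Pixel dimensions of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_image_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Uncompressed bitmap size, with each row padded to a whole byte."""
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


if __name__ == "__main__":
    # Assumed example: US Letter (8.5 x 11 inches) scanned at 300 dpi.
    width, height = page_pixels(8.5, 11, 300)
    for depth in (1, 8, 24):
        print(depth, "bit:", raw_image_bytes(width, height, depth), "bytes")
    # 1 bit:  1052700 bytes
    # 8 bit:  8415000 bytes
    # 24 bit: 25245000 bytes
```

Before compression, a 24-bit page is roughly 24 times the size of the same page stored as 1-bit data.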

PDF Searchable Image Exact is the format normally used for searchable PDF scanning and is often referred to simply as PDF Searchable.

PDF Searchable Image Compact

PDF Searchable Image Compact uses a colour-segmentation process to create small file sizes from certain types of colour documents. The Compact format is advantageous when the document you need to scan has some regions that are colour images and some regions that are monochrome (for example, text in any two colours).
When you choose the Compact option, the scanning software automatically segments the page into two types of regions. Image (colour) regions are stored within the PDF file as JPEG data. Text (monochrome) regions are stored within the file as G4 or Zip compressed data.
Depending on how large the text regions are in the original document, this storage process can substantially reduce file size. For example, yellow text on a blue background that would otherwise be saved as 8-bit to 24-bit colour can now be saved as 1-bit data.
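
The idea of storing a two-colour region as 1-bit data can be sketched as follows. This is a toy illustration under simplifying assumptions, not the segmentation algorithm used by any particular product: it treats a region as monochrome only if it contains at most two exact colours, whereas real scans contain noise and anti-aliasing and need colour quantisation first. The function name and the data layout (rows of RGB tuples) are hypothetical.

```python
def to_bilevel(region):
    """Reduce a two-colour region to a 1-bit mask plus its two colours.

    region is a list of rows, each row a list of (r, g, b) tuples.
    Returns (background, foreground, bits), or None if the region is empty
    or uses more than two colours and should stay a colour image.
    """
    counts = {}
    for row in region:
        for pixel in row:
            counts[pixel] = counts.get(pixel, 0) + 1
            if len(counts) > 2:
                return None  # too many colours: keep as a colour region
    if not counts:
        return None
    # Treat the most frequent colour as the background.
    ordered = sorted(counts, key=counts.get, reverse=True)
    background = ordered[0]
    foreground = ordered[-1]
    bits = [[0 if pixel == background else 1 for pixel in row] for row in region]
    return background, foreground, bits


if __name__ == "__main__":
    BLUE, YELLOW = (0, 0, 255), (255, 255, 0)
    region = [
        [BLUE, BLUE, BLUE],
        [BLUE, YELLOW, BLUE],
        [BLUE, BLUE, BLUE],
    ]
    print(to_bilevel(region))
```

Each pixel in the mask needs one bit instead of up to 24, and the two colours are stored once for the whole region, which is where the size saving comes from.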

By producing smaller files, PDF Searchable Image Compact makes it easier for you to share your electronic documents, output them to printers, and post them on your Web site. The Compact option works best for documents that have either a few colours or colours that are distinct from one another. For example, corporate letterhead is a good candidate for PDF Searchable Image Compact because logos with limited colour that would otherwise have to be saved as large, 8-bit images can be saved as 1-bit images.

Text Accuracy 

The OCR process required to create PDF Searchable Image typically provides text accuracy of 97 to 99 percent. One to three wrong characters in every 100 may seem like a lot of errors, but this is not a problem for the applications this approach is designed for. Since the user sees a scanned image representation of the original paper page, OCR errors will not be visible to the eye. The errors are only an issue when searching or copying text, which accesses the hidden text layer.
If a higher accuracy level is desired, the document will have to be manually proofread and corrected.
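
The effect of character accuracy on searching can be estimated with a little arithmetic. The sketch below assumes that character errors are independent and evenly spread, which is a simplification (real OCR errors tend to cluster on poor-quality areas of a page). The function names and the example page length are hypothetical.

```python
def expected_errors(char_count: int, char_accuracy: float) -> float:
    """Expected number of wrong characters in a text of char_count characters."""
    return char_count * (1 - char_accuracy)


def word_hit_probability(word_length: int, char_accuracy: float) -> float:
    """Chance that every character of a word was recognised correctly,
    assuming independent character errors."""
    return char_accuracy ** word_length


if __name__ == "__main__":
    # Assumed example: a page of 2,000 characters.
    for accuracy in (0.97, 0.98, 0.99):
        print(
            f"accuracy {accuracy:.0%}: "
            f"about {expected_errors(2000, accuracy):.0f} errors per page, "
            f"7-letter word found with probability "
            f"{word_hit_probability(7, accuracy):.2f}"
        )
```

Under these assumptions, at 97 percent accuracy a search for a given seven-letter word matches a given occurrence about four times in five, which is usually acceptable when the word appears several times in a document but may matter for one-off terms such as reference numbers.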

 

 

See also: PDF, PDF/A archiving format

Alliance BatchScan can scan into PDF Searchable format

 

