PDF/A Archiving Format

PDF/A is an ISO Standard (ISO 19005-1:2005) for using the PDF format for the long-term archiving of electronic documents.
The Standard does not define an archiving strategy or the goals of an archiving system. It identifies a file format for electronic documents that ensures the documents can be reproduced in exactly the same way in years to come. A key element of this reproducibility is the requirement for PDF/A documents to be 100% self-contained. All of the information necessary for displaying the document in the same manner every time is embedded in the file, including, but not limited to, all content (text, raster images and vector graphics), fonts, and colour information. A PDF/A document is not permitted to rely on information from external sources, such as fonts that are not embedded or externally linked content.

Introduction

Adobe PDF is a universal file format for document exchange that preserves all the fonts, formatting, colours, and graphics of any source document (whether it’s on paper or from the Web or other electronic sources). Preservation is faithful regardless of the application and platform used to create or view the material. Adobe PDF files can be shared, viewed, navigated, and printed on a broad range of operating systems by anyone using free Adobe Acrobat Reader™ software.

Traditional archiving media (such as paper, microfilm and microfiche) guarantee reproducibility but are poorly suited to modern workflows: large documents cannot be sent quickly around the globe, and archived documents are difficult to search for specific content. TIFF also guarantees long-term reproducibility, has an established structure and is easy to transmit in a worldwide business environment, but it is not easily searchable. PDF can be a more attractive archiving format than TIFF for a variety of reasons: PDF stores structured objects (e.g. text, vector graphics, raster images), allowing efficient full-text search across an entire archive; and metadata such as title, author, creation date, modification date, subject and keywords can be embedded in a PDF file.
PDF files can therefore store textual data (such as word-processed documents and spreadsheets), scanned documents, or both.
PDF files can also be classified automatically on the basis of their metadata, without human intervention.
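
As a rough illustration of metadata-based classification, the Python sketch below routes a document to an archive category using its subject and keywords. The `DocMetadata` class, the `RULES` table and the category names are hypothetical and exist only for this example; a real system would first read these fields from the PDF with a PDF library of its choice.

```python
from dataclasses import dataclass


@dataclass
class DocMetadata:
    """Subset of the metadata fields a PDF file can carry (hypothetical container)."""
    title: str = ""
    author: str = ""
    subject: str = ""
    keywords: tuple = ()


# Hypothetical routing rules: archive category -> words that trigger it.
RULES = {
    "invoices": {"invoice", "billing"},
    "contracts": {"contract", "agreement"},
}


def classify(meta, rules=RULES, default="unclassified"):
    """Return the first category whose trigger words appear in the metadata."""
    words = {k.lower() for k in meta.keywords}
    words.update(meta.subject.lower().split())
    for category, triggers in rules.items():
        if words & triggers:
            return category
    return default
```

For example, `classify(DocMetadata(subject="Service Agreement 2008"))` returns `"contracts"`, while a document whose metadata matches no rule falls back to `"unclassified"`.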

Scanned paper documents are stored in an image (rather than text) format.
TIFF is a file format commonly used for storing digital versions of paper documents because it is a standard format for most scanners and software applications. However, the advent of Adobe Portable Document Format (PDF) has added new dimensions and powerful capabilities to electronic documents, because Adobe PDF is more extensible than image-based formats. Scanned images can be stored in PDF files along with textual information.

The inventor of the PDF format, Adobe Systems, publishes new versions of PDF frequently. Each new version has enriched the format with many new features and has updated some of the older ones. It was therefore necessary to define a stable derivative of the PDF format, based on Adobe’s proprietary PDF specification, that could be internationally accepted as a Standard for long-term electronic archiving. The result: PDF/A.

The PDF/A-Standard

ISO 19005-1 defines “a file format based on PDF, known as PDF/A, which provides a mechanism for representing electronic documents in a manner that preserves their visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files.” The Standard does not define an archiving strategy or the goals of an archiving system. It identifies a “profile” for electronic documents that ensures the documents can be reproduced in years to come.

A key element of this reproducibility is the requirement for PDF/A documents to be 100% self-contained. All of the information necessary for displaying the document in the same manner every time is embedded in the file. This includes all visible content, such as text, raster images and vector graphics, as well as fonts and colour information. A PDF/A document is not permitted to rely on information from direct or indirect external sources, for example links to external image files or fonts that are not embedded.

PDF vs PDF/A

PDF in its native form cannot guarantee long-term reproducibility, nor even the “WYSIWYG” (what you see is what you get) principle. Certain restrictions and amendments therefore had to be incorporated into the Standard. To be accepted, PDF/A needed to be based on an existing version of the PDF Reference and not on anticipated functionality in a future version. The ISO chose the Adobe PDF Reference 1.4, which Adobe implemented in Acrobat 5, as the basis for the Standard. The ISO Standard states that PDF/A “shall adhere to all requirements of PDF Reference as modified by this part of ISO 19005”. The Standard itself identifies only the differences with respect to the PDF Reference. To fully understand PDF/A, you therefore also have to understand the PDF Reference 1.4.

Certain functionality allowed in PDF 1.4 is specifically excluded from PDF/A, for example transparency and sound and movie actions. Conversely, some elements that are optional in the PDF Reference 1.4, such as embedded fonts, are mandatory in PDF/A.

In short, PDF/A is based on the Adobe PDF Reference 1.4 (Acrobat 5), with specific features being either mandatory, recommended, restricted, or prohibited.

PDF/A-1 files must include:

  • Embedded fonts
  • Device-independent colour
  • XMP metadata

PDF/A-1 files may not include:

  • Encryption
  • LZW compression
  • Embedded files
  • External content references
  • PDF transparency
  • Multimedia
  • JavaScript
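
As a rough illustration of the exclusions listed above, the Python sketch below scans the raw bytes of a file for PDF name tokens that usually accompany some of the prohibited features. The function name and the token-to-feature table are assumptions made for this example and are not taken from the Standard. This is a heuristic pre-check only: it can report false positives (a token may appear inside page text) and false negatives (names can be written in escaped form), so it is no substitute for a real PDF/A validator.

```python
# Standard PDF name tokens mapped to the prohibited feature they hint at.
# The selection is an assumption for this sketch and is far from complete.
PROHIBITED_MARKERS = {
    b"/Encrypt": "Encryption",
    b"/LZWDecode": "LZW compression",
    b"/EmbeddedFiles": "Embedded files",
    b"/JavaScript": "JavaScript",
    b"/Movie": "Multimedia (movie)",
    b"/Sound": "Multimedia (sound)",
}


def find_prohibited_features(pdf_bytes):
    """Return a sorted list of prohibited features whose marker token occurs."""
    found = [
        label
        for marker, label in PROHIBITED_MARKERS.items()
        if marker in pdf_bytes
    ]
    return sorted(found)
```

A typical use would be `find_prohibited_features(open(path, "rb").read())`, where `path` points to the file under test; an empty list means only that none of these tokens were seen, not that the file conforms.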

The PDF/A, A-1a, A-1b, and A-2 Standards

PDF/A has been established as a set of standards with several parts. 

Currently only PDF/A-1 (Part 1) has been approved. PDF/A-1 is further subdivided into two levels of compliance: PDF/A-1a and PDF/A-1b.

PDF/A-1a (Level A Conformance) denotes full compliance with ISO 19005-1 (Part 1), the currently approved PDF/A Standard. In addition to exact visual reproduction, it also requires mapping text to Unicode and structuring of the document content.

PDF/A-1b (Level B Conformance) is a “minimal compliance” level for PDF/A. PDF/A-1b requirements are meant to ensure that the rendered visual appearance of the file is reproducible over the long term. It requires exact visual reproduction only.

PDF/A-1a and PDF/A-1b differ primarily with respect to text extraction.
The difference between PDF/A-1a and PDF/A-1b has no impact on scanned documents, provided the files have not been enhanced by means of OCR for searching (PDF/A Searchable Files).
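
A PDF/A file declares its part and conformance level in its XMP metadata, using the `pdfaid:part` and `pdfaid:conformance` properties of the PDF/A identification schema (namespace `http://www.aiim.org/pdfa/ns/id/`). The Python sketch below reads that claim from the raw bytes of a file. The function name is hypothetical, and the regular expressions are a simplification that assumes the XMP packet is stored unencoded in UTF-8; a robust tool would parse the XMP as XML. The result is only what the file claims about itself, not proof of conformance.

```python
import re

# Each property may appear as an XML attribute (pdfaid:part="1")
# or as an element (<pdfaid:part>1</pdfaid:part>).
_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d+)")
_CONF = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def claimed_pdfa_level(data):
    """Return the claimed level, e.g. 'PDF/A-1b', or None if none is declared."""
    part = _PART.search(data)
    conf = _CONF.search(data)
    if not part or not conf:
        return None
    return "PDF/A-{}{}".format(
        part.group(1).decode("ascii"),
        conf.group(1).decode("ascii").lower(),
    )
```

An archive import job could use such a check to sort incoming files by claimed level before handing them to a full validator.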

PDF/A-2 - Future Development of PDF/A. A new part of the Standard, ISO 19005-2 (PDF/A-2), is currently being worked on by the ISO Technical Committee. PDF/A-2 will address some of the new features added in versions 1.5, 1.6 and 1.7 of the PDF Reference. PDF/A-2 should be backwards compatible, i.e. all valid PDF/A-1 documents should also be compliant with PDF/A-2. However, PDF/A-2 compliant files will not necessarily be PDF/A-1 compliant.

PDF/A requires a complete solution

PDF/A is only part of a complete archiving solution. PDF/A alone does not guarantee long-term archiving, and it does not guarantee that information will be displayed as desired. PDF/A also does not claim that a PDF/A-based archive is always the best solution. However, if you decide to use PDF, then PDF/A defines a set of requirements that make long-term archiving possible.

Other aspects that must be taken into account when implementing a PDF/A-compliant archive include, for example, corporate standards and procedures, reliable data sources, reliable fonts, quality management and special individual requirements. Migrating existing paper- or TIFF-based archives to PDF/A-compliant archives is a significant task and must be well planned.

Both Microsoft (Office 2007) and OpenOffice (from release 2.4) are adding PDF/A export to their office software.
