PDF/A Archiving Format
PDF/A is an ISO Standard (ISO 19005-1:2005) for using the PDF format for the
long-term archiving of electronic documents.
The Standard does not define an archiving strategy or the goals of an archiving
system. It identifies a file format for electronic documents that ensures the
documents can be reproduced in exactly the same way in years to come. A key
element of this reproducibility is the requirement for PDF/A documents to be
100 % self-contained. All of the information necessary for displaying the
document in the same manner every time is embedded in the file including, but
not limited to, all content (text, raster images and vector graphics), fonts,
and colour information. A PDF/A document is not permitted to rely on information
from external sources, such as fonts that are not embedded or hyperlinked content.
Introduction
Adobe PDF is a universal file format
for document exchange that preserves all the fonts, formatting, colours, and
graphics of any source document (whether it’s on paper or from the Web or
other electronic sources). Preservation is faithful regardless of the
application and platform used to create or view the material. Adobe PDF files
can be shared, viewed, navigated, and printed on a broad range of operating
systems by anyone using free Adobe Acrobat Reader™ software.
Traditional archiving methods (such as paper and microfilm or microfiche)
guarantee reproducibility but are poorly suited to modern workflows: large
documents cannot be quickly sent around the globe, and it is difficult to search
archived documents for specific content. TIFF guarantees reproducibility
in the long term and has an established structure. TIFF is also easy to transmit
in a worldwide business environment but is not easily searchable. PDF can be a
more attractive archiving format than TIFF for two main reasons: PDF stores
structured objects (e.g. text, vector graphics, raster images), allowing for an
efficient full-text search across an entire archive; and metadata such as title,
author, creation date, modification date, subject and keywords can be
embedded in a PDF file.
PDF files can therefore store textual data (such as word processed documents and
spreadsheets) and/or scanned documents.
PDF files can be automatically classified based on the
metadata, without requiring human intervention.
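As a minimal sketch of what metadata-based classification could look like, the Python function below routes a document to an archive category by matching keywords against its metadata fields. The field names mirror the metadata listed above (title, subject, keywords); the `RULES` table, the category names and the `classify` function are hypothetical and only illustrate the idea. Reading the metadata out of an actual PDF file is left to whatever PDF library is in use.

```python
from typing import Mapping

# Hypothetical routing rules: archive category -> keywords that trigger it.
RULES = {
    "invoices": ("invoice", "billing"),
    "contracts": ("contract", "agreement"),
}


def classify(metadata: Mapping[str, str], default: str = "unsorted") -> str:
    """Return the first category whose keywords appear in the metadata."""
    # Combine the fields most likely to describe the document content.
    text = " ".join(
        metadata.get(field, "") for field in ("Title", "Subject", "Keywords")
    ).lower()
    for category, keywords in RULES.items():
        if any(word in text for word in keywords):
            return category
    return default
```

A call such as `classify({"Title": "Invoice 2024-001"})` would select the `invoices` category, while a document whose metadata matches no rule falls back to the default.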
Scanned paper documents are stored in an image (rather than text) format.
TIFF is a file format commonly used for storing
digital versions of paper documents because it is a standard format for most
scanners and software applications. However, the advent of Adobe Portable
Document Format (PDF) has added new dimensions and powerful capabilities to
electronic documents because Adobe PDF is more extensible than purely
image-based formats. Scanned images can be stored in PDF files along with
textual information.
The inventor of the PDF Standard, Adobe Systems, publishes new versions of PDF frequently. Each new version has enriched the format with countless new features and has updated some of the older features. It was therefore necessary to define a stable derivative of the PDF format, based on Adobe’s proprietary PDF specification, that could be internationally accepted as a Standard for long-term electronic archiving. The result: PDF/A.
ISO 19005-1 defines “a file format based on PDF, known as PDF/A, which provides a mechanism for representing electronic documents in a manner that preserves their visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files.” The Standard does not define an archiving strategy or the goals of an archiving system. It identifies a “profile” for electronic documents that ensures the documents can be reproduced in years to come.
A key element of this reproducibility is the requirement for PDF/A documents to be 100 % self-contained. All of the information necessary for displaying the document in the same manner every time is embedded in the file. This includes all visible content, such as text, raster images and vector graphics, as well as fonts and colour information. A PDF/A document is not permitted to rely on any information from direct or indirect external sources, for example links to external image files or fonts that are not embedded.
PDF in its native form cannot guarantee long-term reproducibility, or even the “WYSIWYG” (what you see is what you get) principle. Certain restrictions and amendments therefore had to be incorporated into the Standard. To be accepted, PDF/A needed to be based on an existing version of the PDF Reference and not on anticipated functionality in a future version. The ISO chose the Adobe PDF Reference 1.4, which Adobe implemented in Acrobat 5, as the basis for the Standard. The ISO Standard states that PDF/A “shall adhere to all requirements of PDF Reference as modified by this part of ISO 19005”. The Standard itself identifies only the differences with respect to the PDF Reference. To fully understand PDF/A, you must therefore also understand the PDF Reference 1.4.
Certain functionality allowed in PDF 1.4 has been specifically excluded from
PDF/A, for example transparency and sound and movie actions. There are also
elements described in the PDF Reference 1.4 that are not mandatory. PDF/A on the
other hand requires these elements to be implemented, for example embedded
fonts.
In short, PDF/A is based on the Adobe PDF Reference 1.4 (Acrobat 5), with
specific features being either mandatory, recommended, restricted, or
prohibited.
PDF/A-1 files must include:
PDF/A-1 files may not include:
PDF/A has been established as a set of standards with several parts.
Currently only PDF/A-1 (Part 1) has been approved. PDF/A-1 is further subdivided into two levels of compliance: PDF/A-1a and
PDF/A-1b.
PDF/A-1a (Level A Conformance) denotes full compliance with the currently approved PDF/A Standard, ISO 19005-1 (Part 1). In
addition to exact visual reproduction, it also requires mapping text to Unicode
and structuring of the document content.
PDF/A-1b (Level B Conformance) is a “minimal compliance” level for
PDF/A. Its requirements are meant to ensure only that the rendered visual appearance of the file is reproducible over the long term.
PDF/A-1a and PDF/A-1b differ primarily with respect to text extraction.
The difference between PDF/A-1a and -1b has no impact for scanned documents, provided the files have not been enhanced by means of OCR for
searching (PDF/A Searchable Files).
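A PDF/A file declares its part and conformance level in its XMP metadata through the `pdfaid:part` and `pdfaid:conformance` properties. The Python sketch below reads that declaration with a plain byte search. It assumes the XMP packet is stored uncompressed and in an ASCII-compatible encoding such as UTF-8. The function name `claimed_pdfa_level` is hypothetical. The result is only what the file claims; confirming actual compliance requires a full PDF/A validator.

```python
import re
from typing import Optional

# XMP may store the values as elements (<pdfaid:part>1</pdfaid:part>)
# or as attributes (pdfaid:part="1"); the patterns accept both forms.
_PART = re.compile(rb'pdfaid:part(?:>|=["\'])\s*(\d)')
_CONF = re.compile(rb'pdfaid:conformance(?:>|=["\'])\s*([A-Za-z])')


def claimed_pdfa_level(data: bytes) -> Optional[str]:
    """Return the declared level, e.g. 'PDF/A-1b', or None if absent."""
    part = _PART.search(data)
    conf = _CONF.search(data)
    if part is None or conf is None:
        return None
    level = conf.group(1).decode("ascii").lower()
    return f"PDF/A-{part.group(1).decode('ascii')}{level}"
```

In practice the bytes would come from reading the file in binary mode, for example `claimed_pdfa_level(open(path, "rb").read())`, where `path` is a placeholder for an archive file.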
PDF/A-2 - Future Development of PDF/A.
A new part of the standard, ISO 19005 Part 2 (PDF/A-2), is currently being worked on by the
ISO Technical Committee. PDF/A-2 will address some of the new features added with versions 1.5, 1.6 and 1.7 of the PDF Reference. PDF/A-2 should be backwards compatible, i.e. all valid PDF/A-1 documents should also be compliant with PDF/A-2. However, PDF/A-2 compliant files will not necessarily be PDF/A-1 compliant.
PDF/A is only part of a complete archiving solution. PDF/A alone does not guarantee long-term archiving and it does not guarantee that information will be displayed as desired. PDF/A also does not claim that a PDF/A-based archive is always the best solution. However, if you decide to use PDF, then PDF/A defines a set of requirements that make long-term archiving possible.
Other aspects that must be taken into account when implementing a PDF/A-compliant archive include, for example, corporate standards and procedures, reliable data sources, reliable fonts, quality management and special individual requirements. The migration of current
paper- or TIFF-based archives to PDF/A-compliant archives is a significant task and must be well planned.
Both Microsoft (Office 2007) and OpenOffice (from release 2.4) are adding PDF/A export to their office software.