PDF

File Format
Name	PDF
Ontology	Electronic File Formats Document PDF
Extension(s)	.pdf
MIME Type(s)	application/pdf
LoCFDD	fdd000030, others
PRONOM	fmt/276, others
Wikidata ID	Q42332

Latest revision as of 14:09, 12 November 2024

Portable Document Format (PDF) is a document file format originally from Adobe, based on PostScript. It has many subsets.

As well as the 'full function' ISO 32000-1:2008 (or PDF 1.7), there are also PDF/X, PDF/A, PDF/E, PDF/VT and PDF/UA, all of which are ISO specifications.

PDF profiles (formalized subsets) include the following:

PDF/A (optimized for preservation)
- PDF/A-1 (ISO 19005-1:2005)
- PDF/A-2 (ISO 19005-2:2011)
- PDF/A-3 (ISO 19005-3:2012) (extends PDF/A-2 by allowing embedded files of any type)
- PDF/A-4 (ISO 19005-4:2020)
PDF/E (ISO 24517-1:2008) (for engineering workflows)
PDF/UA (ISO 14289-1) (making documents accessible through assistive technologies)
PDF/VT (ISO 16612-2) (support for variable document printing)
PDF/X (support for prepress graphics exchange)
- PDF/X-1 (ISO 15930-1:2001)
- PDF/X-1a (ISO 15930-4:2003)
- PDF/X-2 (ISO 15930-5:2003)
- PDF/X-3 (ISO 15930-6:2003)
Tagged PDF

Some scanner documentation references an apparently fictitious "PDF/L" profile (see Gary McGath's "PDF/L?").

A PDF 2.0 spec (ISO 32000-2) was published in July 2017, with some new features as well as clarification of conformance with existing features.

A PDF/raster draft spec was issued in 2017 as a subset of PDF files containing raster images of scanned documents.

Identifiers

Format	PRONOM	LoCFDD
PDF		fdd000030
PDF 1.0	fmt/14	fdd000316
PDF 1.1	fmt/15	fdd000316
PDF 1.2	fmt/16	fdd000316
PDF 1.3	fmt/17	fdd000316
PDF 1.4	fmt/18	fdd000122
PDF 1.5	fmt/19	fdd000123
PDF 1.6	fmt/20	fdd000276
PDF 1.7	fmt/276	fdd000277
PDF 1.7, Ext. 3		fdd000313
PDF 2.0	fmt/1129
PDF/A		fdd000318
PDF/A-1		fdd000125
PDF/A-1a	fmt/95	fdd000251
PDF/A-1b	fmt/354	fdd000252
PDF/A-2		fdd000319
PDF/A-2a	fmt/476	fdd000320
PDF/A-2b	fmt/477	fdd000322
PDF/A-2u	fmt/478	fdd000321
PDF/A-3a	fmt/479	fdd000360
PDF/A-3b	fmt/480
PDF/A-3u	fmt/481
PDF/A-4		fdd000532
PDF/X-1	fmt/144, fmt/145	fdd000124
PDF/X-1a	fmt/157, fmt/146	fdd000124
PDF/X-2	fmt/147	fdd000124
PDF/X-3	fmt/158, fmt/148	fdd000124
PDF/X-4	fmt/488	fdd000124
PDF/X-4p	fmt/489	fdd000124
PDF/X-5g	fmt/490	fdd000124
PDF/X-5pg	fmt/491	fdd000124
PDF/X-5n	fmt/492	fdd000124
PDF/UA-1		fdd000350
PDF/E-1	fmt/493
PDF, Geospatial		fdd000315
GeoPDF 2.2		fdd000312
PDF Portfolio	fmt/1451

Identification

The majority of PDF files can be identified by a fixed header such as "%PDF-1.4"; older documents, however, show a number of variations:

Some can start with "%!PS-Adobe-N.n PDF-M.m" instead, as described here.
After PDF 1.7, Adobe stopped changing the major and minor version numbers; its next public release after 1.7 was "1.7 Adobe Extension Level 3".
For the PDF/A family of formats, conformance is declared in an embedded XMP metadata fragment.
Some older files from Mac OS may be wrapped in the AppleSingle/AppleDouble formats. This is a general issue and should perhaps be documented elsewhere. For more information, see:
- http://en.wikipedia.org/wiki/AppleSingle_and_AppleDouble_formats
- http://tools.ietf.org/rfc/rfc1740.txt
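
The header variations and the PDF/A declaration described above can be checked with a few lines of Python. This is only a rough sketch: the function names are made up for illustration, the 1024-byte search window is an assumption about how lenient a reader should be, and the PDF/A check assumes the XMP packet is stored uncompressed and in a single-byte-compatible encoding such as UTF-8. It reports what a file claims, not whether the file actually conforms.

```python
import re

# Matches "%PDF-M.m" as well as the older "%!PS-Adobe-N.n PDF-M.m" form.
_HEADER = re.compile(rb"%(?:!PS-Adobe-\d\.\d )?PDF-(\d)\.(\d)")

# PDF/A identification properties in XMP, written either as XML attributes
# (pdfaid:part="1") or as child elements (<pdfaid:part>1</pdfaid:part>).
_PDFA_PART = re.compile(rb"""pdfaid:part(?:=["']|>)(\d)""")
_PDFA_CONF = re.compile(rb"""pdfaid:conformance(?:=["']|>)([A-Za-z])""")


def sniff_pdf_version(data: bytes, window: int = 1024):
    """Return (major, minor) from the header, or None if no header is found."""
    match = _HEADER.search(data[:window])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def sniff_pdfa_claim(data: bytes):
    """Return a label such as 'PDF/A-1b' if the XMP declares one, else None."""
    part = _PDFA_PART.search(data)
    if part is None:
        return None
    conf = _PDFA_CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return "PDF/A-" + part.group(1).decode("ascii") + level
```

In practice `data` would be the bytes of the file, for example `open(path, "rb").read()`. A file wrapped in AppleSingle/AppleDouble may place the header beyond the search window, in which case the first function returns None.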

Compression

Images in PDF documents may use the following compression schemes:

LZW
Flate (zlib)
RunLength
CCITTFax (CCITT Group 3 and CCITT Group 4)
JBIG2
DCT (JPEG)
JPX (part of the JPEG 2000 standard)
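
Each scheme in the list above corresponds to a filter name in a stream dictionary (for example /DCTDecode for JPEG data). A rough way to see which ones a file mentions is to scan its raw bytes for those names. The sketch below is an approximation under stated assumptions: the function name is illustrative, the count covers every stream rather than images only, and names hidden inside compressed object streams are missed, so a real PDF parser is needed for reliable results.

```python
import re
from collections import Counter

# Filter names as written in stream dictionaries, mapped to the schemes above.
FILTERS = {
    b"LZWDecode": "LZW",
    b"FlateDecode": "Flate (zlib)",
    b"RunLengthDecode": "RunLength",
    b"CCITTFaxDecode": "CCITTFax",
    b"JBIG2Decode": "JBIG2",
    b"DCTDecode": "DCT (JPEG)",
    b"JPXDecode": "JPX (JPEG 2000)",
}

# A PDF name starts with "/"; \b stops a shorter name matching inside a longer one.
_FILTER_NAME = re.compile(rb"/(" + b"|".join(FILTERS) + rb")\b")


def count_filters(data: bytes) -> Counter:
    """Count how often each known filter name occurs in the raw file bytes."""
    return Counter(FILTERS[m.group(1)] for m in _FILTER_NAME.finditer(data))
```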

Digital Rights Management & Encryption

PDF has two types of 'encryption': a 'user' password limits the ability to open the document, and a 'creator' (owner) password limits other rights such as printing and copying. The former case, where a password is required to open the file, is the main preservation concern, as our users will not be able to open a PDF encrypted in this way (unless the password can be cracked, which may be problematic both technically and legally). The latter case also causes problems, because the PDF is still encrypted, but with a special, known user password of "" (an empty string, which is not the same as no password). The document is therefore encrypted in both cases, and the only way to tell them apart is to attempt to decrypt the PDF using the default password "". Some PDF analysis tools (notably JHOVE) do not implement the relevant decryption workflow, and so cannot distinguish between the two types of encryption.

An example of the decryption test workflow can be found here: https://gist.github.com/anjackson/5237071
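
Independently of the linked gist, the empty-password test can be sketched with the Python standard library alone. The sketch assumes the standard security handler at revisions 2 to 4 (the RC4 and AES-128 era); AES-256 files (revisions 5 and 6) use a different check and are not covered. It also assumes that some PDF parser has already read /O, /U, /P and /R from the Encrypt dictionary, the key length in bytes (5 for revision 2), and the first string of the trailer /ID array. All function and parameter names are illustrative.

```python
import hashlib
import struct

# 32-byte padding string used by the standard security handler.
PAD = bytes.fromhex(
    "28bf4e5e4e758a4164004e56fffa0108"
    "2e2e00b6d0683e802f0ca9fe6453697a"
)


def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4, written out so the sketch needs only the standard library."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


def file_key(password, o, p, id0, rev, key_len, encrypt_metadata=True):
    """Derive the file encryption key from a user password (revisions 2-4)."""
    md5 = hashlib.md5((password + PAD)[:32])       # pad or truncate to 32 bytes
    md5.update(o)                                  # the /O entry
    md5.update(struct.pack("<I", p & 0xFFFFFFFF))  # /P as unsigned little-endian
    md5.update(id0)                                # first element of /ID
    if rev >= 4 and not encrypt_metadata:
        md5.update(b"\xff\xff\xff\xff")
    digest = md5.digest()
    if rev >= 3:
        for _ in range(50):
            digest = hashlib.md5(digest[:key_len]).digest()
    return digest[:key_len]


def expected_u(password, o, p, id0, rev, key_len, encrypt_metadata=True):
    """Compute the /U value that a given user password would produce."""
    key = file_key(password, o, p, id0, rev, key_len, encrypt_metadata)
    if rev == 2:
        return rc4(key, PAD)
    block = rc4(key, hashlib.md5(PAD + id0).digest())
    for i in range(1, 20):
        block = rc4(bytes(b ^ i for b in key), block)
    return block  # 16 significant bytes; the stored /U has 16 more of padding


def opens_with_empty_password(o, u, p, id0, rev, key_len, encrypt_metadata=True):
    """True if the file is encrypted with the default user password ""."""
    want = expected_u(b"", o, p, id0, rev, key_len, encrypt_metadata)
    if rev == 2:
        return u == want
    return u[:16] == want[:16]
```

If `opens_with_empty_password` returns True, only the rights restrictions apply and the content can be read; if it returns False, a real user password is needed to open the file.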

Some of the most locked-up PDFs anywhere can be found at the ANSI IBR Standards Portal, which has made certain standards documents that are incorporated into legislation available for browsing, but only through a convoluted procedure: downloading a special plug-in and filling out a registration form that must be completed again in every browsing session.

A "Protected PDF" (PPDF) format is reportedly used by Microsoft's Azure Rights Management Service for sharing files securely within a workgroup.

Document redaction

Attempts by technically inept users to obscure content in PDF files occasionally make the news. Some people mistakenly believe that overlaying a section of text with a solid black shape, or setting it to white-on-white, before a document is distributed makes the redacted sections unavailable. It does not: text obscured in these ways is easy to recover, often simply by dragging a mouse over it to highlight it. This happened in a 2018 Florida case connected with the school shooting there: parts of the school district's report about the shooter were badly redacted, and a local newspaper disclosed them. A judge then threatened to punish the paper and to impose prior restraint on its future publications because of this "hacking", raising all sorts of legal and constitutional issues.
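
The reason such redaction fails can be seen in a page's content stream: drawing a black rectangle adds painting operators but leaves the text-showing operators in place. The sketch below uses a made-up, uncompressed content stream and an illustrative function name. Real content streams are usually compressed and may use hex strings, TJ arrays or font-specific encodings, none of which this handles.

```python
import re

# Literal strings shown with the Tj operator, e.g. "(Some text) Tj".
# Escaped or nested parentheses are deliberately not handled.
_SHOW_TEXT = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def shown_text_strings(content_stream: bytes) -> list:
    """Return every literal string that the stream paints with Tj."""
    return [m.group(1).decode("latin-1") for m in _SHOW_TEXT.finditer(content_stream)]


# Hypothetical page content: a line of text, then a black box drawn over it.
page = (
    b"BT /F1 12 Tf 72 720 Td (Witness: Jane Doe) Tj ET\n"
    b"0 0 0 rg 70 716 130 14 re f\n"
)

print(shown_text_strings(page))  # prints ['Witness: Jane Doe']
```

The rectangle only changes what is painted on top; proper redaction has to remove the text operators themselves.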

Web linking

When linked on the Web, specific pages of a PDF can be referenced by appending #page=N (where N is the desired page number) as a fragment identifier at the end of the URL. This is a little-known fact.
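
As a small illustration, the fragment can be attached with Python's standard URL functions. The helper name and the example URL are made up, and whether the link actually jumps to the page depends on the viewer that opens it.

```python
from urllib.parse import urlsplit, urlunsplit


def pdf_page_url(url: str, page: int) -> str:
    """Return url with its fragment set to #page=N (pages count from 1)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    parts = urlsplit(url)
    # Replace any existing fragment rather than appending a second one.
    return urlunsplit(parts._replace(fragment=f"page={page}"))


print(pdf_page_url("https://example.org/report.pdf", 12))
# prints https://example.org/report.pdf#page=12
```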

Specifications

Adobe PDF References (archived) Contains links to every version of the PDF Reference published by Adobe (starting with PDF 1.0) as well as associated errata, addenda and tech notes.
PDF Specification Archive
Other sources of the above documents:
- PDF Reference and Adobe Extensions to the PDF Specification: Adobe page linking to the specification for PDF 1.7 (equivalent to ISO 32000-1:2008) and two Adobe extensions that were expected to be incorporated into ISO 32000-2. These extensions include support for geospatial features and for 3-D content using U3D and PRC formats.
- Adobe PDF Reference Archives. Archive of specifications for earlier Adobe versions of PDF, starting with Version 1.3.
ISO 32000-1:2008: PDF 1.7 (not free to download)
ISO 32000-2:2017: PDF 2.0 (not free to download)
Draft PDF/raster spec 1.0

Software

Adobe Reader views PDF files, either as a standalone program or a browser plugin.
Firefox 19.0 includes a built-in PDF reader.
Tabula: convert tabular data in PDFs to CSV
mPDF: convert HTML to PDF
MuPDF PDF viewer and command line mutool for manipulating PDF
peepdf: powerful Python tool to analyze and explore PDFs
PDF24 creator
Apache PDFBox is an open-source PDF library that includes a PDF/A validator
pdfium: Open source PDF rendering engine
Textract: extract text from various document formats including PDF
pdf2svg (in JavaScript)
Programming with PDFMiner
PDFBox PDF/A Validator
PyPDF2
Sumatra PDF Reader
PDF viewer for Chrome
veraPDF library (PDF validator)
PDFx - Extract metadata and URLs from PDFs, and download all referenced PDFs
Caradoc: PDF parser and validator
PDBF: Create documents that are simultaneously valid PDF, HTML, and VirtualBox OVA.
PDF Tools
PDF-XChange Viewer
The PDF Toolkit PDFTK
Ghostscript (can both read and write PDF)

Online utilities

PDF to Kindle converter
PDF to Excel (and some other formats)
I Love PDF: miscellaneous utilities

Sample files

PDF Cabinet of Horrors - sample PDF files in corrupted or otherwise problematic formats
Adobe PDF Test Suites - various PDF test suites on Adobe Acrobat Engineering site
Homeland by Cory Doctorow
Sample document saved from Windows Word 2007
Quine PDF; contains its own TeX source
Newsletter designed to work as PDF, ZIP, or shell script
veraPDF corpus
Horrifying PDF Experiments
Test PDFs used by Mozilla PDF Reader
PDF 2.0 example files by the PDF Association
dexvert samples — document/pdf

See also

Ascii85
FDF
KFP Preflight Profile
PostScript
WWF
XFDF

Links

Format info

Portable Document Format (Wikipedia)
Forensics Wiki: PDF
Adobe Acrobat Engineering site - Dedicated Adobe site with lots of technical information, including a history of PDF and Acrobat, conforming viewers and test files.
PDF/A in a Nutshell 2.0 – online edition
Inside the PDF File Format
PDF101 an Adobe document walkthrough

Validation

PDF Validation: Dream or Yawn? - Presentation on possibilities of an open-source PDF validator
The pitfalls of protocol design: Attempting to write a formally verified PDF parser
New open-source file validation project

Jailbreaking

Commentary

The Network is the Format: PDF and the Long-term Use of Digital Content Article by Sheila Morrissey of ITHAKA on the challenges of preserving PDF files based on experience. She illustrates the challenge of defining a "sufficient sub-graph of the network of information about a digital object, for effective future use."
The PDF’s Place in a History of Paper Knowledge: An Interview with Lisa Gitelman
Portable Document Format on OPF File Format Risk Registry - Lists various long-term accessibility issues in PDF and how to detect them using Apache Preflight.
Adobe Portable Document Format - Inventory of long-term preservation risks - Report by KB/ National Library of the Netherlands.
The uses and abuses of PDF
Apple’s Preview: Still not safe for work
Preserving the Grey Literature Explosion: PDF/A and the Digital Archive
Ensuring long-term access: PDF validation with JHOVE?
Researchers: it's time to ditch the PDF
PDF Format Preservation Assessment (British Library)
What will PDF 2.0 bring?
The Benefits and Risks of the PDF/A-3 file format for archival institutions
Becoming of Age: PDF (comic)
What does "support PDF" really mean?
PDF/A as a preferred, sustainable format for spreadsheets?
What's so hard about PDF text extraction?
Perfecting PDF lexical analysis

Miscellaneous

Let's write a PDF file
Recovering a ransomed PDF
PDF version numbers based on deprecated mechanism
Figuring out the PDF version is harder than you think
Slides and video recordings of the PDF Days Europe 2017
5 cheaper alternatives to Acrobat for PDF editing
PDF/raster site
Hunter Biden’s “email” and the potential for deepfakes with PDF
PDF processing and analysis with open-source tools
PDF cannot be tokenized

Categories: Page description languages, Adobe
