EPUB

From Just Solve the File Format Problem
 
|subcat=Electronic Publishing formats
|extensions={{ext|epub}}
|mimetypes={{mimetype|application/epub+zip}}
|pronom={{PRONOM|fmt/483}}
|locfdd={{LoCFDD|fdd000310}}
|wikidata={{wikidata|Q27196933}}
}}
  
 
The intent of ePub is to serve both as a source file format and an end user format. For this reason the files are collected into a container for easy dissemination and use. This container is generally a ZIP file, with the extension renamed to .epub. It has one special requirement: it must include an uncompressed file named mimetype, while the rest of the data in the file may be compressed. An ePub reader should be capable of reading the content in its compressed form. <sup>http://wiki.mobileread.com/wiki/EPUB</sup>

This is currently the main format used by the Amazon Kindle, replacing earlier use of [[MOBI]] or its variant [[AZW]] / AZW3.
  
 
== Version 2 ==

[http://www.idpf.org/specs.htm The IDPF specification page] contains the specifications for this format. In particular, check the version 2.0.1 OPS and OPF specifications and the version 1.01 OCF specifications. The informational documents are also quite useful in understanding the standard's intent and content.

An alternative XML syntax called DTBook was available as an option in this version, but it was removed in EPUB 3.
 
=== References ===
 
 
ePub version 3 is the newest version of the standard and has now been recommended by the IDPF standards committee.

In version 2.0.1 there were three defining documents: the [[OPF]] (Open Packaging Format), the [[OCF]] (Open Container Format), and the [[OPS]] (Open Publication Structure). The OPS referenced a [[Daisy]] standard for the [[NCX]] file. The new 3.0 standard has four defining documents with new names. The OPF becomes the [[ePub Publications standard]]. The OCF keeps its name, and the OPS received the most changes to become the [[ePub Content Documents]]. This document now also defines the navigation document that supersedes the old NCX. A fourth document is concerned with [[Media Overlays]], a new feature of ePub version 3. <sup>http://wiki.mobileread.com/wiki/EPub_3</sup>
  
 
=== References ===
* http://wiki.mobileread.com/wiki/EPub_3
* http://idpf.org/epub/30/spec/epub30-overview.html
* http://www.idpf.org/epub/31/spec/epub-spec.html
== Digital Rights Management & Encryption ==
When preserving ePub files, it is important to know what, if any, rights restrictions and encryption are present, so that action can be taken to ensure the content can still be accessed in the future. Unfortunately, there are currently no reliable tools we can use to automate this analysis. A general introduction to copy protection and ePub can be found [http://idpf.org/epub-content-protection here].

Adobe [http://www.techdirt.com/articles/20140204/07381226084/adobe-releases-new-drm-ebooks-plans-to-screw-over-anyone-using-old-drm.shtml attracted controversy] early in 2014 by announcing a "new, improved" [[DRM]] scheme for their version of EPUB files, which has the "feature" of being incompatible with the many e-readers that support their ''old'' DRM.
=== How it works ===

According to section 3.2 ("[http://idpf.org/epub/30/spec/epub30-ocf.html#sec-zip-container-zipreqs OCF ZIP Container]") of the EPUB 3 spec (section 4 in the 2.0.1 spec):
<blockquote>
"Conforming OCF ZIP Containers MUST NOT use the encryption features defined by the ZIP format..."
</blockquote>
i.e. the ZIP container of a valid EPUB can always be opened and its entries listed regardless of the DRM scheme in use, although individual resources inside it may still be encrypted.

Section 2.5.5 ("[http://idpf.org/epub/30/spec/epub30-ocf.html#sec-container-metainf-rights.xml Rights Management]"; section 3.5.6 in the 2.0.1 spec) states that:
<blockquote>
"An OPTIONAL file with the name “rights.xml” within the “META-INF” directory at the root level of the container file system is a reserved name in a valid OCF container.
"The rights.xml file MUST NOT be encrypted.
"When the rights.xml file is not present, the OCF container provides no information indicating any part of the container is rights governed."
</blockquote>
i.e. the presence of the META-INF/rights.xml file is an ''indicator'' that DRM is likely in use (however, its absence is apparently not an indication of the contrary).

Section 2.5.2 ("[http://idpf.org/epub/30/spec/epub30-ocf.html#sec-container-metainf-encryption.xml Encryption]"; section 3.5.5 in the 2.0.1 spec) states that:
<blockquote>
"...if any resource within the container is encrypted, “encryption.xml” MUST be present to indicate that the resource is encrypted and provide information on how it is encrypted."
</blockquote>
Consequently, the existence of this file indicates that one or more encrypted files exist within the EPUB. However, as the rights information is essentially decoupled from the encryption scheme, a resource can be any of the following:

* Encrypted but not strictly governed by rights management.
* Governed by rights management but not encrypted.
* Both encrypted and governed by rights management.

In the latter two cases, however, the rights governance may not be clear from the EPUB structure.

Furthermore, note that not all encryption schemes are 'bad' in the sense of requiring individual private key information; some are mere 'obfuscation' (some samples with obfuscated fonts can be found here: http://idpf.github.io/epub3-samples/samples.html). Any reliable risk identification tool will have to take this into account.
=== Digital watermarks ===

Commercial e-books often have embedded "watermarks" that allow a particular copy to be traced to the person who originally purchased it. Several schemes have been used, most controversially ones that actually alter the text of the book to distinguish copies (possibly harming their literary quality in the process). Other schemes simply involve embedded numbers or images in places where they aren't very noticeable. One variety uses [[bar codes]] embedded in '''data:''' [[URL]]s.

* [http://koen.io/2013/09/what-an-e-book-watermark-looks-like/ What an e-book watermark looks like (external link)]
== Software ==

* [http://sourceforge.net/projects/crengine/ Cool Reader]
* [http://azardi.infogridpacific.com/ Azardi ePub3 Reader]
* [http://readium.org/ Readium] - Open source library for reading ePub versions 2 and 3 (including Google Chrome Web App)
* [https://github.com/IDPF/epubcheck ePubCheck] - ePub validator (supports versions 2 and 3)
* [https://github.com/titusz/epubcheck Python wrappers for EpubCheck]
* [http://sourceforge.net/projects/libepubgen/ libepubgen: EPUB generator for librevenge framework]
* [http://johnmacfarlane.net/pandoc/ Pandoc: Document format conversion swiss-army knife]
* [https://github.com/user-none/Sigil Sigil: multi-platform EPUB editor]
* [https://github.com/aerkalov/ebooklib Ebooklib: Python library for reading/writing EPUB, including EPUB 3]
* [https://calibre-ebook.com/ calibre]
== Online utilities ==

* [http://validator.idpf.org/ Online ePub validator] (based on ePubCheck)
* [http://ebookflightdeck.com/ EPUB validator / best-practice checker (requires registration to access)]
* [http://www.epubtest.org/ epubtest: Online EPUB resources]
== Sample files ==

* [https://github.com/IDPF/epub3-samples ePub 3 samples] - intended to demonstrate features of ePub 3
* [http://azardi.infogridpacific.com/resources.html Azardi ePub 3 samples]
* [http://apex.infogridpacific.com/dcp/flo-test-books.html Azardi Fixed Layout ePub 3 samples]
* [http://apex.infogridpacific.com/dcp/gdm.html Guy de Maupassant Short Stories in 5 formats] - includes ePub 3
* [https://github.com/mgylling/epub-testsuite ePub testsuite] - Apparently by IDPF, under construction and so far without documentation (status October 2013)
* [http://www.hindawi.com/epub/ Hindawi sample articles] (publisher of [http://www.hindawi.com/journals/ open-access scientific journals, most of which] offer content in ePub format)
* [http://www.jneuroinflammation.com/ Journal of Neuroinflammation] - Open access scientific journal that offers content in ePub format
* [http://journals.lww.com/pages/results.aspx?txtKeywords=epub ePubs from Lippincott Williams & Wilkins journals]
* [http://craphound.com/homeland/Cory_Doctorow_-_Homeland.epub Homeland, by Cory Doctorow]
* [https://github.com/IDPF/epub-testsuite EPub test suite] by IDPF, includes scripts for making one's own sample files
* [https://threepress.googlecode.com/svn-history/r583/branches/bookworm-caching/library/test-data/data/hauy.epub Example file of DTBook variant of EPUB 2]
* [https://github.com/bitsgalore/epubPolicyTests ePub KB policy testing examples]
* {{DexvertSamples|document/epub}}
== Links ==

* [http://www.openplanetsfoundation.org/blogs/2012-06-18-epub-archival-preservation EPUB for archival preservation] - Blog post with link to report (2012) by KB / National Library of the Netherlands
* [http://www.openplanetsfoundation.org/blogs/2013-05-23-epub-archival-preservation-update EPUB for archival preservation: an update] - Update (2013) to KB report
* [https://gist.github.com/bitsgalore/da04413787931d20a8bf How to package an epub file using InfoZip]
* [https://researchkb.wordpress.com/2015/03/13/policy-based-assessment-of-epub-with-epubcheck/ Policy-based assessment of EPUB with Epubcheck]
* [http://epubsecrets.com/the-best-epub-reader-for-windows.php The Best EPUB reader for Windows?]
* [http://w3c.github.io/epubweb/ Advancing Portable Documents for the Open Web Platform: EPUB-WEB (W3C White Paper)]
* [http://w3c.github.io/dpub/idpf-digital-book-2015/index.html The Convergence of EPUB and the Web (Presentation)]
* [http://wiki.dpconline.org/images/e/e4/EPUB_Assessment_v1.1.pdf EPUB Format Preservation Assessment]
* [https://fileformats.wordpress.com/2016/02/05/epub-3-1/ “Radical” changes in EPUB 3.1]
* [http://blog.kbresearch.nl/2016/03/10/the-future-of-epub-a-first-look-at-the-epub-3-1-editors-draft/ The future of EPUB? A first look at the EPUB 3.1 Editor’s draft]
* [http://www.albertopettarin.it/blog/2015/02/21/current-fixed-layout-ebooks-considered-harmful.html (Current) Fixed Layout eBooks Considered Harmful]
* [https://www.loc.gov/preservation/digital/formats/fdd/fdd000311.shtml Library of Congress preservation status: EPUB 3.0.1]
* [https://www.loc.gov/preservation/digital/formats/fdd/fdd000519.shtml EPUB (Electronic publication) Version 3 Preservation]
[[Category:ZIP based file formats]]
[[Category:XML based file formats]]

Latest revision as of 21:44, 4 October 2024

File Format
Name EPUB
Ontology
Extension(s) .epub
MIME Type(s) application/epub+zip
LoCFDD fdd000310
PRONOM fmt/483
Wikidata ID Q27196933


Description

ePub is an open format defined by the International Digital Publishing Forum (IDPF), formerly the Open eBook Forum. It is based on XHTML and XML along with optional CSS stylesheets. Its predecessor was the OEB standard.

Quoted from the IDPF web site:

'.epub' is the file extension of an XML format for reflowable digital books and publications. '.epub' is composed of three open standards, the Open Publication Structure (OPS), Open Packaging Format (OPF) and Open Container Format (OCF), produced by the IDPF. '.epub' allows publishers to produce and send a single digital publication file through distribution and offers consumers interoperability between software/hardware for unencrypted reflowable digital books and other publications. The Open eBook Publication Structure or 'OEB', originally produced in 1999, is the precursor to OPS.

The intent of ePub is to serve both as a source file format and an end user format. For this reason the files are collected into a container for easy dissemination and use. This container is generally a ZIP file, with the extension renamed to .epub. It has one special requirement: it must include an uncompressed file named mimetype, while the rest of the data in the file may be compressed. An ePub reader should be capable of reading the content in its compressed form. http://wiki.mobileread.com/wiki/EPUB
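The container requirement can be checked with Python's standard zipfile module. The following is a minimal sketch rather than a validator: the function names and the demo builder are illustrative assumptions, and the check that mimetype is the first archive entry comes from the OCF specification, not from the description above.

```python
import io
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def check_mimetype_entry(epub_file):
    """Return a list of problems with the container's mimetype entry (empty if none)."""
    problems = []
    with zipfile.ZipFile(epub_file) as zf:
        entries = zf.infolist()
        # OCF expects 'mimetype' to be the very first entry in the archive.
        if not entries or entries[0].filename != "mimetype":
            return ["first entry in the archive is not 'mimetype'"]
        first = entries[0]
        # The entry must be stored, i.e. written without compression.
        if first.compress_type != zipfile.ZIP_STORED:
            problems.append("'mimetype' is compressed")
        if zf.read(first) != EPUB_MIMETYPE:
            problems.append("'mimetype' does not contain application/epub+zip")
    return problems


def build_demo_container(compress_mimetype=False):
    """Build a tiny in-memory container for demonstration; not a complete EPUB."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        method = zipfile.ZIP_DEFLATED if compress_mimetype else zipfile.ZIP_STORED
        zf.writestr("mimetype", EPUB_MIMETYPE, compress_type=method)
        zf.writestr("META-INF/container.xml", "<container/>",
                    compress_type=zipfile.ZIP_DEFLATED)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(check_mimetype_entry(build_demo_container()))
    print(check_mimetype_entry(build_demo_container(compress_mimetype=True)))
```

For a real file, pass its path instead of the in-memory buffer, for example check_mimetype_entry("book.epub"). A full conformance check is better left to ePubCheck.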

This is currently the main format used by the Amazon Kindle, replacing earlier use of MOBI or its variant AZW / AZW3.

Version 2

The IDPF specification page contains the specifications for this format. In particular, check the version 2.0.1 OPS and OPF specifications and the version 1.01 OCF specifications. The informational documents are also quite useful in understanding the standard's intent and content.

An alternative XML syntax called DTBook was available as an option in this version, but this was removed in EPUB 3.

Version 3

ePub version 3 is the newest version of the standard and has now been recommended by the IDPF standards committee.

In version 2.0.1 there were three defining documents: the OPF (Open Packaging Format), the OCF (Open Container Format), and the OPS (Open Publication Structure). The OPS referenced a Daisy standard for the NCX file. The new 3.0 standard has four defining documents with new names. The OPF becomes the ePub Publications standard. The OCF keeps its name, and the OPS received the most changes to become the ePub Content Documents. This document now also defines the navigation document that supersedes the old NCX. A fourth document is concerned with Media Overlays, a new feature of ePub version 3. http://wiki.mobileread.com/wiki/EPub_3

Digital Rights Management & Encryption

When preserving ePub files, it is important to know what, if any, rights restrictions and encryption are present, so that action can be taken to ensure the content can still be accessed in the future. Unfortunately, there are currently no reliable tools we can use to automate this analysis. A general introduction to copy protection and ePub can be found at http://idpf.org/epub-content-protection.

Adobe attracted controversy early in 2014 by announcing a "new, improved" DRM scheme for their version of EPUB files, which has the "feature" of being incompatible with the many e-readers that support their old DRM.

How it works

According to section 3.2 ("OCF ZIP Container") of the EPUB 3 spec (section 4 in the 2.0.1 spec):

"Conforming OCF ZIP Containers MUST NOT use the encryption features defined by the ZIP format..."

i.e. the ZIP container of a valid EPUB can always be opened and its entries listed regardless of the DRM scheme in use, although individual resources inside it may still be encrypted.
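Whether a container breaks this rule can be read from the ZIP metadata: bit 0 of each entry's general purpose bit flag is set when ZIP-level encryption is used. The sketch below uses Python's standard zipfile module; the function name is an illustrative assumption.

```python
import io
import zipfile


def zip_encrypted_entries(epub_file):
    """Return the names of entries that use ZIP-level encryption.

    Bit 0 of the general purpose bit flag marks an encrypted entry;
    a conforming OCF ZIP container should yield an empty list.
    """
    with zipfile.ZipFile(epub_file) as zf:
        return [info.filename for info in zf.infolist() if info.flag_bits & 0x1]


if __name__ == "__main__":
    # Demonstration on a small, unencrypted in-memory archive.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("OEBPS/chapter1.xhtml", "<html/>")
    buf.seek(0)
    print(zip_encrypted_entries(buf))
```

An empty result only says that the ZIP layer is unencrypted. Resources encrypted at the EPUB level are declared in META-INF/encryption.xml instead.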


Section 2.5.5 ("Rights Management"; section 3.5.6 in the 2.0.1 spec) states that:

"An OPTIONAL file with the name “rights.xml” within the “META-INF” directory at the root level of the container file system is a reserved name in a valid OCF container.
"The rights.xml file MUST NOT be encrypted.
"When the rights.xml file is not present, the OCF container provides no information indicating any part of the container is rights governed."

i.e. the presence of the META-INF/rights.xml file is an indicator that DRM is likely in use (however, its absence is apparently not an indication of the contrary).


Section 2.5.2 ("Encryption"; section 3.5.5 in the 2.0.1 spec) states that:

"...if any resource within the container is encrypted, “encryption.xml” MUST be present to indicate that the resource is encrypted and provide information on how it is encrypted."

Consequently, the existence of this file indicates that one or more encrypted files exist within the EPUB. However, as the rights information is essentially decoupled from the encryption scheme, a resource can be any of the following:

  • Encrypted but not strictly governed by rights management.
  • Governed by rights management but not encrypted.
  • Both encrypted and governed by rights management.

In the latter two cases, however, the rights governance may not be clear from the EPUB structure.

Furthermore, note that not all encryption schemes are 'bad' in the sense of requiring individual private key information; some are mere 'obfuscation' (some samples with obfuscated fonts can be found here: http://idpf.github.io/epub3-samples/samples.html). Any reliable risk identification tool will have to take this into account.
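The indicators described in this section can be collected with a short script. The following Python sketch is a heuristic, not the reliable tool the section asks for. It reports whether rights.xml and encryption.xml are present and sorts the resources listed in encryption.xml by algorithm. The function name, the report layout, and the demo data are illustrative assumptions, and the set of obfuscation identifiers is assumed to be incomplete.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

XMLENC_NS = "http://www.w3.org/2001/04/xmlenc#"

# Algorithm identifiers used for font obfuscation (assumed, not exhaustive).
OBFUSCATION_ALGORITHMS = {
    "http://www.idpf.org/2008/embedding",  # IDPF font obfuscation
    "http://ns.adobe.com/pdf/enc#RC",      # Adobe font obfuscation
}


def drm_indicators(epub_file):
    """Collect DRM and encryption indicators from an EPUB container."""
    with zipfile.ZipFile(epub_file) as zf:
        names = set(zf.namelist())
        report = {
            "rights_xml": "META-INF/rights.xml" in names,
            "encryption_xml": "META-INF/encryption.xml" in names,
            "obfuscated": [],  # resources using a known obfuscation algorithm
            "encrypted": [],   # resources using any other algorithm
        }
        if report["encryption_xml"]:
            root = ET.fromstring(zf.read("META-INF/encryption.xml"))
            for data in root.iter(f"{{{XMLENC_NS}}}EncryptedData"):
                method = data.find(f"{{{XMLENC_NS}}}EncryptionMethod")
                ref = data.find(
                    f"{{{XMLENC_NS}}}CipherData/{{{XMLENC_NS}}}CipherReference")
                algorithm = method.get("Algorithm") if method is not None else None
                uri = ref.get("URI") if ref is not None else None
                key = "obfuscated" if algorithm in OBFUSCATION_ALGORITHMS else "encrypted"
                report[key].append(uri)
    return report


# Hypothetical encryption.xml used only for the demonstration below.
DEMO_ENCRYPTION_XML = """<?xml version="1.0"?>
<encryption xmlns="urn:oasis:names:tc:opendocument:xmlns:container"
            xmlns:enc="http://www.w3.org/2001/04/xmlenc#">
  <enc:EncryptedData>
    <enc:EncryptionMethod Algorithm="http://www.idpf.org/2008/embedding"/>
    <enc:CipherData><enc:CipherReference URI="OEBPS/fonts/demo.otf"/></enc:CipherData>
  </enc:EncryptedData>
  <enc:EncryptedData>
    <enc:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#aes128-cbc"/>
    <enc:CipherData><enc:CipherReference URI="OEBPS/chapter1.xhtml"/></enc:CipherData>
  </enc:EncryptedData>
</encryption>
"""


def build_demo_container():
    """Build an in-memory container holding only the demo encryption.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/encryption.xml", DEMO_ENCRYPTION_XML)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(drm_indicators(build_demo_container()))
```

Resources in the "encrypted" list are the ones most likely to need key material to read. As noted above, the absence of rights.xml does not show that a file is free of rights restrictions.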

Digital watermarks

Commercial e-books often have embedded "watermarks" that allow a particular copy to be traced to the person who originally purchased it. Several schemes have been used, most controversially ones that actually alter the text of the book to distinguish copies (possibly harming their literary quality in the process). Other schemes simply involve embedded numbers or images in places where they aren't very noticeable. One variety uses bar codes embedded in data: URLs.
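As a starting point for spotting the data: URL variety, the content documents of an EPUB can be scanned for embedded data: URLs. The Python sketch below only counts them per file. The function name, the list of file suffixes, and the regular expression are illustrative assumptions, and a data: URL is by no means proof of a watermark, since many books embed ordinary images this way.

```python
import io
import re
import zipfile

# Matches the start of a data: URL such as "data:image/png;base64,".
DATA_URL = re.compile(rb"data:[\w.+-]+/[\w.+-]+[;,]")

# File types assumed to be worth scanning.
TEXT_SUFFIXES = (".xhtml", ".html", ".htm", ".css", ".svg")


def count_data_urls(epub_file):
    """Return {entry name: number of data: URLs} for text resources."""
    counts = {}
    with zipfile.ZipFile(epub_file) as zf:
        for name in zf.namelist():
            if name.lower().endswith(TEXT_SUFFIXES):
                found = len(DATA_URL.findall(zf.read(name)))
                if found:
                    counts[name] = found
    return counts


if __name__ == "__main__":
    # Demonstration on an in-memory archive with one embedded image.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("OEBPS/ch1.xhtml", '<img src="data:image/png;base64,AAAA"/>')
        zf.writestr("OEBPS/ch2.xhtml", "<p>No embedded data here.</p>")
    buf.seek(0)
    print(count_data_urls(buf))
```

Comparing the counts, or the embedded payloads themselves, across two purchased copies of the same title is one way to narrow down where a watermark might sit.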
