ZIP
- Not to be confused with Zip disk, an unrelated disk cartridge unit.
ZIP is one of the most popular file compression formats. It was created in 1989 as the native format of the PKZIP program, introduced by Phil Katz (with co-creator Gary Conway). Katz had earlier written PKARC, a program that was file-compatible with the then-popular ARC program and file format. The makers of ARC sued him for copyright and trademark infringement, and he lost. In response, Katz created a new file format, which rapidly overtook ARC in popularity, to a large extent because BBS sysops, then the primary users of such compression, resented the lawsuit. Many programs have been released for a variety of operating systems to compress and decompress ZIP files, and native support for the format is built into several popular operating systems.
ZIP implementations vary in their support for features in the specification from PKWARE[1], particularly features added since version 2 (1993), some of which are protected by patents and require licensing. Many implementations support no compression method other than DEFLATE, which was introduced with version 2. Extensions to the specification that have been widely adopted are long filenames, large files (using a technique known as ZIP64), and filenames in UTF-8. In 2011 work began on an interoperable subset of the latest APPNOTE.TXT, intended for publication as ISO/IEC 21320-1, Document Container File -- Part 1: Core. As of November 2012, a discussion draft is available[2]. To promote interoperable implementations, the draft ISO/IEC 21320-1 prohibits compression methods other than DEFLATE, segmented or multi-volume archives, and features that are subject to patents.
While .zip is the usual file extension, ZIP-formatted files can be found with many other extensions, since a number of other file formats are built on the ZIP container but use application-specific extensions. See Category:ZIP based file formats for a list of such formats.
Disambiguation
The term "ZIP compression" is sometimes misleadingly used to mean DEFLATE (which is by far the most common compression scheme used in ZIP files). In such cases, the compressed data format could turn out to be raw DEFLATE, or zlib, or gzip.
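The difference between these three wrappers can be illustrated with Python's standard zlib module. This is a minimal sketch, not a robust detector: the function names `deflate_variants` and `guess_wrapper` are hypothetical, and the header check for raw DEFLATE is a heuristic only, since raw DEFLATE has no signature of its own.

```python
import zlib

def deflate_variants(data):
    """Compress data three ways; only the wrapper around the DEFLATE stream differs."""
    out = {}
    # wbits selects the container: negative = raw DEFLATE, 15 = zlib, 31 = gzip.
    for name, wbits in (("raw", -15), ("zlib", 15), ("gzip", 31)):
        comp = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=wbits)
        out[name] = comp.compress(data) + comp.flush()
    return out

def guess_wrapper(blob):
    """Heuristically guess which wrapper a compressed blob uses."""
    if blob[:2] == b"\x1f\x8b":
        return "gzip"
    # A zlib header has compression method 8 in the low nibble of byte 0,
    # and the first two bytes, read big-endian, are a multiple of 31.
    if len(blob) >= 2 and blob[0] & 0x0F == 8 and ((blob[0] << 8) | blob[1]) % 31 == 0:
        return "zlib"
    return "raw DEFLATE or unknown"

variants = deflate_variants(b"hello " * 100)
for name, blob in variants.items():
    print(name, len(blob), guess_wrapper(blob))
```

A ZIP member stored with compression type 8 holds the raw form, with no zlib or gzip header; the checksum and sizes live in the ZIP structures instead.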
Identification
The byte sequence 'P' 'K' 0x05 0x06 (the "end of central directory signature") appears somewhere in the file, usually beginning exactly 22 bytes from the end of the file. However, it will appear earlier if the file contains a "ZIP file comment" (common in the BBS era, but rare today), or for various other reasons. There seems to be no theoretical limit to how far back you may have to search for the signature, but some software limits it to around 64 KB, which is the maximum length of a comment.
Most ZIP files (but not self-extracting ZIP files) happen to begin with 'P' 'K' 0x03 0x04. This is not a global file signature, but is the signature that appears once for every compressed file inside the ZIP file. Some ZIP-based formats are designed such that they necessarily begin in this way. But in general, it is even legal for a ZIP file to contain zero files, and such a ZIP file would not contain this signature at all.
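The search described above can be sketched in Python as follows. The helper names (`find_eocd`, `starts_with_local_header`, `make_zip`) are hypothetical, and the search window of 22 bytes plus a 65535-byte comment is the software-imposed limit mentioned above, not a rule of the format. A careful implementation would also validate the record it finds, since the signature bytes could occur inside a comment.

```python
import io
import zipfile

EOCD_SIG = b"PK\x05\x06"      # end of central directory signature
LOCAL_SIG = b"PK\x03\x04"     # per-member local file header signature
EOCD_MIN_SIZE = 22            # size of the record with an empty comment
MAX_COMMENT = 0xFFFF          # comment length is a 16-bit field

def find_eocd(data):
    """Return the offset of the last EOCD signature in the search window, or -1."""
    start = max(0, len(data) - (EOCD_MIN_SIZE + MAX_COMMENT))
    return data.rfind(EOCD_SIG, start)

def starts_with_local_header(data):
    """True if the file begins with a local file header (common, but not required)."""
    return data[:4] == LOCAL_SIG

def make_zip(comment=b"", members=()):
    """Build a small ZIP file in memory for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in members:
            zf.writestr(name, b"data")
        zf.comment = comment
    return buf.getvalue()

plain = make_zip(members=("a.txt",))
commented = make_zip(comment=b"hello", members=("a.txt",))
empty = make_zip()
for label, blob in (("plain", plain), ("commented", commented), ("empty", empty)):
    print(label, len(blob) - find_eocd(blob), starts_with_local_header(blob))
```

For each sample the loop prints the distance of the signature from the end of the file and whether the file begins with a local file header.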
That Phil Katz guy has thus managed to get his initials at the start of a large number of files on many millions of computers and devices, given how many file formats are based on ZIP (even if they use different extensions). He died in 2000, but this memorial to him will live on indefinitely.
Compression
Each file in a ZIP file is compressed using one of a number of compression algorithms. Only compression types 0 (uncompressed) and 8 (DEFLATE) are likely to be seen in modern portable ZIP files. In old ZIP files, types 1 (Shrink) and 6 (Implode) are common.
Code | Compression scheme | Notes and references |
---|---|---|
0 | Uncompressed | |
1 | Shrink | LZW. Used by PKZIP 0.x and 1.x. |
2–5 | Reduce | LZ77 + prediction. Used by PKZIP v0.x. See also SCRNCH. |
6 | Implode | LZ77 + Huffman. Used by PKZIP v1.x. |
7 | Tokenized | Never used? |
8 | DEFLATE | LZ77 + Huffman. Used by PKZIP v2.0+. |
9 | Deflate64, a.k.a. Enhanced Deflate | Format version 2.1+. |
10 | PKWARE DCL Implode (old IBM TERSE) | Format version 2.5+. |
12 | Bzip2 | Format version 4.6+. |
14 | LZMA (EFS) | Defined in ZIP specification v6.3+. |
16 | IBM z/OS CMPSC | |
18 | IBM TERSE (new) | |
19 | IBM LZ77 z Architecture (PFS) | |
94 | MP3 | Supported by WinZip 21+. |
95 | XZ | Supported by WinZip 18+. |
96 | JPEG variant | |
97 | WavPack | Defined in ZIP specification v6.3.2+. |
98 | PPMd version I, Rev 1 | Defined in ZIP specification v6.3+. |
99 | AES / AE-x encryption marker | |
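To see which compression codes a given ZIP file actually uses, the standard Python zipfile module exposes the code of each member as `ZipInfo.compress_type`. The sketch below is illustrative only: `METHOD_NAMES` covers just a few rows of the table above, and `list_methods` and `make_sample` are hypothetical helper names.

```python
import io
import zipfile

# A few codes from the table above; anything else is reported as "other".
METHOD_NAMES = {0: "Uncompressed", 1: "Shrink", 6: "Implode", 8: "DEFLATE",
                9: "Deflate64", 12: "Bzip2", 14: "LZMA"}

def list_methods(data):
    """Return (filename, code, name) for every member of a ZIP file held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [(info.filename, info.compress_type,
                 METHOD_NAMES.get(info.compress_type, "other"))
                for info in zf.infolist()]

def make_sample():
    """Build a ZIP file with one stored and one DEFLATE-compressed member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("stored.txt", b"abc", compress_type=zipfile.ZIP_STORED)
        zf.writestr("deflated.txt", b"abc" * 100, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()

for row in list_methods(make_sample()):
    print(row)
```

Listing members does not require the module to support their compression method; only extracting a member does.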
Extensible data fields
Each member file of a ZIP file may have one or more extensible data fields (or extra fields), containing arbitrary data. Each field is tagged with a 16-bit identifier. Extra fields are normally used for platform-specific or filesystem-specific metadata, or to work around limitations of the original ZIP format. They are not normally used for application-specific data.
Most of the extra fields in use are documented in the ZIP "APPNOTE" specification, or by the Info-ZIP software (e.g. the proginfo/extrafld.txt file in the Zip program's source distribution).
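In the APPNOTE layout, the extra data of a member is a sequence of records, each consisting of a 16-bit little-endian identifier, a 16-bit little-endian payload size, and then the payload. A minimal Python sketch of walking that sequence follows; `parse_extra` is a hypothetical name, and the sample bytes are made up for illustration rather than taken from a real file. With the standard zipfile module, the raw bytes for a member are available as `ZipInfo.extra`.

```python
import struct

def parse_extra(extra):
    """Split a member's raw extra data into (identifier, payload) pairs."""
    fields = []
    pos = 0
    while pos + 4 <= len(extra):
        # Each record starts with a 16-bit ID and a 16-bit payload size.
        header_id, size = struct.unpack_from("<HH", extra, pos)
        pos += 4
        payload = extra[pos:pos + size]
        if len(payload) < size:
            break  # truncated record; stop rather than guess
        fields.append((header_id, payload))
        pos += size
    return fields

# Made-up sample: a 5-byte record tagged 0x5455 followed by an empty 0x000a record.
sample = (struct.pack("<HH", 0x5455, 5) + b"\x01\x00\x00\x00\x00"
          + struct.pack("<HH", 0x000A, 0))
for header_id, payload in parse_extra(sample):
    print(hex(header_id), payload.hex())
```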
Known extensible data fields
ID | Owner | Description | Reference (identification) | Reference (details) |
---|---|---|---|---|
0x0001 | PKWARE | Zip64 extended information | APPNOTE | APPNOTE, Info-ZIP |
0x0007 | PKWARE | AV Info | APPNOTE | |
0x0008 | PKWARE | Reserved for extended language encoding data (PFS) | APPNOTE | |
0x0009 | PKWARE | OS/2 | APPNOTE | APPNOTE, Info-ZIP |
0x000a | PKWARE | NTFS | APPNOTE | APPNOTE, Info-ZIP |
0x000c | PKWARE | OpenVMS | APPNOTE | APPNOTE, Info-ZIP |
0x000d | PKWARE | UNIX | APPNOTE | APPNOTE, Info-ZIP |
0x000e | PKWARE | Reserved for file stream and fork descriptors | APPNOTE | |
0x000f | PKWARE | Patch Descriptor | APPNOTE | APPNOTE, Info-ZIP |
0x0014 | PKWARE | PKCS#7 Store for X.509 Certificates | APPNOTE | APPNOTE, Info-ZIP |
0x0015 | PKWARE | X.509 Certificate ID and Signature for individual file | APPNOTE | APPNOTE, Info-ZIP |
0x0016 | PKWARE | X.509 Certificate ID for Central Directory | APPNOTE | APPNOTE, Info-ZIP |
0x0017 | PKWARE | Strong Encryption Header | APPNOTE | APPNOTE |
0x0018 | PKWARE | Record Management Controls | APPNOTE | APPNOTE |
0x0019 | PKWARE | PKCS#7 Encryption Recipient Certificate List | APPNOTE | APPNOTE |
0x0020 | PKWARE | Reserved for Timestamp | APPNOTE | |
0x0021 | PKWARE | Policy Decryption Key | APPNOTE | APPNOTE |
0x0022 | PKWARE | Smartcrypt Key Provider | APPNOTE | APPNOTE |
0x0023 | PKWARE | Smartcrypt Policy Key Data | APPNOTE | APPNOTE |
0x0065 | PKWARE | MVS / IBM S/390 (Z390) attributes - uncompressed | APPNOTE | APPNOTE |
0x0065 | PKWARE | OS/400 / AS/400 (I400) attributes - uncompressed | APPNOTE | APPNOTE |
0x0066 | PKWARE | Reserved for IBM S/390 (Z390), AS/400 (I400) attributes - compressed | APPNOTE | |
0x07c8 | | Macintosh (Info-ZIP Macintosh, old) | APPNOTE | Info-ZIP |
0x2605 | | ZipIt Macintosh | APPNOTE | APPNOTE, Info-ZIP |
0x2705 | | ZipIt Macintosh 1.3.5+ (w/o full filename) | APPNOTE | APPNOTE, Info-ZIP |
0x2805 | | ZipIt Macintosh 1.3.5+ | APPNOTE | APPNOTE |
0x334d "M3" | | Info-ZIP Macintosh | APPNOTE | Info-ZIP |
0x4154 "TA" | | Tandem NSK | Info-ZIP | Info-ZIP |
0x4341 "AC" | | Acorn/SparkFS | APPNOTE | Info-ZIP |
0x4453 "SD" | | Windows NT security descriptor (binary ACL) | APPNOTE | Info-ZIP |
0x4690 | PKWARE | POSZIP 4690 (reserved) | APPNOTE | |
0x4704 | | VM/CMS | APPNOTE | Info-ZIP |
0x470f | | MVS | APPNOTE | Info-ZIP |
0x4854 "TH" | | Theos (old) | Info-ZIP | Info-ZIP |
0x4b46 "FK" | | FWKCS MD5 | APPNOTE | APPNOTE, Info-ZIP |
0x4c41 "AL" | | OS/2 access control list (text ACL) | APPNOTE | Info-ZIP |
0x4d49 "IM" | | Info-ZIP OpenVMS | APPNOTE | Info-ZIP |
0x4d63 "cM" | | Macintosh SmartZIP | Info-ZIP | Info-ZIP |
0x4f4c "LO" | | Xceed original location | APPNOTE | |
0x5350 "PS" | | (Observed in some Psion files.) | | |
0x5356 "VS" | | AOS/VS (binary ACL) | APPNOTE | Info-ZIP |
0x5455 "UT" | | Extended timestamp | APPNOTE | Info-ZIP |
0x554e "NU" | | Xceed unicode | APPNOTE | |
0x5855 "UX" | | Info-ZIP UNIX (original, also OS/2, NT, etc.) | APPNOTE | Info-ZIP |
0x6375 "uc" | | Info-ZIP Unicode Comment | APPNOTE | APPNOTE, Info-ZIP |
0x6542 "Be" | | BeOS (BeBox, PowerMac, etc.) | APPNOTE | Info-ZIP |
0x6854 "Th" | | Theos | Info-ZIP | Info-ZIP |
0x7075 "up" | | Info-ZIP Unicode Path | APPNOTE | APPNOTE, Info-ZIP |
0x7441 "At" | | AtheOS | Old Info-ZIP | Old Info-ZIP (e.g. zip v2.32 [1]) |
0x756e "nu" | | ASi UNIX | APPNOTE | Info-ZIP |
0x7855 "Ux" | | Info-ZIP Unix (previous new) | APPNOTE | Info-ZIP |
0x7875 "ux" | | Info-ZIP Unix (new) | Info-ZIP | Info-ZIP |
0xa220 | | Microsoft Open Packaging Growth Hint | APPNOTE | APPNOTE |
0xfb4a | | SMS/QDOS | Info-ZIP | Info-ZIP |
0xfd4a | | SMS/QDOS | APPNOTE | |
Specifications
- APPNOTEs - The format documentation from PKWARE is traditionally in a file named APPNOTE.TXT.
- APPNOTE from PKWARE (latest version of formal spec)
- APPNOTE Archives from PKWARE (selected versions all the way back to 1.0)
- Documentation from Info-ZIP (Includes Info-ZIP variants on APPNOTE.TXT dated from 1996 to 2004, specifications used as the basis for various open-source tools)
- LoC archive
- An early version of APPNOTE (apparently from PKZIP v1.10)
- APPNOTE.ZIP - Possibly the first v2.x APPNOTE
- APPNOTE v6.1.0, from archive.org
- Bundled with PKZIP software through v1.93 - refer to the #Software section below.
- IANA registration for application/zip in July 1993 (corresponds to version 2 of APPNOTE.TXT)
- November 2012 working draft of ISO/IEC WD 21320-1, Document Container File -- Part 1: Core Intended as restricted subset of APPNOTE 6.3.3 designed to promote interoperability.
- February 2013 committee draft of ISO/IEC CD 21320-1, Document Container File -- Part 1: Core Essentially the same as November 2012 working draft except that it mandates use of the UTF-8 indicator.
- Archive format info, including ZIP (from 1989, when ZIP was newly released)
- ZIP file header format (among other archive types)
- TorrentZip
- Note that in general there is no official file name encoding for ZIP files, and non-ASCII filenames are not generally well supported. The original implementation specified IBM Code Page 437 for filenames, but as many characters cannot be expressed in that encoding, the filename bytes have often been interpreted using the current system code page (implementation-dependent behaviour). There is a flag to specify UTF-8 as the encoding, but it is not supported in all major clients (e.g. Windows Explorer).
- Info-ZIP's "extra fields" documentation
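The UTF-8 filename flag mentioned in the note on filename encoding is bit 11 (0x0800) of each member's general purpose bit flag. The following Python sketch reports it per member using the standard zipfile module; `filename_encodings` and `make_sample` are hypothetical helper names. The sample relies on the behaviour of Python's own writer, which sets the flag only for names that are not pure ASCII; other tools may behave differently.

```python
import io
import zipfile

UTF8_FLAG = 0x0800  # general purpose bit 11: filename and comment are UTF-8

def filename_encodings(data):
    """Map each member name to the encoding its header flags imply."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {info.filename: ("UTF-8" if info.flag_bits & UTF8_FLAG
                                else "CP437 or system code page")
                for info in zf.infolist()}

def make_sample():
    """Build a ZIP file with one ASCII and one non-ASCII member name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("plain.txt", b"")
        zf.writestr("caf\u00e9.txt", b"")
    return buf.getvalue()

for name, encoding in filename_encodings(make_sample()).items():
    print(repr(name), encoding)
```

When the flag is clear, the name bytes carry no indication of their encoding, which is why readers fall back on CP437 or a system code page.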
Metaformat files
- Synalysis grammar file (for Hexinator / Synalize It!; more details)
Software
- Info-ZIP: Zip, UnZip
- 7-Zip
- zlib - The zlib library does not support ZIP format, but it is distributed with "minizip" code that supports most ZIP files.
- libzip - Uses zlib.
- libarchive - Uses zlib.
- zziplib
- Archive::ZZip: Perl bindings for zziplib
- miniz
- PKZIP
- Konvertor
- Deark (for analysis, or converting old compression methods)
Sample files
- https://github.com/corkami/pocs/tree/master/zip
- Examples that use the uncommon "Reduce" compression scheme: VISA_CRD.ZIP, 1608A.ZIP → D1-MAC.ZIP
Links
- Wikipedia: Zip (file format)
- Wikipedia: PKZIP
- Zip files all the way down (creating an infinitely-regressed ZIP file)
- ZIP101 an archive walkthrough
- Serve deepzoom images from a zip archive with openseadragon
- How are zlib, gzip and Zip related? What do they have in common and how are they different? - Response to StackOverflow question by zlib/gzip co-creator Mark Adler
- Does Microsoft OneDrive export large ZIP files that are corrupt? - Discusses an issue where large ZIP files generated by Microsoft OneDrive result in read errors when they are opened with tools like Info-ZIP and 7-Zip
- ZIP is Broken, Except it’s Not, Except it Is