Zstandard dictionary

From Just Solve the File Format Problem
Revision as of 17:17, 13 May 2019 by Effect2 (Talk | contribs)

File Format
Name Zstandard dictionary
Ontology
Released 2015[1]

The compression library Zstandard (also known as "Zstd") can build an external "dictionary" from a set of training files. The dictionary can then be used to compress files of the same type as the training files more efficiently, in terms of compression ratio as well as compression and decompression speed. For example, if a dictionary is "trained" on a sample set of email messages, anyone with access to the dictionary will be able to compress another email message more efficiently. The trick is that the commonalities are kept in the dictionary file; therefore, anyone wishing to decompress the email must already have that same dictionary.[2]

Zstandard compression dictionaries each have a 32-bit integer associated with them called the "dictionary ID". Normally, this ID is a completely random value[3], but the specification says that dictionary IDs less than or equal to 2^15 - 1 = 32767 or greater than or equal to 2^31 = 2147483648 "are reserved [perhaps for technical reasons, or for allocation by an imagined future standards body] and should not be used"[4]. If a compression frame uses a dictionary, it will store the dictionary ID, little-endian, in its header.[5]
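The reserved ranges quoted above can be expressed as a small check. This is only a sketch of the arithmetic; the function and constant names are hypothetical and are not part of the Zstandard API.

```python
# Bounds of the reserved ranges as quoted from the format specification.
LOW_RESERVED_MAX = 2**15 - 1   # 32767
HIGH_RESERVED_MIN = 2**31      # 2147483648


def is_reserved_dictionary_id(dict_id: int) -> bool:
    """Return True if dict_id lies in one of the two reserved ranges."""
    if not 0 <= dict_id <= 0xFFFFFFFF:
        raise ValueError("a dictionary ID must fit in 32 bits")
    return dict_id <= LOW_RESERVED_MAX or dict_id >= HIGH_RESERVED_MIN
```

Under this check, any value from 32768 up to and including 2147483647 is outside the reserved ranges.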

Dictionaries work in part by serving as a reference for byte sequences that occur both in the data being compressed and in the dictionary: when such a common sequence is encountered during compression, a reference to its position in the dictionary can be emitted instead, saving space.[4] This means that an arbitrary file of any type can be used as a dictionary, and indeed Zstandard allows for this[4], but the specialized dictionary format can probably be expected to be more widely used, especially because it has additional compression-aiding features beyond this. Examination of a test file with a hex editor suggests that frames compressed with these any-file dictionaries get a dictionary ID field with a length of 0, which the API represents as the integer 0[6].
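The observation about a zero-length dictionary ID field can be checked without a hex editor. The sketch below reads the dictionary ID from a frame header. The frame magic number (28 B5 2F FD) and the header layout it relies on (a descriptor byte whose low two bits select a field size of 0, 1, 2, or 4 bytes, and whose bit 5 indicates that the Window_Descriptor byte is omitted) are taken from the format specification rather than from this article, and the function name is hypothetical. Skippable frames are not handled.

```python
ZSTD_FRAME_MAGIC = bytes.fromhex("28B52FFD")  # 0xFD2FB528 stored little-endian

# Dictionary_ID_flag (low two bits of the descriptor) -> field size in bytes.
DICT_ID_FIELD_SIZES = (0, 1, 2, 4)


def frame_dictionary_id(frame: bytes) -> int:
    """Return the dictionary ID stored in a frame header, or 0 if the field is empty."""
    if len(frame) < 5 or frame[:4] != ZSTD_FRAME_MAGIC:
        raise ValueError("not a Zstandard frame")
    descriptor = frame[4]
    single_segment = bool(descriptor & 0x20)
    field_size = DICT_ID_FIELD_SIZES[descriptor & 0x03]
    # The Window_Descriptor byte is present only when Single_Segment_flag is unset.
    offset = 5 if single_segment else 6
    field = frame[offset:offset + field_size]
    if len(field) != field_size:
        raise ValueError("truncated frame header")
    # An empty field converts to 0, matching the value the API reports.
    return int.from_bytes(field, "little")
```

A return value of 0 therefore means either that no dictionary was used or that the dictionary had no ID, as with an arbitrary file used as a dictionary.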

Dictionaries in the specialized format begin with a magic number, the bytes 37 A4 30 EC[7], followed by the little-endian dictionary ID[4]. As of May 2019, the zstd command-line tool names a newly trained dictionary "dictionary" by default, with no extension.[8] At the time of writing, no Zstd dictionaries can easily be found on the Web, and it is unclear what naming convention, if any, exists in non-public contexts.
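Because there is no conventional file extension, the magic number is the practical way to recognize a dictionary file. The following is a minimal sketch that checks the first four bytes and reads the 32-bit little-endian ID that follows; the function name is hypothetical, and nothing beyond the first eight bytes is parsed.

```python
import struct

DICT_MAGIC = bytes.fromhex("37A430EC")  # ZSTD_MAGIC_DICTIONARY as stored on disk


def read_dictionary_id(data: bytes):
    """Return the dictionary ID, or None if data lacks the dictionary magic number."""
    if len(data) < 8 or data[:4] != DICT_MAGIC:
        return None
    (dict_id,) = struct.unpack_from("<I", data, 4)
    return dict_id
```

For a file on disk, something like `read_dictionary_id(open("dictionary", "rb").read(8))` would be enough, since only the first eight bytes are examined. A result of None means only that the file is not in the specialized format; it could still be usable as an any-file dictionary.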

References

  1. https://github.com/facebook/zstd/releases/tag/v0.4.3
  2. https://github.com/facebook/zstd/blob/dev/README.md
  3. https://github.com/facebook/zstd/blob/dev/lib/dictBuilder/zdict.h → definition of ZDICT_params_t
  4. https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#dictionary-format
  5. https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#zstandard-frames
  6. https://github.com/facebook/zstd/blob/dev/lib/decompress/zstd_decompress.c → function ZSTD_getFrameHeader_advanced
  7. https://github.com/facebook/zstd/blob/dev/lib/zstd.h → definition of ZSTD_MAGIC_DICTIONARY
  8. https://github.com/facebook/zstd/blob/69baaee3e42f90dedea2c946bc19bfeac4e782ee/programs/zstdcli.c → definition of g_defaultDictName; this is also verified by printing the CLI help