Zstandard dictionary
Latest revision as of 17:17, 13 May 2019
The compression library Zstandard (also known as "Zstd") can create an external "dictionary" from a set of training files. The dictionary can then be used to compress files of the same type as the training files more efficiently, in terms of compression ratio as well as compression and decompression speed. For example, if a dictionary is "trained" on a sample set of email messages, anyone with access to the dictionary will be able to compress another email file more efficiently. The trick is that the commonalities are kept in the dictionary file; therefore, anyone wishing to decompress the email must already have that same dictionary.[2]
Each Zstandard compression dictionary has an associated 32-bit integer called the "dictionary ID". Normally, this ID is a random value[3], but the specification says that dictionary IDs less than or equal to 2^15 - 1 = 32767 or greater than or equal to 2^31 = 2147483648 "are reserved [perhaps for technical reasons, or for allocation by an imagined future standards body] and should not be used"[4]. If a compression frame uses a dictionary, it stores the dictionary ID, little-endian, in its header.[5]
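The reserved ranges quoted above can be expressed as a small check. This is a minimal sketch, not part of the Zstandard library: the function name and constants are hypothetical and only restate the two bounds given in the text.

```python
# Bounds restated from the text above; the names are illustrative only.
RESERVED_LOW_MAX = 2**15 - 1   # 32767: IDs at or below this are reserved
RESERVED_HIGH_MIN = 2**31      # 2147483648: IDs at or above this are reserved


def is_reserved_dictionary_id(dict_id: int) -> bool:
    """Return True if a 32-bit dictionary ID falls in a reserved range."""
    if not 0 <= dict_id <= 0xFFFFFFFF:
        raise ValueError("dictionary ID must fit in an unsigned 32-bit integer")
    return dict_id <= RESERVED_LOW_MAX or dict_id >= RESERVED_HIGH_MIN


if __name__ == "__main__":
    for candidate in (32767, 32768, 2**31 - 1, 2**31):
        print(candidate, is_reserved_dictionary_id(candidate))
```

Under these bounds, the usable range for a non-reserved ID runs from 32768 to 2147483647 inclusive.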
Internally, dictionaries work in part by serving as a reference for sequences of bytes that occur both in the real stream and in the dictionary, so that, when a common sequence is encountered during compression, a reference can be made to its position in the dictionary, saving space.[4] This means that an arbitrary file of any type can be used as a dictionary, and indeed Zstandard allows for this[4], but the specialized dictionary format can probably be expected to be more widely used, especially because it has additional compression-aiding features beyond this. Examination of a test file with a hex editor suggests that frames compressed with such any-file dictionaries get a dictionary ID field with a length of 0, which the API represents as the integer 0[6].
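The hex-editor observation can be reproduced programmatically. The sketch below reads the dictionary ID field from the start of a frame. It relies on details that are not stated in this article and are taken from the frame format document cited above, as the author of this sketch reads it: the frame magic bytes 28 B5 2F FD, a one-byte frame header descriptor whose two lowest bits select a dictionary ID field of 0, 1, 2 or 4 bytes, and a window descriptor byte that is absent when the single-segment bit (bit 5) is set. The function name is hypothetical.

```python
# Assumed from the Zstandard frame format document, not from this article.
ZSTD_FRAME_MAGIC = bytes.fromhex("28b52ffd")
DICT_ID_FIELD_SIZES = (0, 1, 2, 4)  # indexed by the 2-bit dictionary ID flag


def frame_dictionary_id(frame: bytes) -> int:
    """Return the dictionary ID stored in a frame header (0 if the field is empty)."""
    if len(frame) < 5 or frame[:4] != ZSTD_FRAME_MAGIC:
        raise ValueError("not a Zstandard frame")
    descriptor = frame[4]
    single_segment = (descriptor >> 5) & 1
    field_size = DICT_ID_FIELD_SIZES[descriptor & 0b11]
    # The window descriptor byte is skipped only when it is present.
    offset = 5 + (0 if single_segment else 1)
    field = frame[offset:offset + field_size]
    if len(field) != field_size:
        raise ValueError("truncated frame header")
    # An empty field decodes to 0, matching what the API reports.
    return int.from_bytes(field, "little")
```

A zero-length field decoding to 0 is consistent with the observation above that frames using any-file dictionaries carry no dictionary ID.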
Zstandard dictionaries begin with a magic number, 37 A4 30 EC[7], followed by the little-endian dictionary ID[4]. As of May 2019, the zstd command-line tool names a trained dictionary "dictionary" by default, with no extension.[8] At the time of writing, no Zstd dictionaries can easily be found on the Web, and it is unclear what convention, if any, exists in non-public contexts for naming them.
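The header layout just described (four magic bytes followed by a four-byte little-endian ID) is enough to identify a dictionary in the specialized format. The following is a minimal sketch under that description only; the function name is hypothetical, and returning None for input without the magic number is a design choice of this example, meant to cover any-file dictionaries that lack the header.

```python
from typing import Optional

# Magic number bytes as given in the text: 37 A4 30 EC.
DICTIONARY_MAGIC = bytes.fromhex("37a430ec")


def read_dictionary_id(data: bytes) -> Optional[int]:
    """Return the dictionary ID, or None if the data lacks the dictionary magic number."""
    if len(data) < 8 or data[:4] != DICTIONARY_MAGIC:
        return None
    # The ID is the 4 bytes immediately after the magic number, little-endian.
    return int.from_bytes(data[4:8], "little")


if __name__ == "__main__":
    # "dictionary" is the default file name mentioned above.
    with open("dictionary", "rb") as handle:
        print(read_dictionary_id(handle.read(8)))
```

Only the first eight bytes are needed, so the file does not have to be read in full.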
References
1. https://github.com/facebook/zstd/releases/tag/v0.4.3
2. https://github.com/facebook/zstd/blob/dev/README.md
3. https://github.com/facebook/zstd/blob/dev/lib/dictBuilder/zdict.h → definition of ZDICT_params_t
4. https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#dictionary-format
5. https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#zstandard-frames
6. https://github.com/facebook/zstd/blob/dev/lib/decompress/zstd_decompress.c → function ZSTD_getFrameHeader_advanced
7. https://github.com/facebook/zstd/blob/dev/lib/zstd.h → definition of ZSTD_MAGIC_DICTIONARY
8. https://github.com/facebook/zstd/blob/69baaee3e42f90dedea2c946bc19bfeac4e782ee/programs/zstdcli.c → definition of g_defaultDictName; this is also verified by printing the CLI help