Talk:EBZip
ebzip caveats
Intended as a place to note down issues, as well as tips and tricks, with eb-4.4.3.
Hardcoded to look specifically for a file that begins with "CATALOGS" (case-insensitive), or "CATALOGS.ebz"
Custom dictionaries don't always follow the standard convention for storing their content(s), like having "START.ebz" in the top level. Even if they do, they may also compress "CATALOGS" as "CATALOGS.ebz". eb-4.4.3 specifically looks for either "CATALOGS" or "CATALOGS.ebz" and either decompresses or compresses the files specified within, per the user's request. For example, ebzip -u will decompress and rename any of the listed files that are noted in either "CATALOGS" or "CATALOGS.ebz" and have the ".ebz" extension, while ignoring any other files in the same path, and it won't decompress "CATALOGS.ebz" itself even if instructed. There are ways to work around that:
Decompress "CATALOGS.ebz
"
In my case (see the sketch after these steps):
- Simply rename the decompressed "HONMON" to something like "HONMON.old",
- Copy "CATALOGS.ebz" there, but have it named "HONMON.ebz",
- Run decompression via the usual ebzip -u from the top level, where it doesn't contain "HONMON.ebz" but only "CATALOGS.ebz",
- Once decompressed, move the decompressed "HONMON" (not "HONMON.old") back to where "CATALOGS.ebz" was, renamed as "CATALOGS",
- Rename "HONMON.old" back to "HONMON".
Optionally, delete "CATALOGS.ebz", considering that it is now redundant.
Decompress other .ebz files that were not listed
Same as above, but instead of copying, you can just move the file (abbreviated sketch below), as:
- ebzip -u will only look for "CATALOGS" or "CATALOGS.ebz", and
- it saves having to manually delete the redundant file(s) afterwards.
Zero-length (un)compressed files not overwritten without manual intervention
When a path has zero-length files that are listed in either "CATALOGS" or "CATALOGS.ebz", ebzip -u, for instance, may not overwrite them. This may also apply if ebzip -uf is used. To fix this, just delete the zero-length files, first ensuring each has a duplicate that does not have the .ebz extension, then re-run the command. When compressing, the opposite is true: look for duplicate zero-length files with the .ebz extension, delete them, and run the utility to compress instead.
On modern Linux environments, these could be achieved via the following:
Enumerating:
find . -size 0 -type f
Deleting:
find . -size 0 -type f -exec sudo rm -iv "{}" \;
Care must be taken to avoid potential data loss when running this command.
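Building on the duplicate rule above, a plain POSIX sh sketch (my own, not part of ebzip) that deletes a zero-length file only when its counterpart, the same name with or without ".ebz", exists and is non-empty:
# Remove a zero-length file only if its non-empty counterpart is present.
find . -type f -size 0 | while read -r f; do
  case "$f" in
    *.ebz) twin="${f%.ebz}" ;;   # zero-length compressed file: check the plain twin
    *)     twin="$f.ebz" ;;      # zero-length plain file: check the compressed twin
  esac
  [ -s "$twin" ] && rm -iv "$f"
done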
Anonymoususer852 (talk) 10:10, 18 August 2025 (UTC)
LZMA benchmarks against other compression utilities
In response to the ebu maintainer's blog entry,[1] I decided to run a couple of benchmarks with different dictionaries, books, and other reference materials on my end. The packages tested and their versions are: kernel-6.15.8, zlib-1.3.1, libdeflate-1.24, xz-5.4.2, dwarfs-0.12.4, squashfs-tools-4.7.1.
zlib versus libdeflate versus LZMA benchmark (single electronic book)
For the following, file sizes are reported from ls -l, so sizes are reported in blocks.[2] The tested dictionaries, whether using no compression, zlib, libdeflate, or XZ, are tarballed for consistency. For the compressed file systems, XZ's LZMA2 is used. Both zlib and libdeflate have a maximum compression level of 5 (anything beyond 5 results in an error), whereas XZ goes up to level 9 and has further options like --extreme as well as various other tweaks; I will keep it simple and use xz -9evv instead.
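For reference, each per-dictionary run boils down to something like the following (archive name hypothetical):
tar -cf kanjidic.tar kanjidic/   # tarballed for consistency
xz -9evv -T1 -k kanjidic.tar     # -k keeps the input around for the separate -T0 run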
DWARFS example parameters:
mkdwarfs --input . --output=../JMdict_eng_2021-02-26_UTC.dwarfs --block-size-bits=26 --compression=lzma:level=9:extreme --schema-compression=lzma:level=9:extreme --metadata-compression=lzma:level=9:extreme --no-history --pack-metadata=all,force --file-hash=sha3-512 --no-history-timestamp --no-create-timestamp --no-history-command-line
SquashFS example parameters:
mksquashfs . ../JMdict_eng_2021-02-26_UTC.squashfs -b 1048576 -comp xz -Xdict-size 100% -progress -info -no-xattrs
Percentages are rounded to two decimal places, otherwise rounded up. The compression ratio against uncompressed is offered as a percentage.
Method | kanjidic | NHKAccent | Eijiro | Kenkyusha eiwa daijiten v6 | Kenkyusha waei daijiten v5
No compression | 130,385,920 | 1,578,055,680 | 1,374,771,200 | 484,945,920 | 328,192,000
ebzip -z -l 5 (zlib) | 42,035,200 | 1,393,920,000 | 379,975,680 | 117,596,160 | 109,721,600
Ratio vs. uncompressed | 32.24% | 88.33% | 27.63% | 24.25% | 33.43%
ebzip -z -l 5 (libdeflate) | 40,171,520 | 1,383,505,920 | 353,269,760 | 107,765,760 | 102,113,280
Ratio vs. uncompressed | 30.81% | 87.67% | 25.70% | 22.22% | 31.11%
xz -9evv -T1 (liblzma) | 21,607,608 | 1,350,900,536 | 197,134,768 | 55,220,788 | 50,109,852
Ratio vs. uncompressed | 16.57% | 85.61% | 14.34% | 11.39% | 15.27%
xz -9evv -T0 (liblzma) | 21,607,616 | 1,351,560,168 | 202,934,760 | 56,670,664 | 50,747,640
Ratio vs. uncompressed | 16.57% | 85.65% | 14.76% | 11.69% | 15.46%
mkdwarfs (liblzma) | 22,145,746 | 1,353,699,364 | 212,307,176 | 60,043,728 | 54,541,980
Ratio vs. uncompressed | 16.99% | 85.78% | 15.44% | 12.38% | 16.62%
mksquashfs (liblzma) | 31,793,152 | 1,367,302,144 | 272,068,608 | 81,805,312 | 160,956,416
Ratio vs. uncompressed | 24.38% | 86.64% | 19.79% | 16.87% | 49.04%
XZ compression results are provided for the sake of discussion only; more on this in the conclusion. While it shows extremely good results against all other utilities, its seek times are abysmally poor, worse than DWARFS with LZMA2 on machines with weaker hardware. The tests confirm that liblzma has better compression regardless, against both zlib and libdeflate.
NHKAccent is the exception here, with poor compression results across the board. This is due to the electronic book containing multimedia files, which won't compress well since they are already compressed.
It is worth noting here that compression/archiving utilities embracing liblzma can be configured to take advantage of multi-threading by default, at the expense of producing a potentially larger file. This, however, is a small trade-off for effectively making full use of the available CPU in the given environment.
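With xz itself, for instance, such a default can be set through its standard XZ_DEFAULTS environment variable; explicit command-line flags still override it:
export XZ_DEFAULTS="-T0"   # use all available cores by default, possibly at a small size cost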
Modern compressed file system with multiple electronic books shootout
The author of ebu mentioned the case of smartphones, which I personally find are also limited in performance, not just in storage capacity. SquashFS was, or is, used in Android devices,[3] so it makes sense to also try using it on non-system files.
For these tests, the sample data consists of ~189 dictionaries in a single directory. Both DWARFS and SquashFS are configured to use XZ's LZMA2 where possible, as it is known to be one of the best freely available compression methods (especially under Unix-like platforms). Both DWARFS and SquashFS parameters are the same as in the prior section. Due to the inherent way ebuzip works, requiring "CATALOGS" to be in the top directory, both variations of ebuzip, using either zlib or libdeflate, are omitted from this test. Besides, a single compression job per dictionary, not including other files in that path (e.g. "START"), would make the overall process time-consuming, if the process weren't already tedious enough to require extensive scripting.
Method | Size (blocks)
Uncompressed tar | 29,014,026,240
DWARFS | 10,613,886,184
SquashFS | 12,269,518,848
Where DWARFS shines is its ability to also save via data segmenting, in addition to file de-duplication. Even with the small penalty I've given DWARFS, having it use sha3-512 as the file hash, it still compresses the collection better and faster than SquashFS. Then again, the focus here is on overall file size, not performance in file creation.
Conclusion
Now of course, there are plenty of other compression utilities available, such as, but not limited to, Zstandard, BZip3, and LZO. The various results I have read indicate that while these other compressors trade size for better performance, they fall behind in final size. The benchmarks showcased here stem from the need to have as small a compressed file as possible with modern compression utilities and modern PC hardware, therefore making speed-wise performance testing out of scope, as are the likes of ZIP bombs or nested compressed files.
I have also tried to run mkfs.erofs in order to include it in the shootout of modern compressed file systems with multiple dictionaries. With version 1.8.10, no matter what parameters I set with LZMA, level=9 or level=109 (109 is the extreme variant), even with a lowered dictsize=, the system would run out of memory (16 GB of RAM) before even completing the file system creation.
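For reference, the attempted invocations were roughly of the following shape (output name and dictsize value hypothetical, reconstructed from the options named above):
mkfs.erofs -zlzma,level=109,dictsize=8M ../dictionaries.erofs .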
These benchmarks are in no way discrediting or disavowing the use of libdeflate over zlib. If file size is a concern, more modern compression algorithms and/or compression techniques could be considered. Modern EPWING viewer programs generally don't care whether the electronic book is compressed using zlib or not compressed at all, and there are many ways of handling decompression transparently in modern Unix-like environments. Even more so, the ever-growing hardware performance, as well as the affordability of storage capacity in modern computing, calls into question the use of what is widely considered a legacy compression algorithm for data archival. Last but not least, while single-threaded compression (as witnessed in the likes of xz -9evv -T1 above) might be extremely beneficial in reducing overhead, especially with LZMA2, it is also a slow process which may not translate well to modern, demanding contexts, where time-honored traditions are cast away for practical, convenient, modernized, and easy-to-use solutions.
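As one example of such transparent handling, a DWARFS image from the shootout above can be FUSE-mounted and the viewer simply pointed at the mount point (mount point hypothetical):
dwarfs JMdict_eng_2021-02-26_UTC.dwarfs /mnt/epwing   # unmount later with: fusermount -u /mnt/epwing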
- ↑ EPWINGを圧縮する (libdeflateで) / Compressing EPWING via libdeflate - Kazuhiro's blog
- ↑ file block size - difference between stat and ls - Unix & Linux - Stack Exchange
- ↑ SquashFS explanation for Android system - Android Enthusiasts - Stack Exchange
Anonymoususer852 (talk) 06:34, 20 August 2025 (UTC)