Talk:EBZip
ebzip caveats
Intended as a place to note down issues, as well as tips and tricks, with eb-4.4.3.
Hardcoded to look specifically for a file that begins with "CATALOGS" (case-insensitive), or "CATALOGS.ebz"
Custom dictionaries don't always follow the standard convention for storing their content(s), like having "START.ebz" in the top level. Even if they do, they may also compress "CATALOGS" as "CATALOGS.ebz". eb-4.4.3 specifically looks for either "CATALOGS" or "CATALOGS.ebz" and either decompresses or compresses the files specified within, per the user's request. For example, ebzip -u will decompress and rename any of the listed files that are noted in either "CATALOGS" or "CATALOGS.ebz" and have the ".ebz" extension, while ignoring any other files in the same path, and it won't decompress "CATALOGS.ebz" itself even if instructed. There are ways to work around that:
Decompress "CATALOGS.ebz
"
In my case (see the sketch after these steps):
- Simply rename the decompressed "HONMON" to something like "HONMON.old",
- Copy "CATALOGS.ebz" there, but have it named "HONMON.ebz",
- Run decompression via the usual ebzip -u from the top level, where it doesn't contain "HONMON.ebz" but only "CATALOGS.ebz",
- Once decompressed, move the decompressed "HONMON" (not "HONMON.old") back to where "CATALOGS.ebz" was, renamed as "CATALOGS",
- Rename "HONMON.old" back to "HONMON".
Optionally, delete "CATALOGS.ebz", considering that it is now redundant.
Decompress other .ebz files that were not listed
Same as above, but instead of copying, you can just move the file (abbreviated sketch below), as:
- ebzip -u will only look for "CATALOGS" or "CATALOGS.ebz", and
- it saves having to manually delete the redundant file(s) afterwards.
Zero-length (un)compressed files not overwritten without manual intervention
When a path has zero-length files that are listed in either "CATALOGS" or "CATALOGS.ebz", ebzip -u, for instance, may not overwrite them. This may also apply if ebzip -uf is used. To fix this, just delete the zero-length files, first ensuring each has a duplicate that does not have the .ebz extension, then re-run the command. When compressing, the opposite is true: look for duplicate zero-length files with the .ebz extension, delete them, and run the utility to compress instead.
On modern Linux environments, these could be achieved via the following:
Enumerating:
find . -size 0 -type f
Deleting:
find . -size 0 -type f -exec sudo rm -iv "{}" \;
Care must be taken to avoid potential data loss when running this command.
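Building on the duplicate rule above, a plain POSIX sh sketch (my own, not part of ebzip) that deletes a zero-length file only when its counterpart, the same name with or without ".ebz", exists and is non-empty:
# Remove a zero-length file only if its non-empty counterpart is present.
find . -type f -size 0 | while read -r f; do
  case "$f" in
    *.ebz) twin="${f%.ebz}" ;;   # zero-length compressed file: check the plain twin
    *)     twin="$f.ebz" ;;      # zero-length plain file: check the compressed twin
  esac
  [ -s "$twin" ] && rm -iv "$f"
done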
Anonymoususer852 (talk) 10:10, 18 August 2025 (UTC)
LZMA benchmarks against other compression utilities
In response to the ebu maintainer's blog entry,[1] I decided to run a couple of benchmarks with different dictionaries, books, and other reference materials on my end. The packages tested and their versions are: kernel-6.15.8, zlib-1.3.1, libdeflate-1.24, xz-5.4.2, dwarfs-0.12.4, squashfs-tools-4.7.1.
zlib versus libdeflate versus LZMA benchmark (single electronic book)
For the following, file sizes are reported from ls -l, so sizes are reported in blocks.[2] The tested dictionaries, whether using no compression, zlib, libdeflate, or XZ, are tarballed for consistency. For the compressed file systems, XZ's LZMA2 is used. Both zlib and libdeflate have a maximum compression level of 5 (anything beyond 5 results in an error), whereas XZ goes up to level 9 and has further options like --extreme as well as various other tweaks; I will keep it simple and use xz -9evv instead.
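For reference, each per-dictionary run boils down to something like the following (archive name hypothetical):
tar -cf kanjidic.tar kanjidic/   # tarballed for consistency
xz -9evv -T1 -k kanjidic.tar     # -k keeps the input around for the separate -T0 run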
DWARFS example parameters:
mkdwarfs --input . --output=../JMdict_eng_2021-02-26_UTC.dwarfs --block-size-bits=26 --compression=lzma:level=9:extreme --schema-compression=lzma:level=9:extreme --metadata-compression=lzma:level=9:extreme --no-history --pack-metadata=all,force --file-hash=sha3-512 --no-history-timestamp --no-create-timestamp --no-history-command-line
SquashFS example parameters:
mksquashfs . ../JMdict_eng_2021-02-26_UTC.squashfs -b 1048576 -comp xz -Xdict-size 100% -progress -info -no-xattrs
Percentages are rounded to two decimal places, otherwise rounded up. The compression ratio against uncompressed is offered as a percentage.
Method | kanjidic | NHKAccent | Eijiro | Kenkyusha eiwa daijiten v6 | Kenkyusha waei daijiten v5
No compression | 130,385,920 | 1,578,055,680 | 1,374,771,200 | 484,945,920 | 328,192,000
ebzip -z -l 5 (zlib) | 42,035,200 | 1,393,920,000 | 379,975,680 | 117,596,160 | 109,721,600
Ratio vs. uncompressed | 32.24% | 88.33% | 27.63% | 24.25% | 33.43%
ebzip -z -l 5 (libdeflate) | 40,171,520 | 1,383,505,920 | 353,269,760 | 107,765,760 | 102,113,280
Ratio vs. uncompressed | 30.81% | 87.67% | 25.70% | 22.22% | 31.11%
xz -9evv -T1 (liblzma) | 21,607,608 | 1,350,900,536 | 197,134,768 | 55,220,788 | 50,109,852
Ratio vs. uncompressed | 16.57% | 85.61% | 14.34% | 11.39% | 15.27%
xz -9evv -T0 (liblzma) | 21,607,616 | 1,351,560,168 | 202,934,760 | 56,670,664 | 50,747,640
Ratio vs. uncompressed | 16.57% | 85.65% | 14.76% | 11.69% | 15.46%
mkdwarfs (liblzma) | 22,145,746 | 1,353,699,364 | 212,307,176 | 60,043,728 | 54,541,980
Ratio vs. uncompressed | 16.99% | 85.78% | 15.44% | 12.38% | 16.62%
mksquashfs (liblzma) | 31,793,152 | 1,367,302,144 | 272,068,608 | 81,805,312 | 160,956,416
Ratio vs. uncompressed | 24.38% | 86.64% | 19.79% | 16.87% | 49.04%
XZ compression results are provided for the sake of discussion only; more on this in the conclusion. While it shows extremely good results against all other utilities, its seek times are abysmally poor, worse than DWARFS with LZMA2 on machines with weaker hardware. The tests confirm that liblzma has better compression regardless, against both zlib and libdeflate.
NHKAccent is the exception here, with poor compression results across the board. This is due to the electronic book containing multimedia files, which won't compress well since they are already compressed.
It is worth noting here that compression/archiving utilities embracing liblzma can be configured to take advantage of multi-threading by default, at the expense of producing a potentially larger file. This, however, is a small trade-off for effectively making full use of the available CPU in the given environment.
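With xz itself, for instance, such a default can be set through its standard XZ_DEFAULTS environment variable; explicit command-line flags still override it:
export XZ_DEFAULTS="-T0"   # use all available cores by default, possibly at a small size cost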
Modern compressed file system with multiple electronic books shootout
The author of ebu mentioned the case of smartphones, which I personally find are also limited in performance, not just in storage capacity. SquashFS was, or is, used in Android devices,[3] so it makes sense to also try using it on non-system files.
For these tests, the sample data consists of ~189 dictionaries in a single directory. Both DWARFS and SquashFS are configured to use XZ's LZMA2 where possible, as it is known to be one of the best freely available compression methods (especially under Unix-like platforms). Both DWARFS and SquashFS parameters are the same as in the prior section. Due to the inherent way ebuzip works, requiring "CATALOGS" to be in the top directory, both variations of ebuzip, using either zlib or libdeflate, are omitted from this test. Besides, a single compression job per dictionary, not including other files in that path (e.g. "START"), would make the overall process time-consuming, if the process weren't already tedious enough to require extensive scripting.
Method | Size (blocks)
Uncompressed tar | 29,014,026,240
DWARFS | 10,613,886,184
SquashFS | 12,269,518,848
Where DWARFS shines is its ability to also save via data segmenting, in addition to file de-duplication. Even with the small penalty I've given DWARFS, having it use sha3-512 as the file hash, it still compresses the collection better and faster than SquashFS. Then again, the focus here is on overall file size, not performance in file creation.
Conclusion
Now of course, there are plenty of other compression utilities available, such as, but not limited to, Zstandard, BZip3, and LZO. The various results I have read indicate that while these other compressors trade size for better performance, they fall behind in final size. The benchmarks showcased here stem from the need to have as small a compressed file as possible with modern compression utilities and modern PC hardware, therefore making speed-wise performance testing out of scope, as are the likes of ZIP bombs or nested compressed files.
I have also tried to run mkfs.erofs in order to include it in the shootout of modern compressed file systems with multiple dictionaries. With version 1.8.10, no matter what parameters I set with LZMA, level=9 or level=109 (109 is the extreme variant), even with a lowered dictsize=, the system would run out of memory (16 GB of RAM) before even completing the file system creation.
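For reference, the attempted invocations were roughly of the following shape (output name and dictsize value hypothetical, reconstructed from the options named above):
mkfs.erofs -zlzma,level=109,dictsize=8M ../dictionaries.erofs .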
These benchmarks are in no way discrediting or disavowing the use of libdeflate over zlib. If file size is a concern, more modern compression algorithms and/or compression techniques could be considered. Modern EPWING viewer programs generally don't care whether the electronic book is compressed using zlib or not compressed at all, and there are many ways of handling decompression transparently in modern Unix-like environments. Even more so, the ever-growing hardware performance, as well as the affordability of storage capacity in modern computing, calls into question the use of what is widely considered a legacy compression algorithm for data archival. Last but not least, while single-threaded compression (as witnessed in the likes of xz -9evv -T1 above) might be extremely beneficial in reducing overhead, especially with LZMA2, it is also a slow process which may not translate well to modern, demanding contexts, where time-honored traditions are cast away for practical, convenient, modernized, and easy-to-use solutions.
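As one example of such transparent handling, a DWARFS image from the shootout above can be FUSE-mounted and the viewer simply pointed at the mount point (mount point hypothetical):
dwarfs JMdict_eng_2021-02-26_UTC.dwarfs /mnt/epwing   # unmount later with: fusermount -u /mnt/epwing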
- ↑ EPWINGを圧縮する (libdeflateで) / Compressing EPWING via libdeflate - Kazuhiro's blog
- ↑ file block size - difference between stat and ls - Unix & Linux - Stack Exchange
- ↑ SquashFS explanation for Android system - Android Enthusiasts - Stack Exchange
Anonymoususer852 (talk) 06:34, 20 August 2025 (UTC)