Softdisk Text Compressor
Revision as of 21:10, 26 October 2022
The Softdisk Text Compressor was yet another format used by Softdisk Publishing on its diskmagazines in the 1990s. It was not released to the public; it was an internal utility used to prepare text files for publication, with the decompressor embedded in the "shell" program that displayed the articles when a diskmagazine issue was read. It was developed in 1993 and used on some issues of Softdisk PC. It was a simple compression routine designed to squeeze a few bytes out of typical English-language ASCII files at a time when saving a handful of bytes still mattered. The diskmagazines were published on floppy disks and were expected to run directly from the disk without being installed on a hard drive, so shrinking a file enough to occupy one fewer 1K disk block could be what resolved a "disk full" error during the deadline crunch to get an issue out. (You kids have it so easy with your terabyte hard drives...)
The basic technique was to represent a set of specific character strings (taken from a fixed, hand-generated list of common character sequences in text files used by Softdisk) as single bytes in the compressed file. The bytes used for this purpose were those corresponding to ASCII control characters (other than CR and LF) and 8-bit characters (#128-#254, with #255 reserved as an escape character). Character #255 signaled either an escaped special character to be treated literally or a request to repeat a character, as described below.
The format was as follows:
The first six bytes were flag bytes to indicate the file was of this format; they were expected to be (hex) 03 43 54 30 30 31, which was Control-C followed by CT001.
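As a minimal sketch, the signature check could look like the following in Python. The names MAGIC and has_ctx_signature are made up for this illustration and do not come from Softdisk's own code.

```python
# Six signature bytes: hex 03 43 54 30 30 31, i.e. Control-C followed by "CT001".
MAGIC = b"\x03CT001"


def has_ctx_signature(data: bytes) -> bool:
    """Return True if data starts with the six-byte signature."""
    return data[:6] == MAGIC
```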
Next was the original filename of the uncompressed file, as a variable-length null-terminated string.
Then followed a table of fixed character sequences which were to be substituted for particular characters in the file. First there were 30 sequences of up to 5 characters, which were the uncompressed strings corresponding to the file input characters #0 - #9, #11 - #12, and #14 - #31 (decimal), in other words the C0 controls except for carriage return and linefeed. Each such sequence terminated at 5 characters or when a null (#0) was encountered, whichever came first.
Next was a table of 127 two-character sequences, corresponding to the file input characters #128-#254.
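The header layout described so far (signature, filename, and the two substitution tables) could be parsed roughly as in the Python sketch below. It rests on assumptions the description does not settle: that a short entry in the first table is followed by a single null byte which is consumed, that a full five-character entry has no terminator, that every entry in the second table is exactly two bytes, and that the filename can be decoded as Latin-1. The function and constant names are hypothetical.

```python
MAGIC = b"\x03CT001"
# Control codes #0-#31 except LF (#10) and CR (#13): 30 codes in total.
LONG_CODES = [c for c in range(32) if c not in (10, 13)]
# High-bit codes #128-#254: 127 codes in total (#255 is the escape byte).
PAIR_CODES = list(range(128, 255))


def parse_header(data: bytes):
    """Return (filename, long_table, pair_table, offset of compressed data)."""
    if data[:6] != MAGIC:
        raise ValueError("missing Softdisk Text Compressor signature")
    pos = 6
    end = data.index(b"\x00", pos)       # ValueError if no terminator is found
    name = data[pos:end].decode("latin-1")  # assumed encoding
    pos = end + 1

    long_table = {}
    for code in LONG_CODES:
        seq = bytearray()
        while len(seq) < 5:              # stop at 5 characters...
            byte = data[pos]             # IndexError if the file is truncated
            pos += 1
            if byte == 0:                # ...or at a null, whichever is first
                break
            seq.append(byte)
        long_table[code] = bytes(seq)

    pair_table = {}
    for code in PAIR_CODES:
        pair = data[pos:pos + 2]         # assumed fixed width of two bytes
        if len(pair) != 2:
            raise ValueError("truncated two-character table")
        pair_table[code] = pair
        pos += 2
    return name, long_table, pair_table, pos
```

The returned offset marks where the compressed data begins, so the caller can slice the rest of the file from that position.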
After this came the compressed data itself. The characters (bytes) were to be read one at a time, and treated as follows (with anything not itemized below, i.e. the ASCII printable characters, treated as literal characters and output as is):
If it is a CR (#13), output a line break. (Linefeeds were stripped and ignored to save that precious one byte per line.)
If it is #0-#9, #11-#12, or #14-#31 (the other control characters), replace it with the corresponding string sequence in the up-to-five-character table that was read in earlier.
If it is #128-#254, replace it with the corresponding string sequence in the two-character table that was read in earlier.
If it is #255, read the next character (byte). If it is an ASCII printable character (#32-#127), subtract 30 from the byte value and consider this quantity n. Then read one more character c, and output that character n times. (This is handy for encoding sequences of repeated characters such as dashed lines.)
If the character after the #255 character is anything else, output it as a literal. This allows control characters and high-bit characters to be included in the file, though each one becomes a two-byte sequence in the "compressed" data, so a file containing many such characters could actually get bigger on compression.
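The rules above can be sketched as a small Python decoding loop. This is an illustration and not the original Softdisk routine: decode_body is a made-up name, the two tables are assumed to be dictionaries mapping a byte value to its replacement bytes, and the line break is assumed to be CR LF (adjustable through the newline parameter). Bytes that none of the rules mention, such as a stray #10, are passed through as literals.

```python
def decode_body(body: bytes, long_table: dict, pair_table: dict,
                newline: bytes = b"\r\n") -> bytes:
    """Expand the compressed data stream using the two substitution tables."""
    out = bytearray()
    i = 0
    while i < len(body):
        b = body[i]
        i += 1
        if b == 13:                      # CR: output a line break
            out += newline
        elif b in long_table:            # control code: up-to-five-char string
            out += long_table[b]
        elif 128 <= b <= 254:            # high-bit code: two-character string
            out += pair_table[b]
        elif b == 255:                   # escape byte
            nxt = body[i]                # IndexError if the data is truncated
            i += 1
            if 32 <= nxt <= 127:         # repeat: count is byte value minus 30
                count = nxt - 30
                ch = body[i]
                i += 1
                out += bytes([ch]) * count
            else:                        # escaped literal
                out.append(nxt)
        else:                            # printable ASCII and anything else
            out.append(b)
    return bytes(out)
```

For example, with #1 mapped to "the " and #128 mapped to "th", the bytes 01 63 61 74 0D 80 69 73 decode to "the cat", a line break, and "this".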
This compression routine can be used on any files in an 8-bit character encoding, but works best on ones limited to 7-bit ASCII.
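To see why 8-bit text fares worse, here is a hypothetical Python sketch of only the escaping step an encoder would need. The original encoder is not documented here, so this is not its actual logic: it performs no table substitution or run encoding, it simply drops linefeeds, keeps CR and 7-bit characters from #32 to #127 as they are, and prefixes every other byte with #255.

```python
def escape_literals(data: bytes) -> bytes:
    """Escape bytes that would otherwise be read as substitution codes."""
    out = bytearray()
    for b in data:
        if b == 10:                      # linefeeds are stripped
            continue
        if b == 13 or 32 <= b <= 127:    # CR and 7-bit characters pass through
            out.append(b)
        else:                            # control or high-bit byte: escape it
            out += bytes([255, b])
    return bytes(out)
```

Under these assumptions a plain ASCII line keeps its length (minus the linefeed), while every control or high-bit byte costs two bytes in the output.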
See also
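Sample files
Sample compressed text file: https://www.dan.info/sampledata/X03ONLIN.CTX
https://telparia.com/fileFormatSamples/document/softdiskTextCompressed/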
Decoder/Encoder
A cleanroom implementation can be found at https://git.fsfe.org/art1pirat/ctxer

