First post, by Harry Potter
Hi! I'm sorry for posting this here, as it's somewhat off-topic for this forum. I'm doing so because the only help I've gotten elsewhere was to use adaptive Huffman codes, which wouldn't suit my case. I'm working on a program called PrintTok2, a way to compress text for text adventures and similar programs on old 8- and 16-bit computers. PrintTok1 used tokenization to compress often-occurring snippets of text but didn't compress literals or offer automatic compression. PrintTok2 is supposed to solve both problems.

Right now, PrintTok2 has two versions. The first uses a modified version of the Z-Machine's text encoding and compresses to 5 bits per character. The Z-Machine uses three dictionaries to support three different sets of literals and allocates two code values to switch between them. I add more literals to the alphabetic dictionary, namely more punctuation, so PrintTok2 won't have to switch dictionaries as often, at the cost of one extra bit for some less-often-used letters and for the extra punctuation. Each dictionary switch also carries an extra bit saying whether it applies to just the current character or to several characters, and the encoding supports tokens as well. A sketch of the dictionary-switch idea follows below.

The other version is a naive approach and doesn't compress literals. It is to be divided into three sub-versions: the first gives up to 90 one-byte tokens and 128 two-byte tokens; the second gives 128 one-byte and 128 two-byte tokens, at the cost that, on some systems, character assignments may need to be rerouted; and the third sacrifices 32 one-byte tokens from the second to provide BPE-like functionality. I also have ways to add at least 128 more two-byte tokens. Beyond that, I'm planning a 7-bit approach as a middle ground between the other techniques.

Does anybody here have any ideas to improve these? Note that adaptive Huffman coding wouldn't suit this, as it relies on previously decoded text to aid decompression, and PrintTok2's targets have to be able to print each string individually.
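To make the dictionary-switch idea concrete, here's a minimal C sketch of a 5-bit decoder with the extra shift/lock bit. The dictionary tables, the code assignments (0 = end of string, 1 and 2 = shifts, 3-5 left for tokens), the relative-shift rule, and the placement of the lock bit are placeholder assumptions for illustration, not the Z-Machine's actual layout or my final one:

#include <stdio.h>
#include <stdint.h>

/* Three 26-entry dictionaries; codes 6..31 index into the current one.
 * These tables and code values are illustrative placeholders only.   */
static const char *ALPHA[3] = {
    "abcdefghijklmnopqrstuvwxyz",
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    " .,!?;:'\"-()0123456789\n_/*"
};

enum { CODE_END = 0, CODE_SHIFT1 = 1, CODE_SHIFT2 = 2 };  /* 3..5 free for tokens */

/* Read the next n bits (LSB first) from a packed bit stream. */
static unsigned getbits(const uint8_t *buf, unsigned *pos, unsigned n)
{
    unsigned v = 0, i;
    for (i = 0; i < n; i++, (*pos)++)
        v |= (unsigned)((buf[*pos >> 3] >> (*pos & 7)) & 1u) << i;
    return v;
}

/* Write the next n bits (LSB first); buf must start zeroed. */
static void putbits(uint8_t *buf, unsigned *pos, unsigned v, unsigned n)
{
    unsigned i;
    for (i = 0; i < n; i++, (*pos)++)
        if ((v >> i) & 1u)
            buf[*pos >> 3] |= (uint8_t)(1u << (*pos & 7));
}

static void decode(const uint8_t *buf)
{
    unsigned pos = 0, base = 0, cur = 0;     /* base = the locked dictionary */
    for (;;) {
        unsigned c = getbits(buf, &pos, 5);
        if (c == CODE_END)
            break;
        if (c == CODE_SHIFT1 || c == CODE_SHIFT2) {
            /* The extra flag bit: 1 = lock the new dictionary in,
             * 0 = use it for the next character only.            */
            unsigned lock = getbits(buf, &pos, 1);
            cur = (base + (c == CODE_SHIFT1 ? 1 : 2)) % 3;   /* relative shift */
            if (lock)
                base = cur;
            continue;
        }
        putchar(ALPHA[cur][c - 6]);
        cur = base;                           /* a one-shot shift expires here */
    }
}

int main(void)
{
    uint8_t buf[16] = {0};
    unsigned pos = 0;
    /* Encode "Hi!": one-shot shift to A1, 'H', 'i', one-shot shift to A2, '!'. */
    putbits(buf, &pos, CODE_SHIFT1, 5); putbits(buf, &pos, 0, 1);
    putbits(buf, &pos, 6 + 7, 5);             /* 'H' in A1 */
    putbits(buf, &pos, 6 + 8, 5);             /* 'i' in A0 */
    putbits(buf, &pos, CODE_SHIFT2, 5); putbits(buf, &pos, 0, 1);
    putbits(buf, &pos, 6 + 3, 5);             /* '!' in A2 */
    putbits(buf, &pos, CODE_END, 5);
    decode(buf);                              /* prints "Hi!" */
    putchar('\n');
    return 0;
}

One note on the relative-shift rule assumed here: because each shift moves one or two dictionaries forward from the locked one, two shift codes are enough to reach all three dictionaries even after a lock, so no third switch code is needed.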
Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community