VOGONS



OT and Obsolete: Better text compression?


First post, by Harry Potter

User metadata
Rank Oldbie

Hi! I'm sorry about posting this here, as this forum is way off the subject. I'm doing so because the only help I got elsewhere was to use adaptive Huffman coding, which doesn't suit my case. I'm working on a program called PrintTok2: a way to compress text for text adventures and similar programs on old 8- and 16-bit computers. PrintTok1 used tokenization to compress often-occurring snippets of text but didn't compress literals or offer automatic compression; PrintTok2 is supposed to solve both problems.

Right now, PrintTok2 has two versions. One uses a modified version of the Z-Machine's style and compresses literals to 5 bits per character. The Z-Machine uses three dictionaries to support three different sets of literals and allocates two of the 5-bit values to switching dictionaries. I add more literals to the alphabetic dictionary, namely more punctuation, so PrintTok2 won't have to switch dictionaries as often, at the cost of one extra bit for some less-often-used letters and the extra punctuation; there's also an extra bit per dictionary switch to determine whether it applies to just the current character or to several characters, plus tokens.

The other version is a naive approach and doesn't compress literals. It is to be divided into three variants: the first gives up to 90 one-byte tokens and 128 two-byte tokens; the second gives 128 one-byte and 128 two-byte tokens, at the cost that, on some systems, character assignments may need to be rerouted; and the third sacrifices 32 one-byte tokens from the second to provide BPE-like functionality. I also have ways to add at least 128 more two-byte tokens, and I'm planning a 7-bit approach that would be the middle ground between the other techniques.

Does anybody here have any ideas to better these? For example, adaptive Huffman coding wouldn't suit this, as it relies on previously decoded strings to aid decompression, and PrintTok2's targets have to print strings out individually.
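For reference, the Z-Machine's 5-bit scheme packs three Z-characters into each 16-bit word, with the high bit of the last word marking the end of the string. Here's a minimal sketch in C of that packing, using a single simplified alphabet ('a'..'z' only) rather than the Z-Machine's three alphabet tables:

```c
#include <stdint.h>
#include <string.h>

/* Pack lowercase text Z-Machine style: three 5-bit characters per
 * 16-bit word, high bit set on the final word. Codes: 6..31 map to
 * 'a'..'z', 5 is the pad (a simplification of the real Z-Machine
 * alphabet tables). Returns the number of 16-bit words written. */
static size_t pack_zchars(const char *text, uint16_t *out, size_t max_words)
{
    size_t n = strlen(text), w = 0;
    for (size_t i = 0; i < n && w < max_words; i += 3) {
        uint16_t word = 0;
        for (int j = 0; j < 3; j++) {
            uint16_t code = 5;                      /* pad */
            if (i + j < n)
                code = (uint16_t)(text[i + j] - 'a' + 6);
            word = (uint16_t)((word << 5) | code);
        }
        if (i + 3 >= n)
            word |= 0x8000;                         /* end-of-string bit */
        out[w++] = word;
    }
    return w;
}
```

With this packing, "hello" (5 bytes of ASCII) fits in two 16-bit words, i.e. 4 bytes.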

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 1 of 67, by BitWrangler

User metadata
Rank l33t++

If it's a horror game you can use RLE to get every "eeeeeeeeeeeeeeeeeeeeeeeeeeeeeek" and "woooooooooooooooooooooooo" down to 3 bytes.
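Tongue-in-cheek or not, the classic (character, run length) pair encoding really does collapse those screams. A bare-bones sketch:

```c
#include <stddef.h>

/* Naive RLE: emit (character, run length) byte pairs. Runs longer
 * than 255 are split across pairs. Returns the output length. */
static size_t rle_encode(const char *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i] && run < 255)
            run++;
        out[o++] = (unsigned char)in[i];
        out[o++] = (unsigned char)run;
        i += run;
    }
    return o;
}
```

"w" plus thirty "o"s comes out as two pairs, 4 bytes total; a scheme with a literal mode for unrepeated characters would get closer to the 3 bytes above.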

Unicorn herding operations are proceeding, but all the totes of hens teeth and barrels of rocking horse poop give them plenty of hiding spots.

Reply 2 of 67, by Tiido

User metadata
Rank l33t

I experimented with a dictionary-based approach, where the entire text got processed to find all the unique words, and these got put into a dictionary, with the text itself getting replaced by 12-bit-per-word symbols referencing that dictionary. I got very substantial savings from just that. Punctuation besides the space (automatically added after each word is output) got its own unique entries in the dictionary, along with some formatting-related things.

The next step would have been to apply compression to the dictionary itself, and perhaps to the text too, but I never bothered, as I already got about an 80% reduction on what the game was doing and that was good enough. Getting another 10% wouldn't have made much difference.
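The core of the approach above (unique-word dictionary plus per-word symbols) can be sketched in a few lines of C. This is my own illustrative reconstruction, not Tiido's program; punctuation and formatting handling are omitted:

```c
#include <string.h>

#define MAX_WORDS 256

/* Growing dictionary of unique words (brute-force lookup, which is
 * fine on modern hardware even for ~100KB of text). */
static char dict[MAX_WORDS][32];
static int dict_len = 0;

static int dict_lookup(const char *w)
{
    for (int i = 0; i < dict_len; i++)
        if (strcmp(dict[i], w) == 0)
            return i;
    strcpy(dict[dict_len], w);
    return dict_len++;
}

/* Split the text on spaces and replace each word with its dictionary
 * index. Returns the number of word symbols emitted. */
static int tokenize(const char *text, int *sym)
{
    char buf[512];
    int n = 0;
    strcpy(buf, text);
    for (char *w = strtok(buf, " "); w; w = strtok(NULL, " "))
        sym[n++] = dict_lookup(w);
    return n;
}
```

For "the cat and the dog and the cat" (31 bytes), this yields 8 symbols over a 4-word dictionary; at 12 bits per symbol the tokenized text is 12 bytes, before counting the dictionary itself.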

T-04YBSC, a new YMF71x based sound card & Official VOGONS thread about it
Newly made 4MB 60ns 30pin SIMMs ~
mida sa loed ? nagunii aru ei saa 😜

Reply 3 of 67, by Harry Potter

User metadata
Rank Oldbie

Well, I once helped somebody improve his rendition of RLE by suggesting that, instead of emitting the character and then the count, he use 16 possible character definitions to indicate that the previous character is repeated and how many times. He appreciated the help but didn't publish my info online. 🙁 PrintTok1 was to support RLE of spaces by using a tab character for four spaces and a control character plus a count for other repeated spaces, but I found that three- and eight-space tokens were better. This wouldn't fit in the 5-bit literals approach but would work well in the 7-bit approach, because I could use two character definitions to indicate such runs and then two extra bits for the length, giving a total of eight possibilities besides the four-space tab character. These would work for indents or tables. Any other ideas?
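The three- and eight-space token idea can be shown with a greedy encoder. This sketch assumes two hypothetical one-byte token codes (the actual PrintTok2 scheme uses two character definitions plus length bits, which is denser):

```c
#include <stddef.h>

/* Hypothetical token codes for runs of spaces; values are illustrative. */
enum { TOK_SP3 = 0x80, TOK_SP8 = 0x81 };

/* Greedily replace runs of spaces with eight-space then three-space
 * tokens; leftover spaces are emitted literally. Returns output length. */
static size_t encode_spaces(const char *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        if (in[i] != ' ') { out[o++] = (unsigned char)in[i++]; continue; }
        size_t run = 0;
        while (i + run < n && in[i + run] == ' ')
            run++;
        i += run;
        while (run >= 8) { out[o++] = TOK_SP8; run -= 8; }
        while (run >= 3) { out[o++] = TOK_SP3; run -= 3; }
        while (run--)     out[o++] = ' ';
    }
    return o;
}
```

An eight-space indent between two letters collapses from 10 bytes to 3.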


Reply 4 of 67, by Harry Potter

User metadata
Rank Oldbie

Tiido: Thank you! That should work! Maybe I can also apply it to repeated phrases or repeated groups of words, shorten recently occurring words, and use fewer bits per word where the full width isn't needed. Are you going to publish your method? You beat me to the punch, but even so, I really want the best compressor available. If you publish your technique, I ask you to implement my ideas, and if it's for profit, I ask for something. 😀


Reply 5 of 67, by Tiido

User metadata
Rank l33t

I'm 100% sure the idea I described has been used many times before. There is nothing ready for consumption by the general public as far as the program goes, i.e. there's no user interface, not even a command line; one adjusts variables, compiles, and runs the EXE, and if all went well there are output files that aren't garbage 🤣. It was never meant for anything other than internal use.

Program does these steps :
*Parse the text and find all the unique words, using a brute-force method, because on modern hardware parsing 100KB of text takes an instant regardless
*Store the unique words in a list
*Replace all the input text with entries into the list, while taking care of stuff like punctuation and formatting controls
*Show some statistics such as word count, punctuation count and sizes of all the gathered and generated data

In the end you have 2 files :
*Converted text with shorthand symbols in place of every word and punctuation mark, to be processed by the display function
*Dictionary with all the unique words, plus an address table that lets the display function reference each word
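The display side of those two files can be sketched as a packed dictionary blob plus an offset table. All names and data here are illustrative, not from the original program:

```c
#include <string.h>

/* Packed dictionary: words separated by NULs, plus an address table
 * of offsets into the blob (the second output file described above). */
static const char dict_blob[] = "the\0cat\0sat\0";
static const unsigned short dict_addr[] = { 0, 4, 8 };

static const char *word_at(int index)
{
    return dict_blob + dict_addr[index];
}

/* Expand a run of word symbols, appending a space after each word
 * automatically, as in the scheme above. */
static void print_symbols(const int *sym, int n, char *out)
{
    out[0] = '\0';
    for (int i = 0; i < n; i++) {
        strcat(out, word_at(sym[i]));
        strcat(out, " ");
    }
}
```

A ROM-based target can keep both the blob and the table in ROM and expand strings on the fly.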

Some ways to increase efficiency further would be to add stuff like automatic capitalisation and perhaps even some sort of sentence-structure controls for automatic punctuation too. That would let you get rid of capital letters, and as long as you don't need anything special, you can reduce the data to just 5 bits per character without numbers, or 6 bits with them. I don't know if the gains would be worthwhile, but if there's a lot of text it could make a significant difference.

One can also sort the words and employ some methods to reduce the address table size, and even compress all of this with additional methods like Huffman coding, but that assumes you have RAM to decompress stuff into. In my use case, things were destined for the ROM of a game cartridge.


Reply 6 of 67, by Harry Potter

User metadata
Rank Oldbie

Well... I tried your technique and got a negative compression ratio. 🙁 I haven't debugged it yet, though. BTW, I think my approach can work (if perhaps only barely) on an 8-bit computer. I will look at your other ideas. 😀


Reply 7 of 67, by Harry Potter

User metadata
Rank Oldbie

When I write the function that displays strings in the 5- and 7-bit variants, I can call the print function recursively to expand compressed tokens. In the 8-bit variants, I can still embed tokens within tokens. It's currently not recursive, because I'm programming for 6502-based machines, and the 6502 doesn't support a true parameter stack, so one would have to be emulated.
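Tokens-within-tokens expand naturally with recursion. A minimal C sketch (on the 6502 one would emulate the parameter stack instead, as noted above; the token table contents are made up for illustration):

```c
#include <string.h>

/* Bytes >= 0x80 index this table; entries may themselves contain
 * token bytes, so expansion recurses. Contents are illustrative. */
static const char *tokens[] = {
    "the ",          /* 0x80 */
    "\x80" "cat",    /* 0x81: expands to "the cat" */
};

/* Expand a token-compressed string into out (NUL-terminated). */
static void expand(const unsigned char *in, char *out)
{
    out[0] = '\0';
    for (; *in; in++) {
        if (*in >= 0x80) {
            char tmp[64];
            expand((const unsigned char *)tokens[*in - 0x80], tmp);
            strcat(out, tmp);
        } else {
            size_t len = strlen(out);
            out[len] = (char)*in;
            out[len + 1] = '\0';
        }
    }
}
```

So the 2-byte input `"\x81!"` expands to the 8-character string "the cat!".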


Reply 8 of 67, by Harry Potter

User metadata
Rank Oldbie

BTW, when I said "dictionary," I meant a dictionary of literals. I will try Google now and see if I can find more info on Z-Machine's compression technique.


Reply 9 of 67, by Harry Potter

User metadata
Rank Oldbie

I found it at https://inform-fiction.org/zmachine/standards … nt1/sect03.html. 😀


Reply 10 of 67, by Harry Potter

User metadata
Rank Oldbie

I goofed: I just read that the Z-Machine also uses tokenization. 😀 It calls the technique "abbreviations." My technique supports up to 256 tokens, though, plus other enhancements. I'll read more now.


Reply 11 of 67, by Harry Potter

User metadata
Rank Oldbie

BTW, some of the 8-bit variants have an option for BPE-like encoding. My test text is small, but it gained me 5 bytes. It couldn't work on the other variants. I also tried a 3-character LZ77-like compression, but, after sufficient debugging, I only gained 1 byte, and it wasn't worth the effort in decoding, as the fields would vary in length and handling that would cost a lot of effort.
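For anyone unfamiliar with BPE: each round counts adjacent byte pairs and replaces the most frequent pair with an unused byte code. A sketch of one round, not the poster's exact variant:

```c
#include <string.h>

/* One round of byte-pair encoding: count adjacent byte pairs in buf,
 * replace the most frequent pair with the unused byte `code`, and
 * record the pair in pair_out. Returns the new length (unchanged if
 * no pair occurs at least twice). */
static size_t bpe_round(unsigned char *buf, size_t n,
                        unsigned char code, unsigned char pair_out[2])
{
    static int counts[256][256];
    memset(counts, 0, sizeof counts);
    for (size_t i = 0; i + 1 < n; i++)
        counts[buf[i]][buf[i + 1]]++;

    int best = 0, a = 0, b = 0;
    for (int x = 0; x < 256; x++)
        for (int y = 0; y < 256; y++)
            if (counts[x][y] > best) {
                best = counts[x][y]; a = x; b = y;
            }
    if (best < 2)
        return n;                       /* nothing worth replacing */

    pair_out[0] = (unsigned char)a;
    pair_out[1] = (unsigned char)b;
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        if (i + 1 < n && buf[i] == a && buf[i + 1] == b) {
            buf[o++] = code;
            i += 2;
        } else {
            buf[o++] = buf[i++];
        }
    }
    return o;
}
```

The decoder only needs the (code, pair) table, which fits 8-bit targets well: expansion is a table lookup, no window or history required.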


Reply 12 of 67, by Harry Potter

User metadata
Rank Oldbie

BTW, if anybody's interested, I have previous versions of PrintTok1 online, some for the cc65 C compiler and some hosted on a Commodore 64 and 128. It currently doesn't support ANSI C, but I can provide such support. In case I didn't mention it: it doesn't compress literals or offer automatic compression; you have to do that yourself. It's probably worse than the Z-Machine and only decreased the size of a text adventure's database file by a little over 37%.


Reply 13 of 67, by Harry Potter

User metadata
Rank Oldbie

Maybe I can sacrifice an extra bit per token and add the ability to reference the last few repeated words or other tokens. This wouldn't help in the 8-bit encoding, as most tokens are already one byte, and such a reference would still be one byte. I can't implement it now, as my test text is too small to benefit from it.
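One way to account for that idea is a short history of recently emitted token IDs: on a hit, emit a flag bit plus a 2-bit offset instead of the full token. A cost-counting sketch under those assumptions (the history depth and bit widths are illustrative):

```c
#include <string.h>

#define HIST 4

static int history[HIST];
static int hist_len = 0;

/* Bits needed to emit token `tok`: 1 flag bit + 2 offset bits on a
 * history hit, or 1 flag bit + tok_bits on a miss; a miss pushes the
 * token into the history (oldest entry evicted). */
static int emit_cost(int tok, int tok_bits)
{
    for (int i = 0; i < hist_len; i++)
        if (history[i] == tok)
            return 1 + 2;
    if (hist_len < HIST)
        history[hist_len++] = tok;
    else {
        memmove(history, history + 1, (HIST - 1) * sizeof(int));
        history[HIST - 1] = tok;
    }
    return 1 + tok_bits;
}
```

As the post says, this only pays off when repeats are close together and the full token is wide; for one-byte tokens the reference saves nothing.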


Reply 14 of 67, by Harry Potter

User metadata
Rank Oldbie

I applied a form of static Huffman coding to my 5-bit approach, and it worked, but I was wrong about the way I was writing the numbers: the codes weren't satisfying the prefix-free property. 🙁 I want to go back to that in a few minutes. BTW, if anybody here wants to apply my techniques, you may. All I ask is that you describe which of my techniques you used in your program's documentation, state that they were from me, and tell me about your program and where to find it. I'd like a little money for it, though. 😀 I'm only mostly joking.
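That prefix-free property is easy to check directly: no code may be a prefix of another, or decoding becomes ambiguous. A small checker, with codes given as bit strings:

```c
#include <string.h>

/* Return 1 if no code in the set is a prefix of another (including
 * duplicates), else 0. Codes are "0"/"1" strings. */
static int is_prefix_free(const char *codes[], int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (i == j)
                continue;
            if (strncmp(codes[i], codes[j], strlen(codes[i])) == 0)
                return 0;   /* codes[i] is a prefix of codes[j] */
        }
    return 1;
}
```

For example, {"0", "10", "11"} is a valid prefix-free set, while {"0", "01", "11"} is not, since "0" is a prefix of "01".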


Reply 15 of 67, by Harry Potter

User metadata
Rank Oldbie

I found a bug in my rendition of your technique: I wasn't skipping the spaces after words. It works better now but only gave me a slight compression: a decrease of 1.3%. I'm continuing here. 😀


Reply 16 of 67, by Harry Potter

User metadata
Rank Oldbie

Another bug fix cost me a little ground again. I believe one of the problems with this technique is that the test text is small and poorly compressible. I'm still working. 😀


Reply 17 of 67, by Harry Potter

User metadata
Rank Oldbie

Adding a bit after each punctuation mark to determine whether a space follows seems to help. 😀 Adding a bit before a token to flag it as a repeat of a recent token, and shortening such repeats to just an offset to the previous occurrence, might also help.
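A flag bit like that needs a bit-granular output stream. A minimal MSB-first bit writer that could carry a punctuation code followed by a one-bit "space follows" flag (the 5-bit code value here is arbitrary):

```c
/* Minimal MSB-first bit writer. */
typedef struct {
    unsigned char buf[64];
    int bits;               /* total bits written */
} BitStream;

static void put_bits(BitStream *s, unsigned value, int nbits)
{
    for (int i = nbits - 1; i >= 0; i--) {
        if ((value >> i) & 1)
            s->buf[s->bits / 8] |= (unsigned char)(0x80 >> (s->bits % 8));
        s->bits++;
    }
}
```

Usage: `put_bits(&s, code, 5)` for the punctuation, then `put_bits(&s, space_follows, 1)` for the flag, 6 bits in total per punctuation mark.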


Reply 19 of 67, by Harry Potter

User metadata
Rank Oldbie

I was wrong. I took a second look at my program's output and found that most of the excess space was taken up by the tokens themselves. I compressed tokens that were all lower-case letters to 5 bits per character and the others to 7 bits. Now I'm at 127 bytes compressed and a 48.8% compression ratio. 😀 I'm also using fewer bits per word, as I currently only need 31 tokens plus the end-of-string marker. I tried to compress proper-case words using the 5-bit technique, but the function that checks for this doesn't seem to be executed. 🙁 I tried an extra bit after a period to determine whether the next word is capitalized, but the edit broke my code. I need to try it again soon. I also want to try this token compression on my other versions of PrintTok2.
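The per-token size choice above is just a predicate over the token text. A sketch of the cost function, assuming one mode bit per token to select between the two widths (the mode bit is my assumption, not necessarily the PrintTok2 layout):

```c
#include <ctype.h>
#include <string.h>

/* Bits needed to store one token: 5 bits per character if the token
 * is all lower-case letters, else 7 bits per character, plus one
 * assumed mode bit per token to select the width. */
static int token_bits(const char *tok)
{
    int all_lower = 1;
    for (const char *p = tok; *p; p++)
        if (!islower((unsigned char)*p))
            all_lower = 0;
    return 1 + (int)strlen(tok) * (all_lower ? 5 : 7);
}
```

So "hello" costs 26 bits where "Hello" costs 36; extending the 5-bit path to proper-case words (one capitalization flag plus 5-bit letters) would bring the latter down to about 27.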

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community