VOGONS


OT and Obsolete: Better text compression?

Topic actions

Reply 20 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I believe I was wrong again: I thought the code broke, as I was getting too much compressibility, but when I looked at the numbers again, they are possibly real. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 21 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I added the auto-caps strategy today, and it helped a lot. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 22 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

After a period, a bit will be used to auto-cap the next word if necessary.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 23 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

Uhh...it doesn't work. The numbers I was getting were not due to more efficient compression but rather not calculating the tokens. 🙁 I was still getting a count of token size, so maybe it's calculating token size twice, and that could be botching the results.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 24 of 67, by BloodyCactus

User metadata
Rank Oldbie
Rank
Oldbie

zmachines 5bit compression is terrible. I would not use it as a starting point. Level9's compression in their Acode or magnetic scrolls huffman was much better.

When i did RAI (retro adventure interpreter) for the c64/amiga/pc, I just used a compressed dictionary. a sentence was just a pointer to a word in a dictionary with 4 bits for length, with a max of 4096 unique longest words, but you only ever store the longest word, then there are not zero string terminators etc. works quite well, then you can encode those words and get even better compression.

--/\-[ Stu : Bloody Cactus :: [ https://bloodycactus.com :: http://kråketær.com ]-/\--

Reply 25 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

BloodyCactus: I thank you for your information. I'm in the process of stopping its support in favor of Toldo here's suggestions. I didn't realize it was poor.

I think my problem is an error in the calculations code. When it calculates the code, I get unrealistically bad numbers. When I skip the tally of token size, I still get the phantom numbers: the token size count is way too high. I'm using ANSI-compliant code, and the code follows:

void GetRatio (void)
{
unsigned uncomp=0, compstr=0, comptok=0, comptotal=0, compratio1, compratio1a, compratio2, compratio2a;
unsigned i, j;
curstring=strings; comptok=0;
for (i=0; i<tok2bufsize; i++) {
//comptok+=tokens2[i].len+2;
} while (curstring) {
uncomp+=curstring->inlen;
compstr+=curstring->pass1len;
curstring=curstring->next;
} comptotal=compstr+comptok;
compratio1=compstr*1000/uncomp; compratio1a=1000-compratio1;
compratio2=comptotal*1000/uncomp; compratio2a=1000-compratio2;
printf ("Uncompressed: %d\n", uncomp);
printf ("Compressed (strings): %d\n", compstr);
printf ("Compressed (total): %d\n", comptotal);
printf ("Ratio (%%Of): %d.%d\n", compratio2/10, compratio2%10);
printf ("Ratio (%%Less): %d.%d\n", compratio2a/10, compratio2a%10);
getchar();
}

The problem is that comptotal is way higher than compstr, even though comptok is now 0. It does this even when I don't add comptok to compstr in comptotal. Can anybody here spot what I'm doing wrong? Because I can't.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 26 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

Okay: I found my problem: I was either editing the wrong version's source file or compiling the wrong version. 🙁 The strings are compressing much better than on other versions, but the tokens consume a lot of space. I am compressing the tokens as I mentioned and am only getting slight compression. Is there another way I can compress or optimize individual tokens?

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 27 of 67, by BloodyCactus

User metadata
Rank Oldbie
Rank
Oldbie

you should always specify a type. unsigned x? (I know it means unsigned int by default, but, always always always specify int/char etc!) or better, use stdint.h and uint8_t or int16_t or whatever.

its also bad form to do "unsigned x,y,z,q,d,p,t,o;" declare each variable separately. makes reading code much easier. I'd also tell you not to put more than one statement on a line, again for readability. compratio1 isnt used, is that an oversight? if compstr or comptotal are > 65, those numbers are meaningless, * 1000 (if your default int is 32bit your probably ok but I have no idea based on your code, and if your not on an 8bit machine just use float or double instead of faffing with %10/10 etc.).

--/\-[ Stu : Bloody Cactus :: [ https://bloodycactus.com :: http://kråketær.com ]-/\--

Reply 28 of 67, by BloodyCactus

User metadata
Rank Oldbie
Rank
Oldbie
Harry Potter wrote on 2024-11-04, 14:39:

BloodyCactus: I thank you for your information. I'm in the process of stopping its support in favor of Toldo here's suggestions. I didn't realize it was poor.

aah i just read back, his reponse matches what I did in RAI like 15-20 years ago. same method, very common way to do it!. my compressor was written in ruby and adventure engine is in c for c64, pc dos, pc linux, amiga.

https://github.com/stu/rai/

--/\-[ Stu : Bloody Cactus :: [ https://bloodycactus.com :: http://kråketær.com ]-/\--

Reply 29 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

If you're interested, I have two text adventure codes for cc65 C. One is for low-end 8-bit computers, and the other hi-end ones. I believe they're very efficient and can support using virtually all of a C64 and 128's memory. I can post a link to them, or I can post a link to where you can find their info online without downloading. Their support for text compression is very weak, though, and requires manual compression. 🙁 Are you interested?

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 30 of 67, by BloodyCactus

User metadata
Rank Oldbie
Rank
Oldbie

not really, i don't update RAI on c64 anymore so no need for extra c64 work. my current adventure engine looks like legend entertainment style (eric the unready, spellcasting, time quest etc). and on modern machines no real need for compression to save a few kb. the data is a zip file and uses BZip2 compression. I mean just changing 1 png image and downscaling it saves more space than working on text compression.

what i found when doing all this text compression on c64, the decode routines and tokens and tradeoffs often killed a lot of saved stuff. cc65s was terrible at code space, so making a more sophisticated routine, the compiler blew out code space and other stuff. was not worth it in the end. A dictionary compressed a fixed huffman table like magnetic scrolls would be plenty of compression.

--/\-[ Stu : Bloody Cactus :: [ https://bloodycactus.com :: http://kråketær.com ]-/\--

Reply 31 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

You are right: cc65 is very inefficient. However, I've been able to make it much more efficient: I implemented a lot of strategies, and they include the implementation of CBMSimpleIO and the watering down of the crt0.s files. CBMSimpleIO is a very efficient replacement of cc65's standard and console I/O libraries that interfaces directly with the kernal and performs little extra processing. In fact, most of CBMSimpleIO's binaries that quickly test its full functionality, after removing unneeded functionality from the crt0.s files, such as the call to callmain and the library init/done calls, which are not needed to test CBMSimpleIO, cost <1k. I also have these programs and others and most of my optimization techniques at https://sourceforge.net/projects/cc65extra/files/. I am able to say that the mentioned codes are efficient because of these and the size of the empty binaries are very small: the one-file Vic20 version should be no more than ~2.1k--smaller if using my Cubbyhole optimization technique which stuffs a system's Low memory when not needed during the course of your program. Are you interested now?

I've been working on compressing the tokens and gained a lot of progress but am still doing poorly. I have options to shorten all-lower-case words to 5 bits per character, shorten proper-case words the same, proper-case when a lower-case version is available and make other characters not require a EOS character, as they are only one character wide. I also find that treating certain punctuation as letters helps. I'm using a method that limits the size of a token to just enough to hold all used token values. If you want, I can capture the program's output and post some of it here, along with the text that's being compressed, so you can see if I'm doing something wrong. BTW, I have ways to better my other versions of Printtok2, including compression of the 5-bit-style's tokens and fields of the naive approach as useful when, for example, shortening groups of lower-case letters and certain punctuation or numbers or after some fields, a bit to represent whether a space follows.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 32 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I have a version of PrintTok2 that is now doing extremely exceptional. 😀 Basically, it at first didn't compress literals. I'm now compressing blocks of lower-case letters, spaces and periods to five bis per character and other literals to two extra bits per literal but 7 bits for the literal. Now, the numbers are 249 uncompressed, 145 compressed and 41.8% compressibility. 😀 I will tweak it further.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 33 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I was wrong, and I'm sorry! The numbers I was getting were because I kept skipping bytes during compression. 🙁 BTW, Z-Machine's style seems to work for me. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 35 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

Well...I got it to do much better: I removed the extra bit per word compressed, added tokens, added more literals to the letters support, such as a period and comma at the cost of one bit per rarely-used letter of the extra values (but will decrease the need to switch dictionares) and add an extra bit to a dictionary-switch to determine if it's only for the current character or for the next several characters. If you want, you can apply these ideas. All I ask are that you describe them, mention they're from me and send me an e-mail telling me about the program and where to get it. I'm an egotist and want to be praised for my work. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 36 of 67, by BloodyCactus

User metadata
Rank Oldbie
Rank
Oldbie

If your happy with it, write yourself a new adventure game for the c64 and show it off.

--/\-[ Stu : Bloody Cactus :: [ https://bloodycactus.com :: http://kråketær.com ]-/\--

Reply 37 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I actually have a text adventure online. It's called Smir 3, 1 and is available for the C64, Plus4, C128 and I think the Atart XL. You can find it at https://sourceforge.net/projects/cc65extra/files/game/. Its text was compressed for PrintTok1, which is limited and, in the end, compressed its data file by only 37.9%. That's why I'm pursuing this. BTW, I'm trying to not tokenize words that only occur once in the text. It decreased the token size significantly but should've cost in the raw compression. I need to debug it. I believe I also found a possible bug with compressing tokens on my other techniques: I may be comparing the EOS marker in the token while compressing.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 38 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I was wrong: the strings size actually increased by 10 bytes. Right now, the numbers are 125 bytes compressed and 49.6% compressibility. 😀 I will continue tweaking it.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 39 of 67, by Harry Potter

User metadata
Rank Oldbie
Rank
Oldbie

I fixed a problem with it not skipping tokens that occur only once, and now, I'm down to 119 bytes and 52.1%. 😀

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community