VOGONS


Create better compression?


Reply 21 of 28, by wierd_w

Rank: Member

This brings back some memories of mine.

I remember coming up with a mathematical-tokenization type of compression stratagem that was rendered useless for real-world use cases by the existence of prime numbers, and being so frustrated that I quit. 😜
(I could universally get several bits' worth of savings on input data vs. output data, *EXCEPT* when encountering a prime number. [Treating every 2-byte chunk as a mathematical integer value.] Encoding prime numbers required *MORE* bits than the input stream did, and on real-world data the cumulative cost perfectly balanced out the 'savings'. I was attempting to use a 'least costly' factored representation for every chunk, but prime numbers only factor into themselves and 1, so they always 'cost more'.)

In terms of repetition-detection-based encodings, I found that splitting the data into even and odd chunk streams greatly increased detectable repetition, even in already-compressed data.
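A minimal sketch of one way to read that idea (byte-granularity split, with zlib standing in as the repetition detector; not necessarily the original scheme, and whether it pays off depends entirely on the data):

```python
import struct
import zlib

def split_even_odd(data: bytes):
    """Split a byte stream into its even-indexed and odd-indexed bytes."""
    return data[0::2], data[1::2]

def compare(data: bytes) -> None:
    """Print compressed size of the whole stream vs. the two split streams."""
    whole = len(zlib.compress(data, 9))
    even, odd = split_even_odd(data)
    split = len(zlib.compress(even, 9)) + len(zlib.compress(odd, 9))
    print(f"whole: {whole} bytes   even+odd: {split} bytes")

if __name__ == "__main__":
    # Illustrative input: 16-bit little-endian samples. Once the high bytes
    # are separated from the low bytes, each stream has a much simpler,
    # more repetitive pattern than the interleaved original.
    samples = b"".join(struct.pack("<H", 1000 + (i % 7)) for i in range(10000))
    compare(samples)
```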

Reply 22 of 28, by clb

Rank: Member

It is not completely impossible to create a new compression algorithm that works better (faster, or with smaller output) on some specific inputs you care about, especially if you take domain-specific biases (text, code, images) or retro-focused hardware instruction sets into account. But it is practically impossible to create a new generic compression algorithm that compresses tighter and faster than the state of the art on modern hardware. This is because thousands of people have taken a look at those algorithms and have collectively spent millions of man-hours on them.

However, none of the modern algorithms are focused on the ancient Commodore 64 or 8088/286 instruction sets, so it is likely that some gains can be found there when comparing against the compressors of the olde days, e.g. by applying modern theory to implement algorithms for old hardware.
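As one concrete (and hedged) example of instruction-set-aware preprocessing, here is a rough Python sketch of the classic E8/E9 call/jump filter used by executable packers and by 7-Zip's x86 BCJ filter: relative CALL/JMP displacements are rewritten as absolute targets, so repeated calls to the same routine become identical byte strings for the LZ stage to match. This is a simplified sketch (no alignment heuristics, rel32 only), not something specifically proposed in this thread:

```python
import struct

def e8e9_encode(code: bytes) -> bytes:
    """Rewrite rel32 displacements after E8 (CALL) / E9 (JMP) as absolute offsets."""
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] in (0xE8, 0xE9):
            rel = struct.unpack_from("<i", out, i + 1)[0]
            absolute = (i + 5 + rel) & 0xFFFFFFFF   # target as absolute file offset
            struct.pack_into("<I", out, i + 1, absolute)
            i += 5
        else:
            i += 1
    return bytes(out)

def e8e9_decode(code: bytes) -> bytes:
    """Inverse transform: turn the absolute offsets back into rel32 displacements."""
    out = bytearray(code)
    i = 0
    while i + 5 <= len(out):
        if out[i] in (0xE8, 0xE9):
            absolute = struct.unpack_from("<I", out, i + 1)[0]
            rel = (absolute - (i + 5)) & 0xFFFFFFFF
            struct.pack_into("<I", out, i + 1, rel)
            i += 5
        else:
            i += 1
    return bytes(out)
```

The transform itself saves nothing; it only makes the code more repetitive so that whatever LZ-style compressor follows it can find longer matches.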

Something to include in the benchmark is also the byte size of the decompressor program. Some decompressors have large dictionaries embedded in them, which increases their size but gives them an edge in compression ratio on small files.
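For completeness, a tiny sketch of that bookkeeping (the file names are hypothetical placeholders, not from this thread):

```python
import os

def effective_size(payload_path: str, stub_path: str) -> int:
    """Bytes the user actually ships: compressed payload + decompressor stub."""
    return os.path.getsize(payload_path) + os.path.getsize(stub_path)

# Usage (hypothetical file names):
#   effective_size("game.dat.pk", "unpack.com")
# A packer with a big built-in dictionary may win on raw payload size for a
# tiny file, yet lose once its larger stub is counted against it.
```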

Reply 23 of 28, by Harry Potter

Rank: Oldbie

Well, considering everything I'm doing to compress my files, the technique should compress much better than my competition, but it will also be much slower and larger.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 24 of 28, by Harry Potter

Rank: Oldbie

Yesterday, I made great strides with small files but only slight gains with large test files. Right now, I'm doing slightly worse on large files than before. 🙁 I need to gain more ground there.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community

Reply 25 of 28, by BitWrangler

Rank: l33t++

Of course, for maximum compression of a file, all you need to store are three double-precision numbers: the seed, and the start and stop coordinates. The seed is the value given to your array of monkeys with typewriters, otherwise known as a pseudorandom number generator, which chugs away producing a limitless string of random output; the other two numbers are the start and stop positions of your "Shakespeare" inside that random sequence. For a hundred-character sample text it would probably only take a month of 10 GPUs chugging away to find the seed value and position. So the compression part is a bit expensive, but man, think about how many complete movies you could get on a floppy disk!

Unicorn herding operations are proceeding, but all the totes of hens teeth and barrels of rocking horse poop give them plenty of hiding spots.

Reply 26 of 28, by jakethompson1

Rank: Oldbie
BitWrangler wrote on 2023-08-11, 14:10:

Of course, for maximum compression of a file, all you need to store are three double-precision numbers: the seed, and the start and stop coordinates. […]

I guess this is a joke, but there really is the concept of Kolmogorov complexity, where the complexity of a string is the length of the shortest possible program that can produce it. So something like the first million digits of pi would be essentially incompressible with our usual DEFLATE, but could be produced by a comparatively small program.
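To make that concrete, here is a rough sketch (my own example, not anything proposed in the thread): Gibbons' unbounded spigot algorithm produces digits of pi in a dozen lines, while DEFLATE cannot shrink those digits much below their raw per-digit entropy.

```python
import itertools
import zlib

def pi_digits():
    """Gibbons' unbounded spigot algorithm: yields decimal digits of pi."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4*q + r - t < n*t:
            yield n
            q, r, n = 10*q, 10*(r - n*t), (10*(3*q + r)) // t - 10*n
        else:
            q, r, t, k, n, l = (q*k, (2*q + r)*l, t*l, k + 1,
                                (q*(7*k + 2) + r*l) // (t*l), l + 2)

digits = "".join(str(d) for d in itertools.islice(pi_digits(), 2000))
packed = zlib.compress(digits.encode("ascii"), 9)
print(f"{len(digits)} ASCII digits compress to {len(packed)} bytes with DEFLATE")
# DEFLATE only squeezes the digits toward their per-digit entropy of roughly
# log2(10)/8 = 0.42 bytes each, since they look statistically random to it,
# while this entire generating program is a few hundred bytes of source.
```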

Reply 27 of 28, by BloodyCactus

Rank: Oldbie

I went through this making my executable packer. I spent ages rewriting packing routines: save 1 byte here, spend 1 byte somewhere else; compress program A better than all other packers, compress program B 10x worse than all other packers. That's just how it goes when you're looking for redundancy.

--/\-[ Stu : Bloody Cactus :: [ https://bloodycactus.com :: http://kråketær.com ]-/\--

Reply 28 of 28, by Harry Potter

Rank: Oldbie

Bad news! A bug fix just cost me 2.3%. 🙁 I'm still doing 3.5% better than 7Zip, but my goal is at least 5% better than 7Zip.

Joseph Rose, a.k.a. Harry Potter
Working magic in the computer community