Saturday, May 14, 2011

Long Range Failure

 The next natural step in evaluating MMC would be to test it with large dictionary sizes. Although this can be somewhat "emulated" by simply extending the dictionary size within LZ4HC, it only provides an evaluation of speed, not of compression gains, since the LZ4 format is fixed to a 64KB window.

I therefore created a derivative of Zhuff, using an "infinite length" format instead of the fixed 64KB offset one. The change proved relatively painless : with the 64KB offset restriction still in place, the output size remained almost unchanged. I then proceeded to increase the dictionary size.
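For illustration only, since Zhuff's actual bitstream is not described here : one common way to build an "unlimited offset" field is a variable-length integer, 7 payload bits per byte plus a continuation flag. The function below is just a sketch of that idea, not Zhuff's format.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch : a fixed 64KB window can store offsets in 2 bytes ;
   to lift that limit, one option is a variable-length integer
   (7 payload bits per byte, high bit set while more bytes follow). */
static size_t writeOffset(uint8_t* dst, uint32_t offset)
{
    size_t n = 0;
    while (offset >= 0x80) {
        dst[n++] = (uint8_t)(offset | 0x80);  /* 7 payload bits + continuation flag */
        offset >>= 7;
    }
    dst[n++] = (uint8_t)offset;               /* last byte : continuation flag cleared */
    return n;                                 /* 1 byte below 128, 3 bytes up to 2MB, etc. */
}
```

Small offsets keep costing one or two bytes, while nothing caps how far back a match may reach.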

This quickly resulted in a worse compression ratio.
I thought my implementation was wrong, but in fact this can be fairly well explained : longer ranges mean larger offsets, which require more bits to encode. As a consequence, if I keep the Minimum Match Length unmodified, a short match compresses less well than before, since more bits are needed to describe its larger offset.
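To put rough numbers on it (the figures below are purely illustrative, not Zhuff's real entropy costs) : with entropy-coded literals averaging around 6 bits each and an offset costing roughly log2(offset) bits, a 4-byte match stops paying for itself once the offset grows large enough.

```c
#include <stdio.h>

/* Illustrative cost model only, not Zhuff's actual coding costs :
   ~6 bits per entropy-coded literal, ~4 bits for the match length,
   and about log2(offset) bits for the offset itself. */
static unsigned bitsForOffset(unsigned offset)
{
    unsigned bits = 0;
    while (offset) { bits++; offset >>= 1; }
    return bits;
}

int main(void)
{
    const unsigned matchLength = 4;
    const unsigned offsets[] = { 1u<<10, 1u<<16, 1u<<24 };      /* 1KB, 64KB, 16MB */
    for (int i = 0; i < 3; i++) {
        unsigned literalBits = matchLength * 6;                 /* code the 4 bytes as literals */
        unsigned matchBits   = 4 + bitsForOffset(offsets[i]);   /* or as a short match */
        printf("offset %8u : match %2u bits vs literals %2u bits -> %s\n",
               offsets[i], matchBits, literalBits,
               (matchBits < literalBits) ? "match wins" : "match loses");
    }
    return 0;
}
```

With these made-up figures, the tipping point sits somewhere between a 64KB and a 16MB offset, which is exactly the effect observed above.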

The workaround to this problem might seem easy : never use a large offset for a small match length.
Yes, but what is the optimal threshold value ?
That's where problems begin.

It was quick to discover that optimal threshold values differ from one file to another. In one specific example, using enwik8 as the input file, the optimal minimum match length ended up being 5 characters, even for very small offsets. Most files don't go to such an extreme, but nonetheless the problem is identified : the optimal minimum match length at a given offset varies from one file to another.

A cheap way to achieve "some kind of result" would be to heuristically settle on a "median" value which is good enough for most files. This would already provide a fair improvement over no guard at all.
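As a sketch of what such a fixed guard could look like (the threshold values below are invented for illustration ; choosing them well is exactly the difficulty just described) :

```c
/* Hypothetical guard : reject a match when its length does not justify
   the bits needed to encode its offset. Threshold values are placeholders,
   not tuned figures. */
static unsigned minMatchForOffset(unsigned offset)
{
    if (offset < (1u<<16)) return 4;   /* short range : offsets stay cheap */
    if (offset < (1u<<20)) return 5;
    return 6;                          /* long range : demand longer matches */
}

static int matchIsWorthCoding(unsigned matchLength, unsigned offset)
{
    return matchLength >= minMatchForOffset(offset);
}
```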

Fine. But I'm somewhat disappointed with this. I wish I could find something more dynamic, automatically leaning towards a "more optimal" threshold for each file.
Indeed, a method to automatically decide whether a match is worth coding does exist. It's called "optimal parsing", along with its lesser cousin "flexible parsing".

"Optimal parsing" is a very costly operation, in which all positions are searched, their result stored, and a "least distance path" algorithm applied. Flexible parsing do the same, but to a lesser degree, using less distance storage, eventually skipping some positions too.
This is complex enough as is, but there is more to it : in order to calculate distance, we must apply a "cost" to each branch in the graph, which is only possible if we are able to encode the branch cost during evaluation. As long as we keep the coding "forward", this is almost okay for adaptive entropy coders. However, this is a no go for my "block based" implementations, which code the offset "after" LZ parsing.
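For reference, the heart of optimal parsing is that shortest-path computation over the positions of a block. Below is a heavily stripped-down sketch : one candidate match per position, an invented cost model, and it assumes branch costs are known up front, which is precisely what a block-based coder cannot provide.

```c
#include <stdint.h>

/* Toy cost model (illustrative values only). */
static uint32_t literalCost(void) { return 8; }
static uint32_t matchCost(uint32_t offset)
{
    uint32_t bits = 4;                 /* rough price of the length field */
    while (offset) { bits++; offset >>= 1; }
    return bits;
}

/* Least-cost path over a block :
   cost[i]   = cheapest way to encode bytes 0..i-1
   choice[i] = 1 for a literal step, or the match length used to reach i.
   bestLen[] / bestOff[] come from a match finder (MMC, for instance). */
void optimalParse(const uint32_t* bestLen, const uint32_t* bestOff,
                  uint32_t blockSize, uint32_t* cost, uint32_t* choice)
{
    cost[0] = 0;
    for (uint32_t i = 1; i <= blockSize; i++) cost[i] = UINT32_MAX;

    for (uint32_t i = 0; i < blockSize; i++) {
        /* option 1 : emit a literal */
        if (cost[i] + literalCost() < cost[i+1]) {
            cost[i+1] = cost[i] + literalCost();
            choice[i+1] = 1;
        }
        /* option 2 : emit the best match found at this position, if any */
        uint32_t len = bestLen[i];
        if (len >= 2 && i + len <= blockSize
            && cost[i] + matchCost(bestOff[i]) < cost[i + len]) {
            cost[i+len] = cost[i] + matchCost(bestOff[i]);
            choice[i+len] = len;
        }
    }
    /* The actual sequence of literals and matches is recovered by walking
       choice[] backward from position blockSize down to 0. */
}
```

A real optimal parser keeps several match candidates per position and prices each branch with the entropy coder's actual statistics ; that last part is the missing piece here.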

So that's where I stand today. It seems useless to proceed with larger dictionary sizes without at least some kind of advanced parsing. Naive "greedy parsing" no longer brings automatic benefits once offset costs grow.