Thursday, August 15, 2013

Inter-block compression

One of the new capabilities defined in the LZ4 Streaming format is inter-block compression. While the concept is simple to formulate (the next block shall be compressed using information from previous block(s)), its implementation has been non-existent so far.

Up to now, the LZ4 compression utility has been doing something relatively simple: cut the data to be processed into independent blocks, compress them one by one (or decode them one by one), and assemble the results into the destination file or stream.
It is simple and it works.
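For illustration, here is a minimal sketch of that independent-block approach. It uses the one-shot entry point from lz4.h (shown with the LZ4_compress_default() name from later LZ4 releases); the block framing of the real lz4c format is deliberately omitted, so this is only meant to show that each block is compressed with no knowledge of the previous ones.

#include <stddef.h>
#include "lz4.h"

/* Compress srcSize bytes block by block, each block independently.
 * Every block starts from an empty dictionary, which is precisely
 * what hurts the ratio when blockSize is small.
 * Returns the total compressed size, or 0 on error.
 * (Block headers and framing are omitted for clarity.) */
size_t compress_independent_blocks(const char* src, size_t srcSize,
                                   char* dst, size_t dstCapacity,
                                   int blockSize)
{
    size_t srcPos = 0, dstPos = 0;
    while (srcPos < srcSize) {
        int const chunk = (srcSize - srcPos < (size_t)blockSize)
                        ? (int)(srcSize - srcPos) : blockSize;
        int const cSize = LZ4_compress_default(src + srcPos, dst + dstPos,
                                                chunk, (int)(dstCapacity - dstPos));
        if (cSize <= 0) return 0;   /* not enough room, or error */
        srcPos += (size_t)chunk;
        dstPos += (size_t)cSize;
    }
    return dstPos;
}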

The problem: the smaller the block size, the worse the compression ratio.
This can be clearly seen in the following table:

File : Silesia.tar / Independent blocks
Block Size     LZ4     LZ4 HC
4 MB          47.97%   36.82%   
1 MB          48.03%   36.99%
256 KB        48.27%   37.68%
64 KB          (*)     40.41%

(*) Note : the 64KB version of LZ4 uses a dedicated algorithm which improves the compression ratio, and is therefore not directly comparable with the other block sizes. Nonetheless, for the record, the result is 48.34%.

As can be seen, the situation gets worse with each reduction in block size. That's because the first 64KB of each block starts with an empty dictionary, which hurts compression. LZ4 HC is, by the way, much more affected than Fast LZ4: thanks to its better parsing methodology, it extracts more benefit from earlier data, and therefore has the most to lose.
One can also note that, at 64KB block size, the drop in compression ratio is already quite noticeable. It can only get worse from this point towards smaller sizes.

One workaround is to simply not use small blocks. It works fairly well: lz4c uses the maximum block size (4 MB) by default for file compression.
But there are circumstances where this choice is not possible.

For example, let's suppose your application is sending packets in real time to a server. Because the application must react in real time with low latency, it cannot wait to accumulate 4 MB of data before compressing it and sending it to the server. Data must be sent at an application-defined moment, no later.

Such a packet of data is typically small, at most a few KB.
This is very small: if each packet is compressed on its own, the compression ratio is probably going to be poor.

Even more importantly, data packets tend to have a lot of elements in common, such as headers and identifiers. But since the repetition can only be noticed across multiple packets, it is lost if each packet is processed independently.

So the idea is to compress the next packet using information from previous packets.

It may seem simple, but this directly contradicts a core assumption of the current LZ4 algorithm, which states that each compression function starts from a clean state.

Well, no longer. The soon-to-be-released r102 of LZ4 features new compression functions to continue the compression process from where it previously stopped.
To make this possible, the compression context is now an explicit argument of the new functions. The context must be created manually, using a dedicated function, and then provided at each round to the new compression functions.
(Note: the usual compression functions are still there and work the same, so there is no modification for existing LZ4 users.)
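As a rough sketch of how such a "continue" style of API can be used for the packet scenario above: the function names below are taken from the streaming API of later LZ4 releases (LZ4_createStream, LZ4_compress_fast_continue, LZ4_freeStream), not necessarily the exact r102 entry points, and the packet layout is an assumption for illustration. The important points are that the context is explicit, and that previously compressed data must remain accessible in memory so the next packet can reference it.

#include <stddef.h>
#include "lz4.h"

/* Compress a sequence of small packets, each one using the previous
 * packets as dictionary. All packets are assumed to live contiguously
 * in `buffer`, so earlier data stays at its original address and the
 * stream state can keep referencing it. */
void compress_packets(const char* buffer, const int* packetSizes, int nbPackets,
                      char* dst, int dstCapacity)
{
    LZ4_stream_t* const stream = LZ4_createStream();   /* explicit compression context */
    const char* srcPtr = buffer;
    char* dstPtr = dst;
    if (stream == NULL) return;

    for (int i = 0; i < nbPackets; i++) {
        /* Each call continues from the state left by the previous one,
         * so matches can reference earlier packets. */
        int const cSize = LZ4_compress_fast_continue(stream, srcPtr, dstPtr,
                                                     packetSizes[i],
                                                     dstCapacity - (int)(dstPtr - dst), 1);
        if (cSize <= 0) break;      /* output buffer too small, or error */
        /* here, send the cSize bytes starting at dstPtr to the server */
        srcPtr += packetSizes[i];
        dstPtr += cSize;
    }
    LZ4_freeStream(stream);
}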

As a consequence, these new functions should improve the compression ratio. And that's indeed what can be measured:

File : Silesia.tar / Inter-block compression
Block Size     LZ4     LZ4 HC  (HC Independent blocks)
4 MB          47.95%   36.76%      (36.82%) 
1 MB          47.95%   36.76%      (36.99%)
256 KB        47.96%   36.77%      (37.68%)
64 KB         47.97%   36.78%      (40.41%)

There are a few lessons to draw from this experiment:

1) The compression ratio does improve a little. This was expected: with previous data available, the initial 64KB of each block can be compressed to its full potential.
One can notice that, once again, LZ4 HC benefits more from it. That's still for the same reason: better parsing, which makes better use of the available dictionary.

2) The compression ratio no longer degrades with smaller block sizes.
Or, to be fair, it does still worsen, but by a tiny amount compared to the previous situation.

This property is key: it opens the perspective of very small block sizes (such as network packets of a few KB) while keeping a good compression ratio, close to optimal conditions.

This was the target property of this exercise. It represents a significant step towards an LZ4 Streaming API.