Wednesday, February 5, 2014

FSE : Defining optimal subranges

Note : Charles Bloom started an independent in-depth analysis of ANS on his blog, which is definitely worth the read.


 As stated in earlier FSE comparison with Arithmetic Coding, FSE deals with fractional bits by "decreasing" its state value. Now how does that work precisely ?

Let's use again the example of a symbol which probability of appearance is 5/4096. The Shannon limit for this symbol is 9.68 bits per symbol (if we do better, then some other symbols will suffer, and the overall will compress less). As stated earlier, we deal with the fractional bit issue by outputting sometimes 9 bits, sometimes 10 bits.

The distribution of 9-bits & 10-bits sub-ranges must ensure that all possible values remain accessible after the current symbol. It's a condition for lossless compression. The result is the following layout :


OK, so why are the 9-bits subranges positioned at the beginning of the range ?
Well, in previous article, we said that an additional bit is read when the state is so much decreased that is goes below lower limit. In which case, it just "wraps around", starting again from the top value, at the cost of an additional bit read.



So it's not so much that the 9-bits are at the bottom. Rather, with above explanation, it is more intuitive to explain that the 10-bits are at the top.
Maybe it would be easier to notice with another example, a symbol with 7/4096 probability of appearance.



Note that these subranges are "destination ranges". Each of the 7 symbol_cells are mapped to one and only one of these ranges.

So which symbol is associated with each range ?

As stated earlier, since the additional bit is "paid for" when crossing the low limit, the most likely symbol to trigger the limit is lowest one.
So we give an index to each of the 7 symbol_cells, in their order of appearance, from the smaller to the larger one.
The largest 10-bits range is attributed to symbol_cells n°1.
The other ranges simply follow, in their natural order :


This is coherent with an optimal cost associated with a 7/4096 distribution : 9.19 bits.
It means that we are only consuming 0.19 fractional bits each time the symbol is decoded. So we can decode it about 5 times at 9-bits before requiring 10-bits.

This is in contrast with a probability of 5/4096 :



Here, the optimal cost is 9.68 bits, so fractional bits is much higher (0.68 bits). We are likely to cross the low limit much more often. Hence the first 3 symbol_cells trigger the additional bit read, and require 10 bits.

This rule is valid for any probability; so now, whether it is 5/4096, 134/4096, or even 3812/4096, you know how to define destination sub-ranges, and can build the associated decoding table.
(The latest example is interesting, since some sub-ranges will be 0-bits large, which means they map directly to another state value : 3812/4096 => 284 1-bit sub-ranges + 3528 0-bit sub-ranges).

What remains to be understood is how to distribute the symbol states. A naive approach would be to simply cluster them together. For example, a symbol with probability of 7/4096 would occupy state values from 0 to 6. It would work, but compression effectiveness would be very poor.
In another blog post, we'll see why, and how to correct that.



Monday, February 3, 2014

A comparison of Arithmetic Encoding with FSE

 Arithmetic encoding was invented in the 70's by Jorma Rissanen. After a few decades during which the technique almost stagnated, partly as a side-effect of abusive patent strategy, it started to take off in the early 2000's, and is now clearly established as the "current generation" entropy coding stage, thanks to its much better compression effectiveness than Huffman.

Since FSE claims to achieve equivalent compression performance but without the speed penalty, it's interesting to compare them, and see what makes them similar, or different.

The basic concept behind Arithmetic Encoding is the simple fact that shifting a bit is equivalent to a division by 2.

So now, it becomes clear that Huffman is just a special case of Arithmetic Encoding : since Huffman only shift bits, it's equivalent to divisions by power of 2.
By fully acknowledging this equivalence, it becomes obvious that it's also possible to divide by any other number, unlimited by the power of 2 requirement. This is how fractional bit is achieved. You can divide by 1.11, and you achieved the equivalent of a 0.1 bit cost.

The illustration above, taken from Wikipedia, is relatively clear, although it may be even simpler to look at an Huffman one. Let's use again this simple distribution :




Each symbol has an offset (or start position) and a range.
A : start 0, range 4 => factor 2
B : start 4, range 2 => factor 4
C : start 6, range 1 => factor 8
D : start 7, range 1 => factor 8

The "factor" is simply the total divided by the range.

In order to encode a symbol, arithmetic encoding always consider that it's selecting a subrange between 0 and 1, eventually rescaled to match a distribution table. In this case, the table has a size of 8, so we have to output a value between 0 and 7.
To encode symbol B, we restrict our new range between 4 and 6 (excluded). So we first settle with 4, because it is the start value of B.
Then we look at the divisor : we only have 2 positions remaining, so in order to find again the full range of possibility, we have to "upscale" this range by the factor. So we multiply by 4 (e.g. shift left by 2 bits), and find again a range of 8 values to encoded.
We can start again, but now, we have 2 bits which are already determined, occupying higher bit position ('10').

This example is very easy because we have only power of 2. But the exact same mechanism works with any distribution. You could even have a total which is a prime number, and any kind of distribution into it (such as 9,7,1,2 for a total of 19). It would work same : scale the value to 19 slots, add the start value, rescale by the factor, and so on.
Except that, you will have to face rounding errors. In a world of unlimited precision, this would not matter, but alas, that's not how computer works. You have to scale arithmetic precision to a bounded accuracy. It's not necessarily a big issue, since what truly matters is to do the same rounding error on both encoding and decoding (which is a significant problem in itself, especially in a cross-platform environment). Beside, it's equivalent to a small compression loss, and therefore it's important to limit that loss as much as possible.
A common way to do that is to "spread" the segment to a virtual size as large as possible. So, in previous example, we will not have a value between 0 and 18, but rather between 0 and 2^31-1 for example. We accept a small rounding error in doing it, but it's very acceptable.

Now that we have described encoding, let's look at how decoding works.

It's almost the same operation. To simplify decoding, we have a precomputed decoding table, of known size (8 in our first example). We read our first number, using virtual scale, and then scale it back to distribution table size, decode the symbol (B if it's a 4 or 5), get the new range, rescale, read more bits, and so on.

All this works well, but as can be guessed, cost quite a bit of multiplications and divisions in the process. With several clever tricks (typically ensuring that distribution total is a power of 2), it's possible to decrease these operations to a minimum, but alas, one division remains in the decoder, and several multiplications are also there.

Now let's have a look at ANS/FSE decoder.

We start with basically one identical concept : a bit read is equivalent to a division by 2.

A first difference though, is that a bit read is not bit shift.
In contrast with Huffman or Arithmetic, the "state" value is not read directly from the bitstream, it is a virtual value which is maintained alongside the bitstream.
That being said, it can be almost the same : a symbol with a 50% probability will basically divide state by 2, then discover than one bit must be added to keep state within range, and then shift it (equivalent to arithmetic's rescale), and read one bit from the bitStream.

Another difference is that a fractional bit is obtained by reading a variable number of bits. For example, if a symbol should be written using 3.3 bits (equivalent to 10%), the decoder will sometimes read 3 bits, and sometimes read 4 bits.
So we never "rescale by 10" as would do an arithmetic decoder, but rather sometimes rescale by 8, and sometimes by 16 (which, incidentally, is a better fit for binary representation, reducing rounding errors).

Okay, but how is this decided ?

Well, this is the magic of the decoding table, that will be described into another post, but to give you an idea of the bigger principle, here is how I would describe it in simple terms :
the fractional bit is translated by a "decrease" in state value.

When state value is high, state is just decreased, but it doesn't trigger the "additional bit". So, following our 10% example, it means we just read 3 more bits.



When state value is low, it tends to "cross" the lower limit. As a consequence, one more bit is read, and state is regenerated at its full value.



OK, that's roughly how the decoder behaves, from an external perspective.
But it's not enough to fully describe the exact behavior. You cannot expect to subtract a fixed amount from state in order to get your new state value. Unfortunately, it matters if the next state value is exactly 4 or 5, because it will influence some future symbol. So accuracy is more complex to describe. And that will be explained in a later post.