Thursday, July 13, 2017

Dealing with library version mismatch

 Note : this article was initially redacted as an answer to David Jud's comment, but it became long enough to be worth converting into a full blog entry.
In previous article, I attempted to introduce a few challenges related to designing an extensible API.

In this one, I'll cover an associated but more specific topic, on how to handle a library version mismatch.

Version mismatch is a more acute problem in a DLL scenario. In a static linking scenario, the programmer has several advantages :
  • Compiler will catch errors (if a type, or a prototype, has changed for example). This gives time to fix these errors. Of course, the application maintainer will prefer that a library update doesn't introduce any change in existing code, but worst case is, most errors should be trapped before shipping the product.
  • Compiler will automatically adapt ABI changes : if an updated type is larger/smaller than previous version, it will be automatically converted throughout the code base. Same thing happens in case of enum changes : adaptation to new enum values is automatically applied by compiler.
  • Library is available during compilation, which means programmer has a local copy that he can update (or not) according to its requirements.

Well, this last property is not always true : in larger organisations, library might belong to a "validated" pool, which cannot be easily adapted for a specific project. In which case, the user program will either have to host its own local copy, or adapt to the one selected by its organisation.

But you get the idea : problematic version mismatches are likely to be trapped or automatically fixed by the compiler, and therefore should be corrected before shipping a binary. Of course, the less changes, the better. Program maintainers will appreciate a library update as transparent as possible.

For a dynamic library though, the topic is a lot harder.
To begin with, user program typically does not have direct control over the library version deployed on target system. So it could be anything. The library could be more recent, or older, than expected during program development.

Now these two types of mismatches are quite different, and trigger different solutions :

Case 1 - library version is higher than expected

This one can, and should, be solved by the library itself.

It's relatively "easy" : never stop supporting older entry points.
This is of course easier said than done, but to be fair, it's first and foremost a question of respecting a policy, and therefore is not out of reach.
Zstandard tries to achieve that by guaranteeing that any prototype reaching "stable" status will be there "forever". For example, ZSTD_getDecompressedSize(), which has been recently superceded by ZSTD_getFrameContentSize(), will nonetheless remain an accessible entry point in future releases, because it's labelled "stable".

A more subtle applicable problem is ABI preservation, in particular structure size and alignment.
Suppose, for example, that version v1.1 defines a structure of size 40 bytes.
But v1.2 add some new capabilities, and now structure has a size of 64 bytes.
All previous fields from v1.1 are still there, at their expected place, but there are now more fields.

The user program, expecting v1.1, would allocate the 40-bytes version, and pass that as an argument to a function expecting a 64-bytes object. You can guess what will follow.

This could be "manually" worked around by inserting a "version" field and dynamically interpreting the object with the appropriate parser. But such manipulation is a recipe for complexity and errors.
That's why structures are pretty dangerous. For best safety, structure definition must remain identical "forever", like the approved "stable" prototypes.

In order to avoid such issue, it's recommended to use incomplete types. This will force the creation of underlying structure through a process entirely controlled by current library, whatever its version, thus avoiding any kind of size/alignment mismatch.

When above conditions are correctly met, the library is "safe" to use by applications expecting an older version : all entry points are still there, behaving as expected.

Whenever this condition cannot be respected anymore, an accepted work-around is to increase the Major digit of the version, indicating a breaking change.

Case 2 - library version is lower than expected

This one is more problematic.
Basically, responsibility is now on the application side. It's up to the application to detect the mismatch and act accordingly.

In David Jud's comment, he describes a pretty simple solution : if the library is not at the expected version, the application just stops there.
Indeed, that's one way to safely handle the matter.

But it's not always desirable. An application can have multiple library dependencies, and not all of them might be critical.
For example, maybe the user program access several libraries offering similar services (encryption for example). If one of them is not at the expected version, and cannot be made to work, it's not always a good reason to terminate the program : maybe there are already plenty of capabilities available without this specific library, and the program can run, just with less options.

Even inside a critical library dependency, some new functionality might be optional, or there might be several ways to get one job done.
Dealing with this case requires writing some "version dependent" code.
This is not an uncommon situation by the way. Gracefully handling potential version mismatches is one thing highly deployed programs tend to do well.

Here is how it can be made to work : presuming the user application wants access to a prototype which is only available in version v1.5+, it first tests the version number. If condition matches, then program can invoke target prototype as expected. If not, a backup scenario is triggered, be it an error, or a different way to get the same job done.

Problem is, this test must be done statically.
For example, in Zstandard, it's possible to ask for library version at runtime, using ZSTD_versionNumber(). But unfortunately, it's already too late.
Any invocation of a new function, such as ZSTD_getFrameContentSize() which only exists since v1.3.0, will trigger an error at link time, even if the invocation itself is protected by a prior check with ZSTD_versionNumber() .

What's required is to selectively remove any reference to such prototype from compilation and linking stages, which means this code cannot exist. It can be excluded through pre-processor.
So the correct method is to use a macro definition, in this case, ZSTD_VERSION_NUMBER

Example :
size = ZSTD_getFrameContentSize(src, srcSize);
size = ZSTD_getDecompressedSize(src, srcSize);
/* here, 0-size answer can be mistaken for "error", so add some code to mitigate the risk */

That works, but requires to compile binary with the correct version of zstd.h header file.
When the program is compiled on target system, it's a reasonable expectation : if libzstd is present, zstd.h is also supposed to be accessible. And it's reasonable to expect them to be synchronised. There can be some corner case scenarios where this does not work, but let's say that in general, it's acceptable.

The detection can also be done through a ./configure script, in order to avoid an #include error during compilation, should the relevant header.h be not even present on target system, as sometimes the library is effectively optional to the program.

But compilation from server side is a different topic. While this is highly perilous to pre-compile a binary using dynamic libraries and then deploy it, this is nonetheless the method selected by many repositories, such as aptitude, in order to save deployment time. In which case, the binary is compiled for "system-provided libraries", which minimum version is known, and repository can solve dependencies. Hence, by construction, the case "library has a lower version than expected" is not supposed to happen. Case closed.

So, as we can see, the situation is solved either by local compilation and clever usage of preprocessing statements, or by dependency solving through repository rules.

Another possibility exists, and is, basically, the one proposed in ZSTD_CCtx_setParameter() API : the parameter to set is selected through an enum value, and if it doesn't exist, because the local library version is too old to support it, the return value signals an error.

Using safely this API feels a lot like the previous example, except that now, it becomes possible to check library version at runtime :

if (ZSTD_versionNumber() >= 10500) {
   return ZSTD_CCtx_setParameter(cctx, ZSTD_p_someNewParam, value);
} else {
   return ERROR(capability_not_present);

This time, there is no need to be in sync with the correct header.h version. As the version number comes directly at runtime from the library itself, it's necessarily correct.

Note however that ZSTD_CCtx_setParameter() only covers the topic of "new parameters". It cannot cover the topic of "new prototypes", which still requires using exclusion through macro detection.

So, which approach is best ?

Well, that's the good question to ask. There's a reason the new advanced API is currently in "experimental" mode : it needs to be field tested, to experience its strengths and weaknesses. There are pros and cons to both methods.
And now, the matter is to select the better one...

Thursday, July 6, 2017

The art of designing advanced API

 A library API (Application Programming Interface) is even more important than its implementation.

There are many reasons for this statement :
- An API exposes a suitable abstraction. Should it prove broken, unclear or just too complex, the library will be misused, which will ultimately be paid by users' time (multiplied by nb of users).
- An API is a contract. Break it, and existing applications can no longer work with your library. Adaptation cost is once again paid by users' time (if it ever happens).
- Because of this, API tend to stick around for a long time, much longer than underlying implementation.

If an implementation is modified to provide, say, a 5% speed improvement, it's all free, every user can immediately benefit about it without further hassle. But if one has to add a single parameter, it's havoc time.

Because it's so stressful to modify an API, one can be tempted to look very hard once, in order to design and expose a perfect API, one that will stand the test of time, and will never need to be changed. This search is (mostly) a delusion.
- perfect API, embedding such a strong restriction to never change in the future, can take forever to build, all under intense stress, as there is always a question mark hanging around : "is there a use case that it does not cover ?". Eventually, it's only a matter of time before you (or your users) find one.
- perfect API lean towards "complex API", as the requirement to support everything makes it add more parameters and control, becoming more and more difficult to manage by users.
- "complex" quickly leads to "misleading", as supporting some "future scenarios" for which there is no current implementation, and maybe no current need, will be categorised bad ideas after all, but side-effects of this useless abstraction will remain in the API.

So, the next great idea is to plan for API changes.
The way Zstandard library tries to achieve this is by quickly converging towards some very simple prototypes, which offer "core" functionalities at a low complexity level.
Then, more complex use cases, not covered by simple API, do show up, and the need to serve them introduce the creation of an "experimental section", a place where it's still possible to play with API, trying to find an optimal abstraction for intended use case, before moving into "stable" status (aka: this method will no longer change).

A consequence of this strategy is the creation of more and more prototypes, dedicated to serving their own use case.
Need to compress with dictionary ? Sure, here comes ZSTD_compress_usingDict() .
Need to process data in a streaming fashion ? Please find  ZSTD_compressStream() .
In a streaming fashion with a dictionary ? ZSTD_compressStream_usingDict() .
Need control over specific parameters ? Go to _advanced() variants.
Preprocess dictionary for faster loading time ? Here are _usingCDict() variants.
Some multithreading maybe ? ZSTDMT_*()
Combine all this ? Of course. Here is a list of a gazillion methods.

As one can see, this doesn't scale too well. It used to be "good enough" for a dozen or so methods, but as combinatorial complexity explodes, it's no longer suitable.

In latest release of Zstandard, we try to get a fresh look to this situation, and provide an API simpler to manage. The result of which is the new advanced API candidate, which actually stands a chance to become stable one day.

It features 2 core components : 
ZSTD_compress_generic() is the new main compression method. It's designed to support all other compression methods. It can do block, streaming, dictionary, multithreading, and any combination of those. We have plan for even more extensions, and they all seem to fit in.

This is possible because now sending parameters is a separate operation, which can be completed in as many steps as necessary.
The main vehicle to set these parameters is ZSTD_CCtx_setParameter() .
It uses an enum based policy, where the parameter is selected in an enum list, and new value is provided as an unsigned type.
This design has been favoured over previous one, which was using a struct to pass all parameters in a single step. The struct was inconvenient as it forced user to select a value for each and every parameter, even out-of-scope ones, in order to change just one of them. Perhaps even more importantly, the struct is problematic for future changes : adding any new parameter would change the struct size, which is an ABI break. It can quickly get ugly when the program and library work on common memory areas using different sizes.
The enum policy allow us to add any new parameter while preserving API and ABI, so it looks very flexible.

However, it comes with its own set of troubles.
To begin with, enum values can very easily change : just add a new enum in the list, and see all enum values after that one slide by one.
It can be a problem if, in a version of the library, ZSTD_p_compressionLevel is attributed a 2, but in a future version, it becomes a 3. In a dynamic library scenario, where version mismatch can easily happen, it means the caller is changing some other random parameter.
To counter that, it will be necessary to pin down all enum values to a manually assigned one. This will guarantee the attribution.

Another issue is that the value of the parameter is provided as an unsigned type, so the parameter must fit this type. That's not always possible.
For example, there is a dedicated method to set pledgedSrcSize, which is essentially a promise about how much data is going to be processed. This amount can be very large, so an unsigned type is not enough. Instead, we need an unsigned long long, hence a dedicated method.
Another even more obvious one happens when referencing a prepared dictionary in read-only mode : this parameter is a const ZSTD_CDict* type, so it is  set through a dedicated method, ZSTD_CCtx_refCDict().
And we have a few other exceptions using their own method, as the argument cannot fit into an unsigned.

But the large majority of them uses ZSTD_CCtx_setParameter().
In some cases, the adaptation works though it's not "best".
For example, a few parameters are selected among a list of enums, for example ZSTD_strategy . The enum is simply casted to an unsigned and passed as argument. It works. But it would have been even nicer to keep the argument type as the intended enum, giving the compiler a chance to catch potential type mismatch (example).

So this design could be in competition with another one : define one method per parameter. The most important benefit would be that each parameter can have its own type.
But this alternate design has also its own flaws :
adding any new parameter means adding a method. Therefore, if a program uses a "recent" method, but links against an older library version, this is a link error.
In contrast, the enum policy would just generate an error in the return code, which can be identified and gracefully dealt with.

Creating future-proof API is hard. There is always a new unexpected use case which shows up and would require another entry point or another parameter. The best we can do is plan for those changes.
The new Zstandard's advanced API tries to do that. But since it is a first attempt, it likely is perfectible. 
This is design time, and it will cost a few revisions before reaching "stable" status. As always, user feedback will be key to decide if the new design fits the bill, and how to amend it.

Follow up : Dealing with library version mismatch

Edit :
Arseny Kapoulkine made an interesting comment, arguing that specialized entry points make it possible for compiler's DCE (Dead Code Elimination) optimisation to kick in, removing useless code from the final binary :

In general this is true. Calling
is clear for the linker,
then it's possible to remove any potentially unused specialized_function2() from binary generation.

In contrast calling
generic_function(mode=1, ...)
void generic_function(int mode, ...)
   switch(mode) {
      case 1 : return specialized_function1(...);
      case 2 : return specialized_function2(...);

makes it much more difficult. In general, for the second case, specialized_function2() will remain in the binary.
(an exception being usage of inline, associated with -flto, and non-ambiguous selection parameter, but that's no longer a "simple" setup).

For Zstandard though, it doesn't make a lot difference.
The reason is, whatever "specialized" entry point is invoked, (ZSTD_compress(), or ZSTD_compress_usingDict() for example), it's just an entry point. The compression code is not "immediately behind", it's reached after several indirection levels. This design make it possible for a single compression code to address multiple usage scenarios with vastly different set of parameters, which is vital for maintenance. But disentagling that for DCE is a lot more difficult.
If required, -flto makes it possible to optimize size better, and some difference become visible, but remain relatively small.

Tuesday, July 5, 2016

Specification of Zstandard compression format

 With the compression format stabilized in v0.7.x serie, Zstandard gets now a first version of its formal specification :
If you ever wanted to know how the algorithm works, and / or wanted to create your own version in any language of your choice, this is the place to start.

It is a first version though, with usual caveats : expect it to be perfectible and require a few rounds, feedbacks and modifications, before reaching a stage of being unambiguous and clear.
This is an opened public consultation phase, every feedback is welcomed.
It's also the very last chance to review the different choices that made it into the format, introducing questions and possibly suggesting improvements or simplifications.
I don't expect "big changes", but maybe a collection of very minor things, which could, collectively, be worth considering a last polishing touch before pushing to v1.0.

Edit : Indeed, there will be a polishing stage...
Writing the specification made it possible to grab a complete view of the multiple choices which made it into the format. Retrospectively, some of these choices are similar yet slightly different. For example, encoding types exist for all symbols, but are not numbered in the same way. Most fields are little-endian, but some are big-endian, some corner cases optimizations are so rare they are not worth their complexity, etc.
Therefore, in an effort to properly unify every minor detail of the specification and bring a few simplifications, a last modification round will be performed. It will be released as 0.8. No major change to expect, only a collection of minor ones. But a change is a change, so it's nonetheless a new format.
As usual, 0.8 will be released with a "legacy mode", allowing reading data already compressed with 0.7.x series and before.
Unlike usual though, we plan to release a "v0.7 transition" version, able to read data created with v0.8, in order to smooth transition in live systems which depend on listeners / producers, and need to ensure all listeners are able to read data sent to them before upgrading to 0.8.

Edit 2 :
v0.8.0 and "transition v0.7.5" have been released

Friday, June 17, 2016

Zstandard reaches Candidate Compression Format

 Finally. That was a pretty long journey.
With the release of v0.7, Zstandard has reached an important milestone where the compression format is stable and complete enough to pretend becoming v1.0.
We don't call it v1.0 yet, because it's safer to spend some time on a "confirmation period" during which the final compression format is field-tested. It shall confirm its ability to match its objectives, dealing with all situations it is planned for.
Then it will be rebranded v1.0.

With the source code out, it's also time to think about other supportive actions, such as documentation. The next priority task is to redact a specification of the compression format, so that it can be better exposed, understood and implemented by third parties. The goal is that any third party should be able to create its own version. However, describing algorithm in a way which is clear and concise is not trivial. It's expected that some paragraphs will need re-write in an effort to become clearer and more complete, reducing sources of confusion. So this effort will benefit from user exposure and feedback

It's also time to have some more involved discussions around the API.
The current "stable API" is expected to remain, but its scope is also limited, providing mostly the "basics", such as compressing a buffer into another buffer. More complex usages are possible (streaming in memory-constrained environment using a custom allocator for example), but need to access advanced prototypes, exposed in the "experimental" section.
Now will be a good time to seriously consider extending the scope of "stable API".

Just "promoting" a prototype from "experimental" to "stable" is not necessarily the better way to go. It's important that the extended API remain simple enough to understand and use (which is not the main priority of "experimental" section).
After being immersed in the code for so long, some technical complexities can become invisible, while becoming real obstacles to newcomers. Therefore, it's important to "think" extended API properly, to create interfaces easy to use.
For this objective, the key is to listen 3rd parties, in order to better fit natural expectations.

Friday, May 13, 2016

Finalizing a compression format

With Zstandard v1.0 looming ahead, the last major item for zstd to settle is an extended set of features for its frame encapsulation layer.

Quick overview of the design : data compressed by zstd is cut into blocks. A compressed block has a maximum content size (128 KB), so obviously if input data is larger than this, it will have to occupy multiple blocks.
The frame layer organize these blocks into a single content. It also provides to the decoder a set of properties that the encoder pledges to respect. These properties allow a decoder to prepare required resources, such as allocating enough memory.

The current frame layer only stores 1 identifier and 2 parameters  :
  • frame Id : It simply tells what are the expected frame and compression formats for follow. This is currently use to automatically detect legacy formats (v0.5.x, v0.4.x, etc.) and select the right decoder for them. It occupies the first 4 bytes of a frame.
  • windowLog : This is the maximum search distance that will be used by the encoder. It is also the maximum block size, when (1<<windowLog) < MaxBlockSize (== 128 KB). This is enough for a decoder to guarantee successful decoding operation using a limited buffer budget, whatever the real content size is (endless streaming included).
  • contentSize : This is the amount of data to decode within this frame. This information is optional. It can be used to allocate the exact amount of memory for the object to decode.

These information may seem redundant.
Indeed, for a few situations, they are : when contentSize  < (1<<windowLog). In which case, it's enough to allocated contentSize bytes for decoding, and windowLog is just redundant.
But for all other situations, windowLog is useful : either contentSize is unknown (it wasn't known at the beginning of the frame and was only discovered on frame termination), or windowLog defines a smaller memory budget than contentSize, in which case, it can be used to limit memory budget.

That's all there is for v0.6.x. Arguably, that's a pretty small list.

The intention is to create a more feature complete frame format for v1.0.
Here is a list of features considered, in priority order :
  • Content Checksum : objective is to validate that decoded content is correct.
  • Dictionary ID : objective is to confirm or detect dictionary mismatch, for files which require a dictionary for correct decompression. Without it, a wrong dictionary could be picked, resulting in silent corruption (or an error).
  • Custom content, aka skippable frames : the objective is to allow users to embed custom elements (comments, indexes, etc.) within a file consisting of multiple concatenated frames.
  • Custom window sizes, including non power of 2 : extend current windowLog scheme, to allow more precise choices.
  • Header checksum : validate that checksum informations are not accidentally distorted.
Each of these bullet points introduce its own set of questions, that is detailed below :

Content checksum
The goal of this field is obvious : validate that decoded content is correct. But there are many little details to select.

Content checksum only protects against accidental errors (transmission, storage, bugs, etc). It's not an electronic "signature".

1) Should it be enabled or disabled by default (field == 0) ?

Suggestion : disabled by default
Reasoning : There are already a lot of checksum around, in storage, in transmission, etc. Consequently, errors are now pretty rare, and when they happen, they tend to be "large" rather than sparse. Also, zstd is likely to detect errors just by parsing the compressed input anyway.

2) Which algorithm ? Should it be selectable ?

Suggestion : xxh64, additional header bit reserved in case of additional checksum, but just a single one defined in v1.
Reasoning : we have transitioned to a 64-bits world. 64-bits checksum are faster to generate than 32-bits ones on such systems. So let's use the faster ones.
xxh64 also has excellent distribution properties, and is highly portable (no dependency on hardware capability). It can be run in 32-bits mode if need be.

3) How many bits for the checksum ?

Current format defines the "frame end mark" as a 3-bytes field, the same size as a block header, which is no accident : it makes parsing easier. This field has a 2-bits header, hence 22 bits free, which can be used for a content checksum. This wouldn't increase the frame size.

22-bits means there is a 1 in 4 millions chances of collision in case of error. Or said differently, there are 4194303 chances out of 4194304 to detect a decoding error (on top of all the syntax verification which are inherent to the format itself). That's more than > 99.9999 %. Good enough in my view.

Dictionary ID

Data compressed using a dictionary needs the exact same one to be regenerated. But no control is done on the dictionary itself. In case of wrong dictionary selection, it can result in a data corruption scenario.

The corruption is likely to be detected by parsing the compressed format (or thanks to the previously described optional content checksum field).
But an even better outcome would be detect such mismatch immediately, before starting decompression, and with a clearer error message/id than "corruption", which is too generic.

For that, it would be enough to embed a "Dictionary ID" into the frame.
The Dictionary ID would simply be a random value stored inside the dictionary (or an assigned one, provided the user as a way to control that he doesn't re-use the same value multiple times). A comparison between the ID in the frame and the ID in the dictionary will be enough to detect the mismatch.

A simple question is : how long should be this ID ? 1, 2, 4 bytes ?
In my view, 4 bytes is enough for a random-based ID, since it makes the probability of collision very low. But that's still 4 more bytes to fit into the frame header. In some ways it can be considered an efficiency issue.
Maybe some people will prefer 2 bytes ? or maybe even 1 byte (notably for manually assigned ID values) ? or maybe even 0 bytes ?

It's unclear, and I guess multiple scenarios will have different answers.
So maybe a good solution would be to support all 4 possibilities in the format, and default to 4-bytes ID when using dictionary compression.

Note that if saving headers is important for your scenario, it's also possible to use frame-less block format ( ZSTD_compressBlock(), ZSTD_decompressBlock() ), which will remove any frame header, saving 12+ bytes in the process. It looks like a small saving, but when the corpus consists of lot of small messages of ~50 bytes each, it makes quite a difference. The application will have to save metadata on its own (what's the correct dictionary, compression size, decompressed size, etc.).

Custom content

Embedding custom content can be useful for a lot of unforeseen applications.
For example, it could contain a custom index into compressed content, or a file descriptor, or just some user comment.

The only thing that a standard decoder can do is skip this section. Dealing with its content is within application-specific realm.

The lz4 frame format already defines such container, as skippable frames. It looks good enough, so let's re-use the same definition.

Custom window sizes

The current frame format allows defining window sizes from 4 KB to 128 MB, all intermediate sizes being strict power of 2 (8 KB, 16 KB, etc.). It works fine, but maybe some user would find its granularity or limits insufficient.
There are 2 parts to consider :

- Allowing larger sizes : the current implementation will have troubles handling window sizes > 256 MB. That being said, it's an implementation issue, not a format issue. An improved version could likely work with larger sizes (at the cost of some complexity).
From a frame format perspective, allowing larger sizes can be as easy as keeping a reserved bit for later.

- Non-power of 2 sizes : Good news is, the internals within zstd are not tied to a specific power of 2, so the problem is limited to sending more precise window sizes. This requires more header bits.
Maybe an unsigned 32-bits value would be good enough for such use.
Note that it doesn't make sense to specify a larger window size than content size. Such case should be automatically avoided by the encoder. As to the decoder, it's unclear how it should react : stop and issue an error ? proceed with allocating the larger window size ? or use the smaller content size, and issue an error if the content ends up larger than that ?
Anyway, in many cases, what the user is likely to want is simply enough size for the frame content. In which case, a simple "refer to frame content size" is probably the better solution, with no additional field needed.

Header Checksum

The intention is to catch errors in the frame header before they translate into larger problems for the decoder. Note that only errors can be caught this way : intentional data tampering can simply rebuild the checksum, hence remain undetected.

Suggestion : this is not necessary.

While transmission errors used to be more common a few decades ago, they are much less of threat today, or they tend to garbage some large sections (not just a few bits).
An erroneous header can nonetheless be detected just by parsing it, considering the number of reserved bits and forbidden value. They must all be validated.
The nail in the coffin is that we do no longer trust headers, as they can be abused by remote attackers to deliver an exploit. And that's an area where the header checksum is simply useless. Every field must be validated, and all accepted values must have controllable effects (for example, if the attacker intentionally requests a lot of memory, the decoder shall put a high limit to the accepted amount, and check the allocation result).
So we already are highly protected against errors, by design, because we must be protected against intentional attacks.

Future features : forward and bakward compatibility

It's also important to design from day 1 a header format able to safely accommodate future features, with regards to version discrepancy.

The basic idea is to keep a number of reserved bits for these features, set to 0 while waiting for some future definition.

It seems also interesting to split these reserved bits into 2 categories :
- Optional and skippable features : these are features which a decoder can safely ignore, without jeopardizing decompression result. For example, a purely informational signal with no impact on decompression.
- Future features, disabled by default (0): these features can have unpredictable impact on compression format, such as : adding a new field costing a few more bytes. A non-compatible decoder cannot take the risk to proceed with decompression. It will stop on detecting such a reserved bit to 1 and gives an error message.

While it's great to keep room for the future, it should not take a too much toll in the present. So only a few bits will be reserved. If more are needed, it simply means another frame format is necessary. It's enough in such case to use a different frame identifier (First 4 bytes of a frame).

Sunday, April 3, 2016

Working with streaming

 Streaming, an advanced and very nice processing mode that a few codecs offer to deal with small data segments. This is great in communication scenarios. For lossless data compression, it makes it possible to send tiny packets, in order to create a low-latency interaction, while preserving strong compression capabilities, by using previously sent data to compress following packets.

Ideally, on the encoding side, the user should be able to send any amount of data, from the smallest possible (1 byte) to much larger ones (~~MB). It's up to the encoder to decide how to deal with this. It may group several small fields into a single packet, or conversely break larger ones into multiple packets. In order to avoid any unwanted delay, a "flush" command shall be available, so that the user can decide it's time to send buffered data.

On the other side, a compatible decoder shall be able to cope with whatever data was sent by the encoder. This obviously requires a bit of coordination, a set of shared rules.

The zip format defines a maximum copy distance (32 KB). Data is sent as a set of blocks, but there is no maximum block size (except non-compressed blocks, which must be <= 64 KB).
A compatible zip decoder must be able to cope with these conditions. It must keep up to 32 KB of previously received data, and be able to break decoding operation in the middle of a block, should it receive a block way too large to fit into its memory buffer.
Thankfully, once this capability is achieved, it's possible to decode with a buffer size of 32 KB + maximum chunk size, with "chunk size" being the maximum size the decoder can decode from a single block. In general, it's a bit more than that, in order to ease a few side-effects, but we won't go into details.

The main take-away is : buffer size is a consequence of maximum copy distance, plus a reasonable amount of data to be decoded in a single pass.

zstd's proposition is to reverse the logic : the size of the decoder buffer is set and announced in its frame header. The decoder can safely allocate the requested amount of memory. It's up to the encoder to respect this condition (otherwise, compressed data is considered corrupted).

In current version of the format, this buffer size can vary from 4 KB to 128 MB. It's a pretty wide range, and crucially, it includes possibilities for small memory footprint. A decoder which can only handle small buffer sizes can immediately detect and discard frames which ask for more than its capabilities.

Once the buffer size is settled, data is sent as "blocks". Each block has a maximum size of 128 KB. So, in theory, a block could be larger than the agreed decoder buffer. What would happen in such case ?

Following zip example, one solution would be for the decoder to be able to stop (and then resume) decoding operation in the middle of a block. This obviously increases decoder complexity. But the benefit is that the only condition the compressor has to respect is a max copy distance <= buffer size.

On the decoder side though, it's only one side of the problem. It's no point having a very small decoding buffer if some other memory budget dwarf it.

The decoding tables are not especially large : they use 5 KB by default, and could be reduced to half, or possibly a quarter of that (but with impact on compression ratio). Not a big budget.

The real issue is the size of the incoming compressed block. A compressed block must be smaller than its original size, otherwise it will be transmitted in uncompressed format. That still makes it possible to have a (128 KB - 1) block size. This is extremely large compared to a 4 KB buffer.

Zip's solution is that it's not necessary to receive the entire compressed block in memory in order to start decompressing it. This is possible because all symbols are entangled in a single bitstream, which is read in forward direction. So input buffer can be a fraction of a block. It simply stops when there is no more information available.

This will be difficult to imitate for zstd : it has multiple independent bitstreams (between 2 and 5) read in backwards direction.

The backward direction is unusual, and a direct consequence of using ANS entropy : encoding and decoding must be done in reverse direction. FSE solution is to write forward and read backward.
It could have been a different choice : write backward, read forward, as suggested by Fabian Giesen. But it makes the encoder's API more complex : the destination buffer would be filled from the end, instead of the beginning. From a user perspective, it breaks a few common assumptions, and become a good recipe for confusion.
Alternatively, the end result could be memmove() to the beginning of the buffer, with a small but noticeable speed cost.

But even that wouldn't solve the multiple bitstreams design, which is key to zstd's speed advantage. zstd is fast because it manages to keep multiple cpu execution units busy. This is achieved by reducing or eliminating dependencies between operations. At some point, it implies bitstream independence.

In a zstd block, literals are encoded first, followed by LZ symbols. Bitstreams are not entangled : each one occupy its own memory segment.
Considering this setup, it's required to access the full content block to start decoding it (well, more precisely, a few little things could be started in parallel, but it's damn complex and not worth the point here).

Save any last-minute breakthrough on this topic, this direction is a dead-end : any compressed block must be received entirely before starting its decompression.
As a consequence, since small decoding buffer is a consequence of constrained memory budget, it looks logical that the size of incoming compressed blocks should be limited too, to preserve memory.

The limit size of a compressed block could be a dedicated parameter, but it would add complexity. A fairly natural assumption would be that a compressed block should be no larger than the decoding buffer. So let's use that.
(PS : another potential candidate would be cBlockSize <= bufferSize/2 , but even such a simple division by 2 looks like a recipe for future confusion).

So now, the encoder side enforces a maximum block size no larger than the decoding buffer. Fair enough. Multiple smaller blocks also means multiple headers, so it could impact compression efficiency. Thankfully, zstd includes both a "default statistics" and an experimental "repeat statistics" modes, which can be used to reduce header size to zero, and provide some answer to this issue.

But there is more to it.
Problem is, amount of data previously sent can be any size. The encoder may arbitrarily receive a "flush" order at any time. So each received block can be any size (up to maximum), and not necessarily fill the buffer.
Hence, what happens when we get closer to buffer's end ?

Presuming the decoder doesn't have the capability to stop decompression in the middle of a block, the next block shall not cross the limit of the decoder buffer. Hence, if there are 2.5 KB left in decoder buffer before reaching its end, the next block maximum size must be 2.5 KB.

It becomes a new condition for the encoder to respect : keep track of decoder buffer fill level, ensure to never cross the limit, stop at exact end of the buffer, and then restart from zero.
It looks complex, but the compressor knows the size of the decoder buffer : it was specified at the beginning of the frame. So it is manageable.

But is that desirable ?
From an encoder perspective, it seems better to get free of such restriction, just accept the block size and copy distance limits, and then let the decoder deal with it, even if it requires a complex capability of "stop and resume" in the middle of a block.
From a decoder perspective, it looks better to only handle full blocks, and require the encoder to pay attention to never break this assumption.

Classical transfer of complexity.
It makes for an interesting design choice. And as v1.0 gets nearer, one will have to be selected.

Edit : And the final choice is :

Well, a decision was necessary, so here it is :

The selected design only impose distance limit and maximum block size to the encoder , both values being equal, and provided in the frame header.
The encoder doesn't need to track the "fill level" of the decoder buffer.

As stated above, a compliant decoder using the exact buffer size should have the capability to break decompression operation in the middle of a block, in order to reach the exact end of the buffer, and restart from the beginning.

However, there is a trick ...
Should the decoder not have this capability, it's enough to extend the size of the buffer by the size of a single block (so it's basically 2x bigger for "small" buffer values (<= 128 KB) ). In which case, the decoder can safely decode every blocks in a single step, without breaking decoding operation in the middle.

Requiring more memory to safely decompress is an "implementation detail", and doesn't impact the spec, which is the real point here.
Thanks to this trick, it's possible to immediately target final spec, and update the decoder implementation later on, as a memory optimization. Therefore, it won't delay v1.0.