New binary file format #6

hltbra · 2018-11-12T17:45:07Z

Backward incompatibility

This branch makes backward-incompatible changes to the file format of every data structure. JSON serialization is no longer used.

New sorted set binary format

The sorted set file structure changed the most, with the changes focused on optimization. The new structure uses a binary format that stores the number of entries in each score file and writes the size of each element ahead of the element data.

I picked 8 bytes for the count header because that allows counts up to 2**64 - 1. I picked 4 bytes for the data size because that lets each element hold up to 2**32 - 1 bytes (about 4 GiB).
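The capacity arithmetic can be checked with Python's `struct` module. This is only a sketch: the PR does not state the byte order or signedness, so the big-endian unsigned format strings below are assumptions, and the constant names are hypothetical.

```python
import struct

# Assumed encodings (not confirmed by the PR): unsigned, big-endian.
COUNT_FORMAT = ">Q"  # 8-byte unsigned integer for the element count
SIZE_FORMAT = ">I"   # 4-byte unsigned integer for each element's data size

count_bytes = struct.calcsize(COUNT_FORMAT)
size_bytes = struct.calcsize(SIZE_FORMAT)

# Largest value each header can represent.
max_count = 2 ** (8 * count_bytes) - 1
max_element_size = 2 ** (8 * size_bytes) - 1

print(count_bytes, max_count)
print(size_bytes, max_element_size)
```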

The optimization results are approximately the following:

ZRANK: 55% faster
ZCOUNT: 93% faster
ZRANGE: 16% faster
ZREM: 31% faster

I've seen a ~5% run-to-run variation in performance test times on my laptop, so please take that into account (the results are still very significant).
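The PR does not spell out how ZCOUNT was reworked. One plausible reading is that, with the entry count stored at the start of each score file, a count over a score range can sum the headers without reading any element data. The sketch below illustrates that idea only; `read_count`, `zcount`, the in-memory `score_files` mapping, and the big-endian unsigned header are all hypothetical.

```python
import io
import struct

COUNT_HEADER = struct.Struct(">Q")  # assumed: 8-byte big-endian unsigned count


def read_count(fileobj):
    """Return the element count stored in a score file's header."""
    fileobj.seek(0)
    (count,) = COUNT_HEADER.unpack(fileobj.read(COUNT_HEADER.size))
    return count


def zcount(score_files, min_score, max_score):
    """Sum the header counts of score files with min_score <= score <= max_score."""
    return sum(
        read_count(fileobj)
        for score, fileobj in score_files.items()
        if min_score <= score <= max_score
    )


# Hypothetical in-memory stand-ins for three score files (element bodies omitted).
score_files = {
    1.0: io.BytesIO(COUNT_HEADER.pack(3)),
    2.0: io.BytesIO(COUNT_HEADER.pack(5)),
    7.5: io.BytesIO(COUNT_HEADER.pack(2)),
}
total = zcount(score_files, 1.0, 2.0)
```

Each file costs one 8-byte read regardless of how many elements it holds.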

The new format has an 8-byte header (count) followed by the element data. Each element is represented by a 4-byte header and a data section. The 4-byte header contains the size of the data section, and the data section holds the raw data (no serialization).

Example of a score file with 2 elements ("hello" and "world"):

0x0002  0x05  hello    0x05  world
count   size  data     size  data
2       5     "hello"  5     "world"
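A minimal sketch of writing and reading this layout in Python follows. It is not the code from this branch: the function names are hypothetical, and the PR does not state byte order or signedness, so big-endian unsigned integers are assumed.

```python
import io
import struct

# Assumptions (not stated in the PR): big-endian, unsigned integers.
COUNT_HEADER = struct.Struct(">Q")  # 8-byte element count
SIZE_HEADER = struct.Struct(">I")   # 4-byte size of the data section


def encode_score_file(elements):
    """Return the bytes of a score file holding the given byte strings."""
    parts = [COUNT_HEADER.pack(len(elements))]
    for data in elements:
        parts.append(SIZE_HEADER.pack(len(data)))
        parts.append(data)  # raw data, no serialization
    return b"".join(parts)


def decode_score_file(fileobj):
    """Read back every element, driven by the count in the file header."""
    (count,) = COUNT_HEADER.unpack(fileobj.read(COUNT_HEADER.size))
    elements = []
    for _ in range(count):
        (size,) = SIZE_HEADER.unpack(fileobj.read(SIZE_HEADER.size))
        elements.append(fileobj.read(size))
    return elements


encoded = encode_score_file([b"hello", b"world"])
decoded = decode_score_file(io.BytesIO(encoded))
```

For the two-element example, `encoded` is 26 bytes: 8 for the count, then 4 + 5 for each element. The reader loops exactly `count` times instead of reading until an empty read comes back.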

hltbra added 13 commits November 12, 2018 12:27

Change ZCARD performance test to use multiple scores

10cd0df

Optimize zcount to use new zset file format

a4ec792

Move ZSET encoding/decoding to ZSetEncoder class

d64afd8

Fix performance regression after aa12d3

1ff832b

The extra `seek()` calls were a performance regression

Remove unnecessary JSON serialization

c942843

After the binary format for zsets, the JSON serialization became useless

Always read files in binary mode

4563551

This is for consistency and to help in the future Python 3 migration

Avoid unnecessary seek on rewrite_content()

ba07af7

Extract method skip_header()

93e0b0d

Simplify read_element by relying on file header

bee05af

Try to read the number of elements found in the file header instead of checking if 0 bytes were returned when reading

Convert class methods into instance methods

a9ccd48

All these methods now share the file object (instance attribute)

Update performance tests to have zset improvement results

b0e6d58

Add a string test to exercise unicode chars

8113842

Before the binary file manipulation, special unicode characters were not handled well
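The unicode point in the last commit can be illustrated with a short round trip. This sketch assumes UTF-8 (the PR does not name an encoding) and a big-endian unsigned 4-byte size header. The detail worth noticing is that the size prefix counts encoded bytes, not characters.

```python
import struct

SIZE_HEADER = struct.Struct(">I")  # assumed: 4-byte big-endian unsigned size

text = u"ol\u00e1 mundo \u2603"  # 11 characters, two of them non-ASCII
data = text.encode("utf-8")      # assumed encoding; the PR does not name one
record = SIZE_HEADER.pack(len(data)) + data

# Read the size back, then slice out exactly that many bytes and decode them.
(size,) = SIZE_HEADER.unpack(record[:SIZE_HEADER.size])
restored = record[SIZE_HEADER.size:SIZE_HEADER.size + size].decode("utf-8")
```

Here the text has 11 characters but encodes to 14 bytes, so a size header based on character count would truncate the element.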

hltbra merged commit 3c3cdbc into master Nov 12, 2018

hltbra deleted the feature/new-zset-format branch November 12, 2018 18:20
