New binary file format #6

Merged: 13 commits, Nov 12, 2018
hltbra commented on Nov 12, 2018

Backward incompatibility

This branch makes backward-incompatible changes to the file format of every data structure. JSON serialization is no longer used.


New sorted set binary format

The sorted set file structure changed the most, and those changes focus on optimization. Each score file now uses a binary format that stores the number of entries in the file, and the size of each element is stored immediately before the element's data.

I picked 8 bytes for the count header because that gives each score file a capacity of about 2**64 entries. I picked 4 bytes for the data size because that lets each element hold up to about 4 GiB (2**32 bytes) of data.

The optimization results are approximately the following:

  • ZRANK: 55% faster
  • ZCOUNT: 93% faster
  • ZRANGE: 16% faster
  • ZREM: 31% faster

Performance test times vary by about 5% from run to run on my laptop, so please take that into account. The results are still very significant.

The new format has an 8-byte header (count) followed by the elements data.
Each element is represented by a 4-byte header and a data section.
The 4-byte header contains the size of the data section.
The data section is the raw data (no serialization).

Example of a score file with 2 elements ("hello" and "world"):

    0x0002   0x05 hello     0x05 world
     count   size data      size data
      2       5   "hello"   5    "world"
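
The layout above can be sketched in Python with the standard `struct` module. This is a minimal illustration and not the code from this branch: the function names are hypothetical, and big-endian unsigned integers are an assumption, since the byte order is not stated here.

```python
import struct

# Assumption: big-endian, unsigned fields. Only the widths (8 and 4 bytes)
# come from the description above.
COUNT_HEADER = struct.Struct(">Q")  # 8-byte element count
SIZE_HEADER = struct.Struct(">I")   # 4-byte size of one element's data


def encode_score_file(elements):
    """Serialize a list of bytes objects: count header, then size + data pairs."""
    parts = [COUNT_HEADER.pack(len(elements))]
    for data in elements:
        parts.append(SIZE_HEADER.pack(len(data)))
        parts.append(data)  # raw data, no further serialization
    return b"".join(parts)


def decode_score_file(blob):
    """Parse the bytes produced by encode_score_file back into a list."""
    (count,) = COUNT_HEADER.unpack_from(blob, 0)
    offset = COUNT_HEADER.size
    elements = []
    for _ in range(count):
        (size,) = SIZE_HEADER.unpack_from(blob, offset)
        offset += SIZE_HEADER.size
        elements.append(blob[offset:offset + size])
        offset += size
    return elements


blob = encode_score_file([b"hello", b"world"])
restored = decode_score_file(blob)
```

With these widths, the two-element example takes 8 bytes for the count plus 4 + 5 bytes per element. The diagram above abbreviates the header values for readability.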

Commit notes:

  • The extra `seek()` calls were a performance regression.
  • After the binary format for zsets, the JSON serialization became useless.
  • This is for consistency and to help in the future Python 3 migration.
  • Try to read the number of elements found in the file header instead of checking if 0 bytes were returned when reading.
  • All these methods now share the file object (instance attribute).
  • Before the binary file manipulation, special unicode characters were not handled well.
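
The commit notes about reading the element count from the header and about unicode handling can be illustrated with a short sketch. It is not the code from this branch: `count_elements` and `iter_elements` are hypothetical names, and the big-endian byte order and UTF-8 encoding are assumptions.

```python
import io
import struct


def count_elements(fileobj):
    """Return the element count by reading only the 8-byte header."""
    fileobj.seek(0)
    header = fileobj.read(8)
    if len(header) < 8:
        return 0  # assumption: an empty or truncated file means an empty set
    return struct.unpack(">Q", header)[0]


def iter_elements(fileobj):
    """Yield each element as text, looping on the header count.

    The loop is driven by the count, so it never has to test whether a
    read returned 0 bytes to detect the end of the file.
    """
    count = count_elements(fileobj)  # leaves the position right after the header
    for _ in range(count):
        (size,) = struct.unpack(">I", fileobj.read(4))
        yield fileobj.read(size).decode("utf-8")  # assumption: UTF-8 text


# Build a small in-memory score file with non-ASCII elements.
words = ["héllo", "wörld"]
payload = struct.pack(">Q", len(words))
for word in words:
    raw = word.encode("utf-8")
    payload += struct.pack(">I", len(raw)) + raw

score_file = io.BytesIO(payload)
total = count_elements(score_file)
decoded = list(iter_elements(score_file))
```

Because sizes are measured on the encoded bytes, multi-byte characters round-trip without any special handling in this sketch.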

hltbra merged commit 3c3cdbc into master on Nov 12, 2018
hltbra deleted the feature/new-zset-format branch on November 12, 2018 at 18:20