Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seekable streams #7

Open
klauspost opened this issue Mar 27, 2023 · 2 comments
Open

Seekable streams #7

klauspost opened this issue Mar 27, 2023 · 2 comments

Comments

@klauspost
Copy link
Contributor

klauspost commented Mar 27, 2023

Force creation of blocks at fixed boundaries to enable arbitrary position decoding and reversed iteration

The idea of fixed offsets doesn't sound like a great solution to seeking.

The easiest is to limit the "frames" (or whatever you call the collection of blocks) you generate. This will allow you to skip forward just by reading the first byte and skipping that number of bytes.

For sorted seeking, however you would need to add an optional min/max value. You only really need 1 bit to signify a min/max value follows the size. So if you limit the size of frames (for 32 bits), you have plenty of bits left to add an indicator that a min/max value follows.

Say your frames are max 1<<20 entries each. This means that you have 4MB of uint32 data. Considering the decompression speed you don't really need more fine grained seeking. The frame size can easily be configurable on the compression side.

This will as a side effect also allow you to compress and decompress these frames concurrently.

If you want reverse seeking you need an additional block at the end with the frame size, so you can skip backwards.

You can of course also have the index separately, where you for each frame record uncompressed length, compressed length and min/max. This would be similar to the index I made for S2 - except for the min/max part.

A separate index of course means that you don't have to modify anything, except make frames smaller.

@klauspost klauspost changed the title Force creation of blocks at fixed boundaries to enable arbitrary position decoding and reversed iteration Seekable streams Mar 27, 2023
@ronanh
Copy link
Owner

ronanh commented Mar 27, 2023

Thanks for the advice.
I've started some work for this 'fixed boundaries' frames thing, and it adds more complexity that I had thought at first, so you're probably right.

I use your zstd lib extensively, saw this S2 thing in the release notes, but never really looked what it really was: this is really interesting.

@klauspost
Copy link
Contributor Author

klauspost commented Mar 27, 2023

Yeah. Fixed boundaries will limit the compression significantly (unless you can do some magic I cannot think of).

Having a bunch of blocks as a frame (or what you'd call it) that can be a starting point for decompression would be reasonable.

The min/max can of each frame can either cater for the the search in sorted or quick rejection of frames (if looking for outliers).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants