Handle DGM scenario for Xor8 filter. #9

Open
prataprc opened this issue Jan 21, 2020 · 1 comment

Comments

prataprc commented Jan 21, 2020

The Xor8 filter is an immutable data structure that generates a bitmap index
for a given set of unique keys. The generated bitmap index has a footprint
of roughly 9 bits per key. The sizing requirement for building the
index can be understood with the following example.

Keys are u64 values and we cannot incrementally add new keys to an
existing bitmap index. So let us say we need to create a bitmap index
for 1 million keys:

keys as Vec<u64> = 8MB
stack of keys.len() as Vec<KeyIndex> = 12MB
q0 of block_length as Vec<KeyIndex> = 4.8MB
q1 of block_length as Vec<KeyIndex> = 4.8MB
q2 of block_length as Vec<KeyIndex> = 4.8MB
sets0 of block_length as Vec<XorSet> = 4.8MB
sets1 of block_length as Vec<XorSet> = 4.8MB
sets2 of block_length as Vec<XorSet> = 4.8MB
finger_print = 1.23 MB

block_length ~= (keys.len() * 1.23) / 3, which is roughly 410,000 for 1 million keys

A total memory footprint of about 50MB is required to index 1 million
keys.

To index 1 billion keys we may need a footprint of 50GB of memory.
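
For reference, a minimal sketch of the arithmetic above, assuming 12-byte
KeyIndex and XorSet entries (the struct sizes are inferred from the numbers
above, not taken from the crate):

```rust
// Rough estimate of the build-time footprint described above. The 12-byte
// sizes for KeyIndex and XorSet are assumptions implied by the figures,
// not taken from the crate.
fn build_footprint_bytes(num_keys: usize) -> usize {
    let key_index = 12; // assumed size of a KeyIndex entry, in bytes
    let xor_set = 12;   // assumed size of an XorSet entry, in bytes
    let block_length = (num_keys as f64 * 1.23 / 3.0).ceil() as usize;

    let keys = num_keys * 8;                   // keys as Vec<u64>
    let stack = num_keys * key_index;          // stack of keys.len() as Vec<KeyIndex>
    let queues = 3 * block_length * key_index; // q0, q1, q2
    let sets = 3 * block_length * xor_set;     // sets0, sets1, sets2
    let finger_print = 3 * block_length;       // one byte per slot

    keys + stack + queues + sets + finger_print
}

fn main() {
    // prints roughly 50 million bytes for 1 million keys
    println!("{} bytes", build_footprint_bytes(1_000_000));
}
```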

lemire commented Jan 21, 2020

> To index 1 billion keys we may need a footprint of 50GB of memory.

Having a billion keys (64-bit integers) in memory would require 8GB. A billion keys is quite a lot.

An obvious solution is sharding... divide your data set into subsets, and process each set separately. This would give you several small filters, which may be more manageable.
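
A minimal sketch of the sharding idea, assuming a simple hash-based routing of keys to shards; the shard count and helper names here are illustrative, not part of the crate's API. Each shard's key set is then fed to its own, much smaller Xor8 build:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const SHARDS: usize = 64; // illustrative shard count

// Route a key to a shard by hashing it; queries use the same routing.
fn shard_of(key: u64) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() as usize) % SHARDS
}

// Partition the key set; each partition is then indexed with its own,
// much smaller Xor8 filter (possibly built in a separate pass or machine).
fn partition_keys(keys: &[u64]) -> Vec<Vec<u64>> {
    let mut shards = vec![Vec::new(); SHARDS];
    for &k in keys {
        shards[shard_of(k)].push(k);
    }
    shards
}
```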

But what if you still want to construct a gigantic, out-of-memory filter?

Then I think we want to use a sorting-based approach (because sorting can be done out-of-memory with relative ease).

How might this work? I think it is best expressed in code, but I think I can describe the idea. For each key, you have h1(k), h2(k), h3(k). Sort your keys by h1(k). Then go through the sorted list and find all keys whose h1(k) value occurs exactly once. Remove these keys from the list and add them to a stack. Then sort what remains with respect to h2(k). And so forth.

If you implement it with some care, it could use no more memory than the size of the input... I guess...

You can use radix sort, so the sort routine runs in linear time (though this might be at the expense of memory).
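
A rough in-memory sketch of this sort-and-peel idea, assuming a placeholder hash function, three seeds, and the block_length sizing formula from the issue description; a real out-of-core build would replace the in-memory sort with an external sort and retry failed rounds with new seeds:

```rust
// Sketch of the sort-based peeling described above. The hash function,
// stop condition, and names are placeholders, not the crate's build code.
fn hash(key: u64, seed: u64) -> u64 {
    // any reasonable 64-bit mixer works for the sketch
    let mut x = key.wrapping_add(seed).wrapping_mul(0x9E37_79B9_7F4A_7C15);
    x ^= x >> 32;
    x
}

/// Repeatedly sort the remaining keys by their slot in block i and peel off
/// every key whose slot is occupied exactly once in the current round.
fn peel_by_sorting(mut keys: Vec<u64>, seeds: [u64; 3]) -> Vec<(u64, usize)> {
    // table size per block, as in the sizing note above (fixed for the build)
    let block_length = ((keys.len() as f64 * 1.23 / 3.0).ceil() as u64).max(1);
    let pos = |k: u64, seed: u64| hash(k, seed) % block_length;
    let mut stack = Vec::with_capacity(keys.len());
    let mut stalled = 0; // rounds in a row that peeled nothing
    let mut round = 0;
    while !keys.is_empty() && stalled < 3 {
        let which = round % 3;
        let seed = seeds[which];
        keys.sort_unstable_by_key(|&k| pos(k, seed)); // could be an external sort
        let mut rest = Vec::with_capacity(keys.len());
        let mut i = 0;
        while i < keys.len() {
            // find the run of keys that map to the same slot in block `which`
            let mut j = i + 1;
            while j < keys.len() && pos(keys[j], seed) == pos(keys[i], seed) {
                j += 1;
            }
            if j - i == 1 {
                stack.push((keys[i], which)); // singleton slot: safe to peel
            } else {
                rest.extend_from_slice(&keys[i..j]);
            }
            i = j;
        }
        stalled = if rest.len() == keys.len() { stalled + 1 } else { 0 };
        keys = rest;
        round += 1;
    }
    stack // a real build replays this stack in reverse to assign fingerprints
}
```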

cc @thomasmueller
