Handle DGM scenario for Xor8 filter. #9

Open
prataprc opened this issue Jan 21, 2020 · 1 comment

Comments

prataprc commented Jan 21, 2020

The Xor8 filter is an immutable data structure that generates a bitmap index
for a given set of unique keys. The generated bitmap index has a footprint
of roughly 9 bits per key. The sizing requirement for building the
index can be understood with the following example.

Keys are u64 values and we cannot incrementally add new keys to an
existing bitmap index. So let us say we need to create a bitmap index
for 1 million keys:

keys as Vec<u64> = 8MB
stack of keys.len() as Vec<KeyIndex> = 12MB
q0 of block_length as Vec<KeyIndex> = 4.8MB
q1 of block_length as Vec<KeyIndex> = 4.8MB
q2 of block_length as Vec<KeyIndex> = 4.8MB
sets0 of block_length as Vec<XorSet> = 4.8MB
sets1 of block_length as Vec<XorSet> = 4.8MB
sets2 of block_length as Vec<XorSet> = 4.8MB
finger_print = 1.23 MB

block_length ~= (keys.len() * 1.23) / 3, which is roughly 410,000 for 1 million keys

A total memory footprint of about 50MB is required to index 1 million
keys.

To index 1 billion keys we may need a footprint of 50GB of memory.
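
For reference, a minimal sketch of the arithmetic above, assuming 12-byte
KeyIndex and XorSet entries (the struct sizes are inferred from the numbers
above, not taken from the crate):

```rust
// Rough estimate of the build-time footprint described above. The 12-byte
// sizes for KeyIndex and XorSet are assumptions implied by the figures,
// not taken from the crate.
fn build_footprint_bytes(num_keys: usize) -> usize {
    let key_index = 12; // assumed size of a KeyIndex entry, in bytes
    let xor_set = 12;   // assumed size of an XorSet entry, in bytes
    let block_length = (num_keys as f64 * 1.23 / 3.0).ceil() as usize;

    let keys = num_keys * 8;                   // keys as Vec<u64>
    let stack = num_keys * key_index;          // stack of keys.len() as Vec<KeyIndex>
    let queues = 3 * block_length * key_index; // q0, q1, q2
    let sets = 3 * block_length * xor_set;     // sets0, sets1, sets2
    let finger_print = 3 * block_length;       // one byte per slot

    keys + stack + queues + sets + finger_print
}

fn main() {
    // prints roughly 50 million bytes for 1 million keys
    println!("{} bytes", build_footprint_bytes(1_000_000));
}
```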

lemire commented Jan 21, 2020

> To index 1 billion keys we may need a footprint of 50GB of memory.

Having a billion keys (64-bit integers) in memory would require 8GB. A billion keys is quite a lot.

An obvious solution is sharding... divide your data set into subsets, and process each set separately. This would give you several small filters, which may be more manageable.
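
A minimal sketch of the sharding idea, assuming a simple hash-based routing of keys to shards; the shard count and helper names here are illustrative, not part of the crate's API. Each shard's key set is then fed to its own, much smaller Xor8 build:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const SHARDS: usize = 64; // illustrative shard count

// Route a key to a shard by hashing it; queries use the same routing.
fn shard_of(key: u64) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() as usize) % SHARDS
}

// Partition the key set; each partition is then indexed with its own,
// much smaller Xor8 filter (possibly built in a separate pass or machine).
fn partition_keys(keys: &[u64]) -> Vec<Vec<u64>> {
    let mut shards = vec![Vec::new(); SHARDS];
    for &k in keys {
        shards[shard_of(k)].push(k);
    }
    shards
}
```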

But what if you still want to construct a gigantic, out-of-memory filter?

Then I think we want to use a sorting-based approach (because sorting can be done out-of-memory with relative ease).

How might this work? I think it is best expressed in code, but I think I can describe the idea. For each key, you have h1(k), h2(k), h3(k). Sort your keys by h1(k). Then go through the sorted list and find all keys whose h1(k) value occurs exactly once. Remove these keys from the list and add them to a stack. Then sort what remains with respect to h2(k). And so forth.

If you implement it with some care, it could use no more memory than the size of the input... I guess...

You can use radix sort, so the sort routine runs in linear time (though this might be at the expense of memory).
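
A rough in-memory sketch of this sort-and-peel idea, assuming a placeholder hash function, three seeds, and the block_length sizing formula from the issue description; a real out-of-core build would replace the in-memory sort with an external sort and retry failed rounds with new seeds:

```rust
// Sketch of the sort-based peeling described above. The hash function,
// stop condition, and names are placeholders, not the crate's build code.
fn hash(key: u64, seed: u64) -> u64 {
    // any reasonable 64-bit mixer works for the sketch
    let mut x = key.wrapping_add(seed).wrapping_mul(0x9E37_79B9_7F4A_7C15);
    x ^= x >> 32;
    x
}

/// Repeatedly sort the remaining keys by their slot in block i and peel off
/// every key whose slot is occupied exactly once in the current round.
fn peel_by_sorting(mut keys: Vec<u64>, seeds: [u64; 3]) -> Vec<(u64, usize)> {
    // table size per block, as in the sizing note above (fixed for the build)
    let block_length = ((keys.len() as f64 * 1.23 / 3.0).ceil() as u64).max(1);
    let pos = |k: u64, seed: u64| hash(k, seed) % block_length;
    let mut stack = Vec::with_capacity(keys.len());
    let mut stalled = 0; // rounds in a row that peeled nothing
    let mut round = 0;
    while !keys.is_empty() && stalled < 3 {
        let which = round % 3;
        let seed = seeds[which];
        keys.sort_unstable_by_key(|&k| pos(k, seed)); // could be an external sort
        let mut rest = Vec::with_capacity(keys.len());
        let mut i = 0;
        while i < keys.len() {
            // find the run of keys that map to the same slot in block `which`
            let mut j = i + 1;
            while j < keys.len() && pos(keys[j], seed) == pos(keys[i], seed) {
                j += 1;
            }
            if j - i == 1 {
                stack.push((keys[i], which)); // singleton slot: safe to peel
            } else {
                rest.extend_from_slice(&keys[i..j]);
            }
            i = j;
        }
        stalled = if rest.len() == keys.len() { stalled + 1 } else { 0 };
        keys = rest;
        round += 1;
    }
    stack // a real build replays this stack in reverse to assign fingerprints
}
```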

cc @thomasmueller
