-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Implement MetadataProvider
, RoaringMetadataFilter
, and refactor MetadataFilteringOperator
#2847
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
This stack of pull requests is managed by Graphite. Learn more about stacking. Join @Sicheng-Pan and the rest of your teammates on Graphite |
6877cbd
to
1567e7f
Compare
c079998
to
a5b3575
Compare
1567e7f
to
0724e2e
Compare
} | ||
} | ||
|
||
pub(crate) struct MetadataLogReader<'me> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - A quick comment on this struct would be helpful
.collect() | ||
); | ||
|
||
// A $lte check on metadata should yield matching records |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we break this test into pieces? Each of these check comments should/could be a separate unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will break it up in the future when we refactor the unit tests here. Currently the unit tests here have a lot of duplication to setup the test, which could be avoided with some shared utility functions.
.filter_by_metadata(&self.key, metadata_value, &Equal) | ||
.await?, | ||
), | ||
_ => Include( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: prefer enumerating all the cases here (even if code is duplicated). This ensure that in future when someone adds a new type of primitive_operator, the compiler forces them to add it here also otherwise it could be missed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expanded the case here
return Err(MetadataFilteringError::MetadataFilteringIndexError(e)); | ||
} | ||
|
||
let filtered_compact_oids = if let Some(clause) = input.where_clause.as_ref() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add comment here that this takes care of filtering the deletes on the log from the metadata segment
compact_metadata: BTreeMap<&'me str, BTreeMap<&'me MetadataValue, RoaringBitmap>>, | ||
document: BTreeMap<u32, &'me str>, | ||
domain: RoaringBitmap, | ||
uid_to_oid: BTreeMap<&'me str, u32>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this struct need to be BTreeMap or can it be a HashMap? In general, a btreemap has higher overhead than a hash map so prefer using hash map wherever possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to hashmap when possible
} | ||
|
||
pub(crate) struct MetadataLogReader<'me> { | ||
compact_metadata: BTreeMap<&'me str, BTreeMap<&'me MetadataValue, RoaringBitmap>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as below - the outer map can be a hashmap instead of a btreemap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even the inner map does not have to be strictly a btreemap as we don't have a use-case for dynamically inserting in a sorted fashion. We could keep a Vec<(&MetadataValue, RoaringBitmap)>
, push all the data to it, sort it and keep it here. Then on demand, we could search using binary search
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to add more color to not prefering a btreemap here - btreemaps generally have overhead for e.g. they replicate the key 6 ways. No strong opinion here though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to hashmap
|
||
pub(crate) struct MetadataLogReader<'me> { | ||
compact_metadata: BTreeMap<&'me str, BTreeMap<&'me MetadataValue, RoaringBitmap>>, | ||
document: BTreeMap<u32, &'me str>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as below - this can be a hashmap instead of a btreemap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to hashmap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good abstractions. Nice!
a5b3575
to
6f7a56f
Compare
5296fc9
to
fd3d48c
Compare
Merge activity
|
…ctor `MetadataFilteringOperator`
fd3d48c
to
1e4d67f
Compare
Description of changes
Summarize the changes made by this PR.
MetadataFilteringOperator
. Now it should produce the exact set of offset ids that should be contained in the result, so that the next operator can only focus on hydrating the offset ids.MetadataProvider
enum, which represents an universal interface for metadata access. Currently it can be constructed either from materialized logs or aMetadataSegmentReader
.RoaringMetadataFilter
trait, which helps to evaluate aWhere
clause given aMetadataProvider
.$ne
and$nin
are supported as part of this processTest plan
How are these changes tested?
This is guaranteed to break existing tests. Only the last PR in the stack addresses the breaking changes, and should pass the existing tests. In particular, any test involving the query node should fail because of compilation errors.
pytest
for python,yarn test
for js,cargo test
for rustDocumentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?
N/A