Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(executor): decode row datums from pk #2957

Merged
merged 42 commits into from
Jun 7, 2022
Merged

Conversation

kwannoel
Copy link
Contributor

@kwannoel kwannoel commented Jun 1, 2022

#588 requires us to change the way rows are encoded / decoded.
This PR provides support for decoding first, since it is compatible with the current encoding.

Future PRs would need to:

  • Implement dedup pk encoding for mview (and other executors which write).
  • Update all executors which read from storage (e.g. StreamScan), to handle new encoding.

What's changed and what's your intention?

  • Summarize your change (mandatory)
    • Introduce DedupPkCellTableIter to decode pk buffer into row datums.
    • Update RowSeqScan executor to use DedupPkCellTableIter when fetching from storage.
  • How does this PR work? Need a brief introduction for the changed logic
    DedupPkCellTableIter uses:
    • OrderedRowDeserializer to decode pk,
    • and column ids to map decoded pk datums into row positions.
  • Describe any limitations of the current code (optional)
    • For backwards compatibility I currently rely on e2e tests, since the PR is large. Unit tests only cover decoding correctness for new encoding. Can add unit tests for these in separate PR.
    • Unit tests use lower level BatchWrite interface for writing to storage,
      because row encoding from relational layer differs from dedup pk encoding.
      In future PR when encoding is implemented on relational layer, this logic can be replaced.

Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

related #588

TODOs

  • Update RowSeqScan to use CellBasedRowWithPkIter instead.
  • Special case for memcomparable != value enc. Test this for all datatypes.
  • Naming perhaps dedupcelliter or something similar is better....

@kwannoel kwannoel changed the title dedup arrange row and row perf(executor): Deduplicate keys stored in values Jun 1, 2022
@kwannoel kwannoel changed the title perf(executor): Deduplicate keys stored in values perf(executor): Store a column either in pk or value, but not both Jun 1, 2022
@kwannoel
Copy link
Contributor Author

kwannoel commented Jun 3, 2022

(made a digression to learn how pk is derived from mview, not blocked on anything)

mview -> plans (use explain)-> each plan might define pk, for example logicalAgg: https:/singularity-data/risingwave/blob/031a8efda7d3c85b4a1722d36aef10d4ac4c4d25/src/frontend/src/optimizer/plan_node/logical_agg.rs#L303-L307

If there are no group keys, we might go to the extent of duplicating the entire row, once in key and once in value.

@kwannoel kwannoel force-pushed the kwannoel/mview-pk-repr branch 4 times, most recently from 9f0f578 to c6bf910 Compare June 6, 2022 11:45
@codecov
Copy link

codecov bot commented Jun 6, 2022

Codecov Report

Merging #2957 (db33967) into main (db33967) will not change coverage.
The diff coverage is n/a.

❗ Current head db33967 differs from pull request most recent head 0f166e3. Consider uploading reports for the commit 0f166e3 to get more accurate results

@@           Coverage Diff           @@
##             main    #2957   +/-   ##
=======================================
  Coverage   73.21%   73.21%           
=======================================
  Files         726      726           
  Lines       97999    97999           
=======================================
  Hits        71748    71748           
  Misses      26251    26251           
Flag Coverage Δ
rust 73.21% <0.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@kwannoel kwannoel force-pushed the kwannoel/mview-pk-repr branch 4 times, most recently from cd10bea to 525714d Compare June 7, 2022 10:14
@kwannoel kwannoel changed the title perf(executor): Store a column either in pk or value, but not both perf(executor): Decode row datums from pk Jun 7, 2022
@kwannoel kwannoel changed the title perf(executor): Decode row datums from pk perf(executor): decode row datums from pk Jun 7, 2022
@kwannoel kwannoel marked this pull request as ready for review June 7, 2022 11:24
@kwannoel kwannoel requested review from lmatz, skyzh and wcy-fdu June 7, 2022 11:28
Copy link
Contributor

@lmatz lmatz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Good work

// Given the following row: | user_id | age | name |
// if pk was derived from `user_id, name`
// we can decode pk -> user_id, name,
// and retrieve the row: |_| age |_|,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current approach is good.

As this PR is for decoding only, I think after encoding is implemented, BatchQueryExecutor will also need to use DedupPkCellBasedTableRowIter as RowSeqScanExecutor, right?

Otherwise, the row it returns will have holes in it as pk is not filled in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this PR is for decoding only, I think after encoding is implemented, BatchQueryExecutor will also need to use DedupPkCellBasedTableRowIter as RowSeqScanExecutor, right?

Yup that's correct!

}

#[async_trait::async_trait]
impl<S: StateStore> CellTableChunkIter for CellBasedTableRowIter<S> {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the whole feature (encoding part also) is done, I guess CellBasedTableRowIter will never be used as returning a DataChunk probably? 🤔

And this CellTableChunkIter will only be used by DedupPkCellBasedTableRowIter. Maybe we can save one trait, we can revisit this in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants