perf(executor): decode row datums from pk #2957

kwannoel · 2022-06-01T18:05:33Z

#588 requires us to change the way rows are encoded / decoded.
This PR provides support for decoding first, since it is compatible with the current encoding.

Future PRs would need to:

- Implement dedup pk encoding for mview (and other executors which write).
- Update all executors which read from storage (e.g. StreamScan) to handle the new encoding.

What's changed and what's your intention?

Summarize your change (mandatory)
- Introduce DedupPkCellTableIter to decode pk buffer into row datums.
- Update RowSeqScan executor to use DedupPkCellTableIter when fetching from storage.
How does this PR work? Need a brief introduction for the changed logic
DedupPkCellTableIter uses:
- OrderedRowDeserializer to decode pk,
- and column ids to map decoded pk datums into row positions.
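
The mapping step described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `Datum` is simplified to `Option<i64>`, and `pk_to_row_mapping` and `fill_row_from_pk` are hypothetical names. It assumes the pk has already been decoded into datums (the job of `OrderedRowDeserializer`) and only shows how column ids could route those datums into row positions.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a datum (assumption): `None` models a hole or NULL.
type Datum = Option<i64>;

/// For each pk column, find its position in the output row.
/// `pk_column_ids[i]` is the column id of the i-th decoded pk datum;
/// `row_column_ids[j]` is the column id at output position j.
/// Returns `(pk_index, row_index)` pairs for pk columns that appear in the row.
fn pk_to_row_mapping(pk_column_ids: &[i32], row_column_ids: &[i32]) -> Vec<(usize, usize)> {
    let positions: HashMap<i32, usize> = row_column_ids
        .iter()
        .enumerate()
        .map(|(row_idx, col_id)| (*col_id, row_idx))
        .collect();
    pk_column_ids
        .iter()
        .enumerate()
        .filter_map(|(pk_idx, col_id)| positions.get(col_id).map(|row_idx| (pk_idx, *row_idx)))
        .collect()
}

/// Write decoded pk datums into their row positions, overwriting the holes.
fn fill_row_from_pk(row: &mut [Datum], pk_datums: &[Datum], mapping: &[(usize, usize)]) {
    for &(pk_idx, row_idx) in mapping {
        row[row_idx] = pk_datums[pk_idx];
    }
}
```

Computing the mapping once per iterator rather than once per row keeps the per-row work to a few indexed writes.
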
Describe any limitations of the current code (optional)
- For backwards compatibility, I currently rely on e2e tests, since the PR is large. Unit tests only cover decoding correctness for the new encoding. Unit tests for backwards compatibility can be added in a separate PR.
- Unit tests use the lower-level BatchWrite interface for writing to storage,
  because row encoding from the relational layer differs from dedup pk encoding.
  In a future PR, when encoding is implemented on the relational layer, this logic can be replaced.

Checklist

I have written necessary docs and comments
I have added necessary unit tests and integration tests
All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

related #588

TODOs

- Update RowSeqScan to use CellBasedRowWithPkIter instead.
- Handle the special case where memcomparable encoding != value encoding. Test this for all datatypes.
- Naming: perhaps dedupcelliter or something similar is better.
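
As background for the memcomparable TODO: pk bytes are written in an order-preserving format, which generally has a different byte layout from the value encoding, so the two cannot be decoded interchangeably. A minimal sketch for a signed 32-bit integer, using the common sign-flip, big-endian technique. The function names are hypothetical, and the little-endian "value encoding" here is an assumption for illustration only, not the actual RisingWave serializer.

```rust
/// Order-preserving ("memcomparable") encoding of an i32:
/// flip the sign bit, then write the bytes big-endian, so that
/// byte-wise comparison matches numeric comparison.
fn encode_memcomparable_i32(v: i32) -> [u8; 4] {
    ((v as u32) ^ 0x8000_0000).to_be_bytes()
}

/// Inverse of `encode_memcomparable_i32`.
fn decode_memcomparable_i32(bytes: [u8; 4]) -> i32 {
    (u32::from_be_bytes(bytes) ^ 0x8000_0000) as i32
}

/// Hypothetical value encoding (assumption): plain little-endian bytes.
fn encode_value_i32(v: i32) -> [u8; 4] {
    v.to_le_bytes()
}
```

For this type the memcomparable bytes round-trip exactly, so the datum can be recovered from the pk alone. Whether that holds for every datatype is what the TODO asks to test.
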

src/stream/src/executor/mview/materialize.rs

kwannoel · 2022-06-03T15:46:37Z

(made a digression to learn how pk is derived from mview, not blocked on anything)

mview -> plans (use `explain`) -> each plan might define a pk, for example LogicalAgg: https:/singularity-data/risingwave/blob/031a8efda7d3c85b4a1722d36aef10d4ac4c4d25/src/frontend/src/optimizer/plan_node/logical_agg.rs#L303-L307

If there are no group keys, we might go to the extent of duplicating the entire row, once in key and once in value.
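
To make the duplication concrete, here is a rough sketch that counts how many datums get stored per row with and without deduplication. `split_columns` and `stored_datums` are hypothetical helpers, and the model (one stored datum per pk column plus one per value column) is a simplifying assumption.

```rust
/// Split a row's column ids into those recoverable from the pk and those
/// that still have to be written as value cells under a dedup encoding.
fn split_columns(column_ids: &[i32], pk_column_ids: &[i32]) -> (Vec<i32>, Vec<i32>) {
    column_ids
        .iter()
        .copied()
        .partition(|id| pk_column_ids.contains(id))
}

/// Number of datums stored for one row: the pk datums in the key, plus
/// either every column (no dedup) or only the non-pk columns (dedup).
fn stored_datums(column_ids: &[i32], pk_column_ids: &[i32], dedup: bool) -> usize {
    let (_, value_columns) = split_columns(column_ids, pk_column_ids);
    let value_count = if dedup {
        value_columns.len()
    } else {
        column_ids.len()
    };
    pk_column_ids.len() + value_count
}
```

When the pk covers every column of a three-column row, this model stores six datums without dedup and three with it.
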

src/batch/src/executor/row_seq_scan.rs

src/common/src/util/ordered/serde.rs

src/compute/tests/row_seq_scan.rs

src/frontend/src/optimizer/plan_node/batch_seq_scan.rs

codecov · 2022-06-06T13:31:51Z

Codecov Report

Merging #2957 (db33967) into main (db33967) will not change coverage.
The diff coverage is n/a.

❗ Current head db33967 differs from pull request most recent head 0f166e3. Consider uploading reports for the commit 0f166e3 to get more accurate results

@@           Coverage Diff           @@
##             main    #2957   +/-   ##
=======================================
  Coverage   73.21%   73.21%           
=======================================
  Files         726      726           
  Lines       97999    97999           
=======================================
  Hits        71748    71748           
  Misses      26251    26251

Flag	Coverage Δ
rust	`73.21% <0.00%> (ø)`

Flags with carried forward coverage won't be shown.

src/storage/src/table/cell_based_table.rs

lmatz

LGTM! Good work

lmatz · 2022-06-07T12:00:31Z

src/storage/src/table/cell_based_table.rs

+// Given the following row: | user_id | age | name |
+// if pk was derived from `user_id, name`
+// we can decode pk -> user_id, name,
+// and retrieve the row: |_| age |_|,


The current approach is good.

As this PR is for decoding only, I think that after encoding is implemented, BatchQueryExecutor will also need to use DedupPkCellBasedTableRowIter, as RowSeqScanExecutor does, right?

Otherwise, the rows it returns will have holes in them, since the pk columns are not filled in.

> As this PR is for decoding only, I think that after encoding is implemented, BatchQueryExecutor will also need to use DedupPkCellBasedTableRowIter, as RowSeqScanExecutor does, right?

Yup that's correct!
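
The "holes" concern can be illustrated with the `| user_id | age | name |` example from the code comment. This is a sketch under assumptions: `Datum`, `read_value_cells_only` and `read_with_pk` are hypothetical and much simpler than the real iterators, and `None` stands in for a hole here.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Datum {
    Int(i64),
    Text(String),
}

/// A row where `None` marks a position that has not been filled in.
type Row = Vec<Option<Datum>>;

/// Hypothetical reader that only looks at value cells: with dedup encoding,
/// pk columns are absent from the value, so their positions stay as holes.
fn read_value_cells_only(width: usize, value_cells: &[(usize, Datum)]) -> Row {
    let mut row = vec![None; width];
    for (pos, datum) in value_cells {
        row[*pos] = Some(datum.clone());
    }
    row
}

/// Hypothetical dedup-aware reader: additionally places the decoded pk
/// datums, given as `(row position, datum)` pairs, into the row.
fn read_with_pk(width: usize, value_cells: &[(usize, Datum)], pk: &[(usize, Datum)]) -> Row {
    let mut row = read_value_cells_only(width, value_cells);
    for (pos, datum) in pk {
        row[*pos] = Some(datum.clone());
    }
    row
}
```

With pk `user_id, name`, the first reader yields `|_| age |_|`, while the second returns the full row. That is why every executor reading the new encoding would need the pk-aware path.
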

lmatz · 2022-06-07T12:04:51Z

src/storage/src/table/cell_based_table.rs

+}
+
+#[async_trait::async_trait]
+impl<S: StateStore> CellTableChunkIter for CellBasedTableRowIter<S> {}


After the whole feature (including the encoding part) is done, I guess CellBasedTableRowIter will probably never be used to return a DataChunk? 🤔

And this CellTableChunkIter will then only be used by DedupPkCellBasedTableRowIter. Maybe we can save one trait; we can revisit this in the future.

Agree with this

github-actions bot added the Invalid PR Title label Jun 1, 2022

kwannoel changed the title ~~dedup arrange row and row~~ perf(executor): Deduplicate keys stored in values Jun 1, 2022

github-actions bot removed the Invalid PR Title label Jun 1, 2022

kwannoel changed the title ~~perf(executor): Deduplicate keys stored in values~~ perf(executor): Store a column either in pk or value, but not both Jun 1, 2022

github-actions bot added the type/perf label Jun 1, 2022

kwannoel force-pushed the kwannoel/mview-pk-repr branch from 4a8bd6c to e6f4582 Compare June 2, 2022 14:41

kwannoel force-pushed the kwannoel/mview-pk-repr branch 4 times, most recently from 9f0f578 to c6bf910 Compare June 6, 2022 11:45

kwannoel force-pushed the kwannoel/mview-pk-repr branch 4 times, most recently from cd10bea to 525714d Compare June 7, 2022 10:14

kwannoel changed the title ~~perf(executor): Store a column either in pk or value, but not both~~ perf(executor): Decode row datums from pk Jun 7, 2022

kwannoel changed the title ~~perf(executor): Decode row datums from pk~~ perf(executor): decode row datums from pk Jun 7, 2022

kwannoel added 6 commits June 7, 2022 19:19

add decode pk cellbasedtable interface

83938ff

add constructor

1c9f180

impl collect data chunk

e9c3066

clean

3d0cb2b

add docs

c5445c0

init row seq scan with pk iter

010380a

kwannoel added 20 commits June 7, 2022 19:19

refactor

754f821

clean

6d48a01

clean

1692e67

cleanup print stmts

71fecd3

fmt

5cc702a

clarify

3cb81ab

refactor

da9e07b

clarify

6c40cda

refactor

e4b02e7

refactor collect_data_chunk

1037438

cleanup

d6cb6e4

refactor

c70156a

refactor

ba941b4

rewrite test for decoding deduped cells

1303a7e

fix writes

1eb1957

fix rebase conflicts + fmt

02fb7dd

clean

d976610

refactor

3526e4e

fix

9f43698

test memcomparable values

0f166e3

kwannoel force-pushed the kwannoel/mview-pk-repr branch from 525714d to 0f166e3 Compare June 7, 2022 11:19

kwannoel marked this pull request as ready for review June 7, 2022 11:24

kwannoel requested review from lmatz, skyzh and wcy-fdu June 7, 2022 11:28

lmatz approved these changes Jun 7, 2022

kwannoel merged commit 109e7d0 into main Jun 7, 2022

kwannoel deleted the kwannoel/mview-pk-repr branch June 7, 2022 12:18

kwannoel mentioned this pull request Jun 8, 2022

perf(executor): implement dedup pk decoding for BatchQueryExecutor #3060

Merged

3 tasks

kwannoel mentioned this pull request Jun 22, 2022

Tracking: Cell encoding - store a column either in pk or value, but not both #3412

Closed

13 tasks
