feat(batch): support batch s3 parquet frontend part #17625

chenzl25 · 2024-07-09T07:03:55Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Related issue: Feat: Batch ingest iceberg/file source #14742
This PR adds the frontend part of the batch S3 Parquet file table function: select * from file_scan('parquet', 's3', s3_region, s3_access_key, s3_secret_key, file_location). First, the planner creates a LogicalTableFunction. Second, TableFunctionToFileScanRule transforms the LogicalTableFunction into a LogicalFileScan. Finally, the LogicalFileScan is converted into a BatchFileScan. To keep this PR from growing too large, the batch fragmentation and scheduler parts will be implemented in a follow-up PR.
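The three planning steps described above can be pictured with a small, self-contained sketch. Everything in it is a simplified assumption made for illustration: the `Plan` enum, the `FileScan` struct, and the helper functions `table_function_to_file_scan` and `to_batch` are hypothetical stand-ins, not the actual RisingWave plan nodes or rule API.

```rust
// Hypothetical, simplified plan nodes; the real plan nodes carry far more state.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// What the planner produces for any table function call.
    LogicalTableFunction { name: String, args: Vec<String> },
    /// What the rewrite rule produces when the call is `file_scan`.
    LogicalFileScan(FileScan),
    /// Batch counterpart of `LogicalFileScan`.
    BatchFileScan(FileScan),
}

#[derive(Debug, Clone, PartialEq)]
struct FileScan {
    format: String,
    storage: String,
    region: String,
    access_key: String,
    secret_key: String,
    location: String,
}

/// Stand-in for a rule like `TableFunctionToFileScanRule`.
/// Returns `None` when the rule does not apply to the given node.
fn table_function_to_file_scan(plan: &Plan) -> Option<Plan> {
    match plan {
        Plan::LogicalTableFunction { name, args } if name == "file_scan" => {
            // The call is expected to carry exactly six arguments.
            match args.as_slice() {
                [format, storage, region, access_key, secret_key, location] => {
                    Some(Plan::LogicalFileScan(FileScan {
                        format: format.clone(),
                        storage: storage.clone(),
                        region: region.clone(),
                        access_key: access_key.clone(),
                        secret_key: secret_key.clone(),
                        location: location.clone(),
                    }))
                }
                _ => None,
            }
        }
        _ => None,
    }
}

/// Stand-in for the logical-to-batch conversion; other nodes pass through unchanged.
fn to_batch(plan: Plan) -> Plan {
    match plan {
        Plan::LogicalFileScan(scan) => Plan::BatchFileScan(scan),
        other => other,
    }
}
```

In this sketch the rule only fires for a `file_scan` call with six arguments; any other table function is left untouched, which mirrors how an optimizer rule typically declines to rewrite nodes it does not recognize.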

Checklist

I have written necessary rustdoc comments
I have added necessary unit tests and integration tests
I have added test labels as necessary. See details.
I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
All checks passed in ./risedev check (or alias, ./risedev c)
My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)

My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

chenzl25 · 2024-07-09T07:06:11Z

src/frontend/src/expr/table_function.rs

+ tokio::task::block_in_place(|| {
+     tokio::runtime::Handle::current().block_on(async {
+         let parquet_stream_builder = create_parquet_stream_builder(
+             eval_args[2].clone(),
+             eval_args[3].clone(),
+             eval_args[4].clone(),
+             eval_args[5].clone(),
+         )
+         .await?;
+
+         let mut rw_types = vec![];
+         for field in parquet_stream_builder.schema().fields() {
+             rw_types.push((
+                 field.name().to_string(),
+                 IcebergArrowConvert.type_from_field(field)?,
+             ));
+         }
+
+         Ok::<risingwave_common::types::DataType, anyhow::Error>(DataType::Struct(
+             StructType::new(rw_types),
+         ))
+     })
+ })?


Derive the schema from a parquet file in the planner.

fuyufjh

LGTM.

chenzl25 · 2024-07-10T08:19:28Z

@wangrunji0408 the deterministic test reports the following error. Does it mean that we can't use block_in_place?

error[E0425]: cannot find function `block_in_place` in module `tokio::task`
   --> src/frontend/src/expr/table_function.rs:135:26
    |
135 |             tokio::task::block_in_place(|| {
    |                          ^^^^^^^^^^^^^^ not found in `tokio::task`

wangrunji0408 · 2024-07-10T08:28:25Z

No, we can't use block_in_place in simulation because there will be only one thread.
Why is block_in_place necessary here?

chenzl25 · 2024-07-10T08:46:17Z

I want to call an async function and fetch its return type with a non-async function. Do you have any suggestions for achieving this goal?

BugenZhao · 2024-07-11T03:48:44Z

Not sure if it's a good practice to include asynchronous logic when creating an expression. 😢

chenzl25 · 2024-07-11T07:21:42Z

True, but I don't have a better idea, because we need to fetch data to determine the schema for a LogicalPlan while the planner can't go across await points.
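For readers following the thread, the general shape of "call an async function from a non-async function" can be sketched with the standard library alone. This is not what the PR does (the PR uses `tokio::task::block_in_place` together with `Handle::block_on`); it is a minimal hand-rolled `block_on` that parks the calling thread until the future completes. The names `ThreadWaker`, `fetch_schema`, and `infer_schema_sync` are hypothetical, and `fetch_schema` returns made-up columns instead of reading a Parquet file.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Waker that unparks the thread blocked inside `block_on`.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Drives `fut` to completion on the calling thread, parking while it is pending.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            // Sleep until the waker fires; spurious wake-ups just poll again.
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical async schema lookup; a real one would read file metadata.
async fn fetch_schema(location: &str) -> Result<Vec<(String, String)>, String> {
    if location.is_empty() {
        return Err("empty file location".to_string());
    }
    Ok(vec![
        ("id".to_string(), "int8".to_string()),
        ("name".to_string(), "varchar".to_string()),
    ])
}

/// Synchronous entry point, as a planner-style caller would need.
fn infer_schema_sync(location: &str) -> Result<Vec<(String, String)>, String> {
    block_on(fetch_schema(location))
}
```

A loop like this blocks the calling thread for the whole duration of the future. If that thread is an async runtime worker, the worker stalls, and a future that depends on the runtime's own I/O or timer drivers may never make progress. That is the trade-off behind the concern raised above about putting asynchronous work into expression construction.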

st1page · 2024-07-11T08:31:11Z

src/frontend/src/optimizer/plan_node/logical_file_scan.rs

+ fn prune_col(&self, required_cols: &[usize], _ctx: &mut ColumnPruningContext) -> PlanRef {
+     LogicalProject::with_out_col_idx(self.clone().into(), required_cols.iter().cloned()).into()
+ }


qq: Will we support column pruning on the FileSource? This might be a necessary feature for columnar storage formats like Parquet, right?

I guess it is doable 🤔

risingwave/src/batch/src/executor/s3_file_scan.rs

Line 119 in 607a2af

batch_stream_builder = batch_stream_builder.with_projection(ProjectionMask::all());

Yes, we can support it later.
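As a rough illustration of what pushing column pruning into the scan could look like, the sketch below turns the `required_cols` indices into a per-column keep/skip mask and computes the pruned output schema. It is an assumption-laden toy: `ColumnMask`, `mask_for`, and `pruned_schema` are hypothetical helpers, and `ColumnMask` only imitates the idea of a projection mask narrower than `ProjectionMask::all()`; it is not the `parquet` crate's type.

```rust
/// Hypothetical stand-in for a projection mask: one keep/skip flag per file column.
#[derive(Debug, PartialEq)]
struct ColumnMask(Vec<bool>);

/// Builds a mask that keeps only `required_cols`.
/// Returns `None` if any index is out of range for the file's column count.
fn mask_for(required_cols: &[usize], num_columns: usize) -> Option<ColumnMask> {
    let mut keep = vec![false; num_columns];
    for &idx in required_cols {
        *keep.get_mut(idx)? = true;
    }
    Some(ColumnMask(keep))
}

/// Output schema after pruning, in the order requested by the parent operator.
/// Returns `None` if any index is out of range for `schema`.
fn pruned_schema<'a>(schema: &[&'a str], required_cols: &[usize]) -> Option<Vec<&'a str>> {
    required_cols
        .iter()
        .map(|&i| schema.get(i).copied())
        .collect()
}
```

A mask of this shape only records which columns to read, not the order the parent asked for, so a projection on top of the scan may still be needed when `required_cols` reorders or repeats columns.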

chenzl25 added 4 commits July 8, 2024 16:18

support batch s3 parquet file executor

1e68043

update Cargo lock

c6f4c6e

support frontend part

64f72f9

resolve conflicts

451f2f1

chenzl25 requested a review from a team as a code owner July 9, 2024 07:03

chenzl25 requested review from MrCroxx, fuyufjh and st1page July 9, 2024 07:03

github-actions bot added the type/feature label Jul 9, 2024

chenzl25 requested review from xiangjinwu and BugenZhao July 9, 2024 07:04

chenzl25 added 3 commits July 9, 2024 17:04

refactor

bdde66c

fmt

d5bc942

fmt

cc1a2f2

fuyufjh approved these changes Jul 10, 2024

fmt

27d75b0

chenzl25 and others added 2 commits July 11, 2024 15:30

bypass madsim

1cedcb3

Merge branch 'main' into dylan/support_batch_s3_parquet_frontend

c0733e4

st1page reviewed Jul 11, 2024

chenzl25 added this pull request to the merge queue Jul 12, 2024

Merged via the queue into main with commit 45c9e2b Jul 12, 2024
31 of 32 checks passed

chenzl25 deleted the dylan/support_batch_s3_parquet_frontend branch July 12, 2024 08:21

neverchanje mentioned this pull request Jul 12, 2024

Document: feat(batch): support batch s3 parquet frontend part risingwavelabs/risingwave-docs#2369

Closed

chenzl25 mentioned this pull request Jul 12, 2024

feat(batch): support batch read s3 parquet file #17673

Merged

9 tasks

BugenZhao mentioned this pull request Sep 18, 2024

risingwave 2.0.0 risingwavelabs/homebrew-risingwave#44

Closed

1 task
