feat(batch): support batch s3 parquet frontend part #17625
tokio::task::block_in_place(|| {
    tokio::runtime::Handle::current().block_on(async {
        let parquet_stream_builder = create_parquet_stream_builder(
            eval_args[2].clone(),
            eval_args[3].clone(),
            eval_args[4].clone(),
            eval_args[5].clone(),
        )
        .await?;

        let mut rw_types = vec![];
        for field in parquet_stream_builder.schema().fields() {
            rw_types.push((
                field.name().to_string(),
                IcebergArrowConvert.type_from_field(field)?,
            ));
        }

        Ok::<risingwave_common::types::DataType, anyhow::Error>(DataType::Struct(
            StructType::new(rw_types),
        ))
    })
})?
Derive the schema from a parquet file in the planner.
LGTM.
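As a rough illustration of the schema derivation discussed in this thread, the sketch below maps a list of file fields to a single struct type using only the standard library. `FileField`, `FileType`, `RwType`, `convert_type`, and `struct_type_from_fields` are hypothetical stand-ins, not the arrow, parquet, or RisingWave types used in the PR. The early return on an unsupported type mirrors the `?` after `type_from_field` in the snippet above.

```rust
/// Hypothetical stand-in for a column type read from file metadata.
#[derive(Debug, Clone, PartialEq)]
enum FileType {
    Int64,
    Utf8,
    Float64,
    /// A type the engine cannot represent (illustrative only).
    Unsupported(String),
}

/// Hypothetical stand-in for the engine-side data type.
#[derive(Debug, Clone, PartialEq)]
enum RwType {
    Int64,
    Varchar,
    Float64,
    Struct(Vec<(String, RwType)>),
}

/// Hypothetical field: a column name plus its file-side type.
struct FileField {
    name: String,
    ty: FileType,
}

/// Converts one file-side type, failing on anything unsupported.
fn convert_type(ty: &FileType) -> Result<RwType, String> {
    match ty {
        FileType::Int64 => Ok(RwType::Int64),
        FileType::Utf8 => Ok(RwType::Varchar),
        FileType::Float64 => Ok(RwType::Float64),
        FileType::Unsupported(name) => Err(format!("unsupported file type: {}", name)),
    }
}

/// Builds one struct type whose members are the file's columns, in file order.
fn struct_type_from_fields(fields: &[FileField]) -> Result<RwType, String> {
    let mut rw_types = Vec::with_capacity(fields.len());
    for field in fields {
        // Stop at the first column that cannot be converted.
        rw_types.push((field.name.clone(), convert_type(&field.ty)?));
    }
    Ok(RwType::Struct(rw_types))
}
```

Calling `struct_type_from_fields` with an `Int64` column `id` and a `Utf8` column `name` yields a struct with those two members in the same order; any `Unsupported` column turns the whole result into an error.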
@wangrunji0408 the deterministic test reports the following error. Does it mean that we can't use
No, we can't use
I want to call an
Not sure if it's a good practice to include asynchronous logic when creating an expression. 😢
True, but I don't have a better idea, because we need to fetch data to determine the schema for a LogicalPlan while the planner can't go across await points.
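The constraint raised here is that planner code is synchronous while the schema lookup is a future. The sketch below is a standard-library-only illustration of driving a future to completion from a synchronous function. It is not tokio's `block_on`, and it is not the code in this PR. `ThreadWaker`, `block_on_current_thread`, `FetchColumnCount`, and `column_count_sync` are hypothetical names.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Unparks the blocked thread when the future can make progress.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Polls `fut` on the calling thread until it completes, parking in between.
fn block_on_current_thread<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(out) => return out,
            // Sleep until `wake` unparks us; a spurious wakeup only re-polls.
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical future standing in for an async metadata read:
/// pending on the first poll, ready on the second.
struct FetchColumnCount {
    polled_once: bool,
    columns: usize,
}

impl Future for FetchColumnCount {
    type Output = usize;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<usize> {
        if self.polled_once {
            Poll::Ready(self.columns)
        } else {
            self.polled_once = true;
            // Request another poll; a real I/O future would do this when data arrives.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// A synchronous function, as planner code is, that needs an async result.
fn column_count_sync(columns: usize) -> usize {
    block_on_current_thread(FetchColumnCount { polled_once: false, columns })
}
```

The trade-off is the one discussed in this thread: the calling thread is occupied until the future finishes, so blocking like this inside an async runtime needs care.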
fn prune_col(&self, required_cols: &[usize], _ctx: &mut ColumnPruningContext) -> PlanRef {
    LogicalProject::with_out_col_idx(self.clone().into(), required_cols.iter().cloned()).into()
}
qq: Will we support column pruning on the FileSource? This might be a necessary feature for columnar storage formats like Parquet, right?
I guess it is doable 🤔
batch_stream_builder = batch_stream_builder.with_projection(ProjectionMask::all());
Yes, we can support it later.
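To make the deferred column pruning concrete, here is a minimal standard-library sketch of the bookkeeping it would need. It assumes the scan reads the required columns once each, in file order, and that a projection on top restores the requested order, much as `prune_col` above wraps the node in a `LogicalProject`. `plan_pruned_scan` is a hypothetical helper, not part of this PR. In the parquet crate, the resulting read indices would be the kind of input that `ProjectionMask::roots` or `ProjectionMask::leaves` accepts in place of `ProjectionMask::all()`.

```rust
/// Returns `(read_indices, output_positions)` for a pruned scan.
///
/// `read_indices` are the distinct file columns to read, in file order.
/// `output_positions[i]` is where `required_cols[i]` sits within `read_indices`,
/// so a projection on top of the scan can restore the requested order.
fn plan_pruned_scan(
    required_cols: &[usize],
    num_columns: usize,
) -> Result<(Vec<usize>, Vec<usize>), String> {
    // Reject indices that do not exist in the file schema.
    if let Some(&bad) = required_cols.iter().find(|&&c| c >= num_columns) {
        return Err(format!(
            "column index {} out of range for {} columns",
            bad, num_columns
        ));
    }

    // Read each needed column once, in file order.
    let mut read_indices = required_cols.to_vec();
    read_indices.sort_unstable();
    read_indices.dedup();

    // Map every requested column to its slot in the pruned scan output.
    let output_positions: Vec<usize> = required_cols
        .iter()
        .map(|c| {
            read_indices
                .binary_search(c)
                .expect("every required column is present in read_indices")
        })
        .collect();

    Ok((read_indices, output_positions))
}
```

For `required_cols = [3, 1, 3]` on a five-column file, the scan would read columns `[1, 3]` and the projection would emit positions `[1, 0, 1]`.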
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Support select * from file_scan('parquet', 's3', s3_region, s3_access_key, s3_secret_key, file_location). First, the planner creates a LogicalTableFunction. Second, the LogicalTableFunction is transformed into a LogicalFileScan by TableFunctionToFileScanRule. Finally, the LogicalFileScan is transformed into a BatchFileScan. To keep the PR from growing too large, the batch fragmentation and scheduler parts will be implemented in a later PR.
Checklist
./risedev check (or alias, ./risedev c)
Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.
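The plan rewrite chain described under "What's changed and what's your intention?" can be sketched with a toy plan enum. This is a heavily simplified, standard-library-only illustration. The `Plan` enum, its fields, and the `table_function_to_file_scan` and `to_batch` functions are hypothetical and do not reflect the actual RisingWave plan node or rule interfaces. The six-argument check follows the `file_scan` call shown in the description.

```rust
/// Hypothetical, simplified plan nodes; real nodes carry schemas, contexts, and more.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    LogicalTableFunction { name: String, args: Vec<String> },
    LogicalFileScan { format: String, storage: String, location: String },
    BatchFileScan { format: String, storage: String, location: String },
}

/// Rewrite in the spirit of TableFunctionToFileScanRule.
/// Returns `None` when the node is not a six-argument `file_scan` call.
fn table_function_to_file_scan(plan: &Plan) -> Option<Plan> {
    match plan {
        Plan::LogicalTableFunction { name, args }
            if name == "file_scan" && args.len() == 6 =>
        {
            // Argument order: format, storage, region, access key, secret key, location.
            Some(Plan::LogicalFileScan {
                format: args[0].clone(),
                storage: args[1].clone(),
                location: args[5].clone(),
            })
        }
        _ => None,
    }
}

/// Lowers the logical scan to its batch counterpart.
fn to_batch(plan: &Plan) -> Option<Plan> {
    match plan {
        Plan::LogicalFileScan { format, storage, location } => Some(Plan::BatchFileScan {
            format: format.clone(),
            storage: storage.clone(),
            location: location.clone(),
        }),
        _ => None,
    }
}
```

Applying `table_function_to_file_scan` and then `to_batch` to a `file_scan` table function walks the same three stages as the description: table function, logical file scan, batch file scan. Nodes that do not match a rule are left for other rules by returning `None`.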