[HUDI-8382] Improve MOR-Snapshot-Query performance for COW like table #12112

Open
wants to merge 4 commits into
base: branch-0.x

TheR1sing3un
Member

@TheR1sing3un TheR1sing3un commented Oct 16, 2024

In some cases, every file slice in a MOR table's latest view (or in the view at a time-travel instant) contains only a base file and no log files. When a snapshot query runs against such a table, we can treat it as a MOR read-optimized query and provide a HadoopFsRelation to Spark.
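
The condition described above can be sketched as a simple predicate over file slices. This is only an illustration: `FileSlice` here is a hypothetical, simplified stand-in and not Hudi's actual class, and `allBaseFileOnly` is a name made up for this sketch.

```scala
// Hypothetical, simplified model of a file slice (NOT Hudi's real FileSlice class).
final case class FileSlice(baseFile: Option[String], logFiles: Seq[String])

object BaseFileOnlyCheck {
  // True when every slice has a base file and no log files,
  // i.e. there is nothing to merge at read time.
  // Note: in this sketch an empty list of slices also returns true.
  def allBaseFileOnly(slices: Seq[FileSlice]): Boolean =
    slices.forall(s => s.baseFile.isDefined && s.logFiles.isEmpty)
}
```

A table with even one slice that carries log files fails the check and would still need the regular merge-on-read path.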

Change Logs

  1. Treat a MOR snapshot query on a table whose file slices are all base-file-only as a MOR read-optimized query.
    Describe context and summary for this change. Highlight if any code was copied.

Impact

none
Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low medium or high below)

low
If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Oct 16, 2024
@danny0405
Contributor

Is this change related: #12080 ?

…read-optimized query

1. regard mor snapshot query with all base-file-only table as mor read-optimized query

Signed-off-by: TheR1sing3un <[email protected]>
@TheR1sing3un TheR1sing3un force-pushed the feat_optimize_mor_read_with_empty_log branch from be6673e to 083aff2 Compare October 18, 2024 04:52
@TheR1sing3un
Member Author

TheR1sing3un commented Oct 18, 2024

Is this change related: #12080 ?

#12080 optimizes filter pushdown for HoodieBaseRelation by pruning unnecessary columns.
My change focuses on treating a MergeOnReadRelation whose file slices are all base-file-only as a BaseFileOnlyRelation, so that it can fall back to HadoopFsRelation. Spark has many optimizations for HadoopFsRelation that can improve our query performance.
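
As a rough illustration of that dispatch (not the code in this PR), the choice of relation could look like the sketch below. `QueryType`, `RelationKind`, and `RelationChooser` are hypothetical names introduced only for this example, and each slice is reduced to its log-file count.

```scala
// Hypothetical types for illustration only; they do not exist in Hudi.
sealed trait QueryType
case object Snapshot extends QueryType
case object ReadOptimized extends QueryType

sealed trait RelationKind
case object BaseFileOnly extends RelationKind // may fall back to Spark's HadoopFsRelation
case object MergeOnRead extends RelationKind  // merges base and log files at read time

object RelationChooser {
  // sliceLogFileCounts holds the number of log files in each file slice of the queried view.
  def choose(queryType: QueryType, sliceLogFileCounts: Seq[Int]): RelationKind =
    queryType match {
      case ReadOptimized => BaseFileOnly
      // Snapshot query with no log files anywhere: nothing to merge, so read base files only.
      case Snapshot if sliceLogFileCounts.forall(_ == 0) => BaseFileOnly
      case Snapshot => MergeOnRead
    }
}
```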

@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Oct 18, 2024
1. optimize TestDataSkippingWithMORColstats

Signed-off-by: TheR1sing3un <[email protected]>
1. fix TestNestedSchemaPruningOptimization

Signed-off-by: TheR1sing3un <[email protected]>
1. fix TestSparkSqlWithCustomKeyGenerator, need unified time conversion for mor and cow in the future to solve the root problem

Signed-off-by: TheR1sing3un <[email protected]>
@@ -220,6 +220,15 @@ object DataSourceReadOptions {

val INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT: ConfigProperty[String] = HoodieCommonConfig.INCREMENTAL_READ_HANDLE_HOLLOW_COMMIT

val ENABLE_OPTIMIZED_READ_FOR_MOR_WITH_ALL_BASE_FILE_ONLY_SLICE: ConfigProperty[Boolean] = ConfigProperty
Contributor

If it's a pure optimization, let's eliminate this option. cc @jonvex and @yihua to take a look too.

Contributor

Isn't that just a read optimized query?

@TheR1sing3un
Member Author

@hudi-bot run azure

@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@jonvex
Contributor

jonvex commented Oct 21, 2024

We already have all of this optimized with the new filegroup reader, so I'm not sure how much longer the relation implementations will be around anyway.

Labels
size:M PR with lines of changes in (100, 300]
4 participants