-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-7190] Fix nested columns vectorized read for spark33+ legacy formats #10265
[HUDI-7190] Fix nested columns vectorized read for spark33+ legacy formats #10265
Conversation
@@ -120,9 +120,7 @@ class Spark33LegacyHoodieParquetFileFormat(private val shouldAppendPartitionValu | |||
val resultSchema = StructType(partitionSchema.fields ++ requiredSchema.fields) | |||
val sqlConf = sparkSession.sessionState.conf | |||
val enableOffHeapColumnVector = sqlConf.offHeapColumnVectorEnabled | |||
val enableVectorizedReader: Boolean = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reviewers: In Spark3.3+, we will use the following code to check if we can do vecrized read:
override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
val conf = sparkSession.sessionState.conf
ParquetUtils.isBatchReadSupportedForSchema(conf, schema) && conf.wholeStageEnabled &&
!WholeStageCodegenExec.isTooManyFields(conf, schema)
}
So nested type can support vectorized read since Spark 3.3.
f5233e3
to
38ad80b
Compare
"hoodie.datasource.read.use.new.parquet.file.format" -> "false", | ||
"hoodie.file.group.reader.enabled" -> "false", | ||
"spark.sql.parquet.enableNestedColumnVectorizedReader" -> "true", | ||
"spark.sql.parquet.enableVectorizedReader" -> "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this test cover all the spark releases above 3.3.0 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test should cover all spark versions and not throw any exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, saw some Travis failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is caused by my modification. I'm trying to fix them.
88a6f69
to
dbcebc5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stream2000 : Just checking if you are still working on the tests ?
Sorry for the late reply, I was busy with other stuff. Will fix the test ASAP. |
dbcebc5
to
e1423a8
Compare
acf2f8f
to
464d5bd
Compare
939389e
to
c63e93d
Compare
@hudi-bot run azure |
@danny0405 @yihua @xiarixiaoyao Hi, could you help review this pr~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Change Logs
For Spark3.3+ version, we can do vectorized read for nested columns. However when
spark.sql.parquet.enableNestedColumnVectorizedReader = true
andset spark.sql.parquet.enableVectorizedReader = true
is set, hudi will throw the following exception:We need to Fix Spark33LegacyHoodieParquetFileFormat/Spark34LegacyHoodieParquetFileFormat/Spark35LegacyHoodieParquetFileFormat vectorized read nested types
Impact
Enable vectorized reading for nested types when using legacy parquet formats by default. Howeverfor schema on read or implicit nested type change we need to set
set spark.sql.parquet.enableVectorizedReader =false
to run the query.Risk level (write none, low medium or high below)
medium
Documentation Update
NONE
Contributor's checklist