
[HUDI-7190] Fix nested columns vectorized read for spark33+ legacy formats #10265

Conversation

stream2000
Contributor

@stream2000 stream2000 commented Dec 7, 2023

Change Logs

Since Spark 3.3, Spark supports vectorized reads for nested columns. However, when both
spark.sql.parquet.enableNestedColumnVectorizedReader=true and
spark.sql.parquet.enableVectorizedReader=true are set, Hudi throws the following exception:

Job aborted due to stage failure: Task 0 in stage 28.0 failed 1 times, most recent failure: Lost task 0.0 in stage 28.0 (TID 51) (30.221.100.176 executor driver): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.UnsafeRow cannot be cast to org.apache.spark.sql.vectorized.ColumnarBatch
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:560)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.next(DataSourceScanExec.scala:549)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

(The driver stacktrace repeats the same trace as above.)

We need to fix vectorized reads of nested types in Spark33LegacyHoodieParquetFileFormat, Spark34LegacyHoodieParquetFileFormat, and Spark35LegacyHoodieParquetFileFormat.

Impact

Enables vectorized reads for nested types by default when using the legacy Parquet file formats. However, for schema-on-read or implicit nested type changes, spark.sql.parquet.enableVectorizedReader=false must be set to run the query.
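For clarity, the session-level settings involved can be sketched as follows (setting names are taken from the description above; this is an illustrative fragment, not part of the patch):

```sql
-- With this fix, nested-type vectorized reads work with the legacy formats:
SET spark.sql.parquet.enableVectorizedReader=true;
SET spark.sql.parquet.enableNestedColumnVectorizedReader=true;

-- For schema-on-read or implicit nested type changes, fall back to row-based reads:
SET spark.sql.parquet.enableVectorizedReader=false;
```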

Risk level

medium

Documentation Update

NONE

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@@ -120,9 +120,7 @@ class Spark33LegacyHoodieParquetFileFormat(private val shouldAppendPartitionValu
val resultSchema = StructType(partitionSchema.fields ++ requiredSchema.fields)
val sqlConf = sparkSession.sessionState.conf
val enableOffHeapColumnVector = sqlConf.offHeapColumnVectorEnabled
val enableVectorizedReader: Boolean =
stream2000 (Contributor, Author) commented:
For reviewers: in Spark 3.3+, the following code is used to check whether a vectorized (batch) read can be used:

  override def supportBatch(sparkSession: SparkSession, schema: StructType): Boolean = {
    val conf = sparkSession.sessionState.conf
    ParquetUtils.isBatchReadSupportedForSchema(conf, schema) && conf.wholeStageEnabled &&
      !WholeStageCodegenExec.isTooManyFields(conf, schema)
  }

So nested types can support vectorized reads since Spark 3.3.
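To illustrate the shape of that decision, here is a small self-contained Python model (not Spark's actual ParquetUtils code; the type classes are hypothetical stand-ins): atomic types are always batch-readable, while nested types qualify only when the nested-column vectorized reader is enabled, as in Spark 3.3+.

```python
# Simplified, self-contained model of a batch-read support check for a schema.
# Atomic, Struct, and Array are hypothetical stand-ins for Spark's data types.
from dataclasses import dataclass

@dataclass(frozen=True)
class Atomic:
    name: str  # e.g. "int", "string"

@dataclass(frozen=True)
class Struct:
    fields: tuple

@dataclass(frozen=True)
class Array:
    element: object

def batch_read_supported(dtype, nested_enabled: bool) -> bool:
    """Atomic types are always batch-readable; nested types only when the
    nested-column vectorized reader is enabled."""
    if isinstance(dtype, Atomic):
        return True
    if isinstance(dtype, Array):
        return nested_enabled and batch_read_supported(dtype.element, nested_enabled)
    if isinstance(dtype, Struct):
        return nested_enabled and all(
            batch_read_supported(f, nested_enabled) for f in dtype.fields
        )
    return False
```

With the nested flag off, any schema containing a struct or array falls back to row-based reads, which is why the legacy formats must make supportBatch and the actual reader path agree.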

stream2000 force-pushed the HUDI-7190_fix_nested_type_vectorized_read_for_spark3.3plus branch from f5233e3 to 38ad80b on December 7, 2023 at 07:55
"hoodie.datasource.read.use.new.parquet.file.format" -> "false",
"hoodie.file.group.reader.enabled" -> "false",
"spark.sql.parquet.enableNestedColumnVectorizedReader" -> "true",
"spark.sql.parquet.enableVectorizedReader" -> "true") {
Reviewer (Contributor) commented:

will this test cover all the spark releases above 3.3.0 ?

stream2000 (Contributor, Author) commented:

This test should cover all spark versions and not throw any exceptions.

Reviewer (Contributor) commented:

Hmm, saw some Travis failures.

stream2000 (Contributor, Author) commented:

Yes, this is caused by my modification. I'm trying to fix them.

stream2000 force-pushed the HUDI-7190_fix_nested_type_vectorized_read_for_spark3.3plus branch from 88a6f69 to dbcebc5 on December 8, 2023 at 08:50
bvaradar (Contributor) left a comment:

@stream2000 : Just checking if you are still working on the tests ?

stream2000 (Contributor, Author) commented, quoting the above:
@stream2000 : Just checking if you are still working on the tests ?

Sorry for the late reply, I was busy with other stuff. Will fix the test ASAP.

stream2000 force-pushed the HUDI-7190_fix_nested_type_vectorized_read_for_spark3.3plus branch from dbcebc5 to e1423a8 on December 14, 2023 at 05:36
stream2000 force-pushed the HUDI-7190_fix_nested_type_vectorized_read_for_spark3.3plus branch from acf2f8f to 464d5bd on December 14, 2023 at 13:32
stream2000 force-pushed the HUDI-7190_fix_nested_type_vectorized_read_for_spark3.3plus branch 2 times, most recently from 939389e to c63e93d, on December 15, 2023 at 08:19
stream2000 (Contributor, Author) commented:

@hudi-bot run azure

hudi-bot commented:
CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure (re-run the last Azure build)

stream2000 (Contributor, Author) commented:

@danny0405 @yihua @xiarixiaoyao Hi, could you help review this PR?

bvaradar (Contributor) left a comment:

LGTM

@bvaradar bvaradar merged commit d0916cb into apache:master Dec 20, 2023
31 checks passed
yihua pushed a commit that referenced this pull request Feb 27, 2024
…rmats (#10265)

* [HUDI-7190] Fix legacy parquet format nested columns vectorized read for spark3.3+
* Fix nested type implicit schema evolution
* fix legacy format support batch read
* Add exception messages when vectorized read nested type with type change