[SUPPORT] MOR table behavior for Spark Bulk insert to COW #12133

Open
geserdugarov opened this issue Oct 21, 2024 · 2 comments

geserdugarov (Contributor) commented Oct 21, 2024

I've already created issue HUDI-8394, but I want to highlight and discuss this problem here.
I suppose this is a critical issue on the current master when the following are combined (a DataFrame-writer sketch of the same setup follows the list):

  • bulk insert operation,
  • hoodie.datasource.write.row.writer.enable = false,
  • simple bucket index.
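
For reference, a minimal DataFrame-writer sketch of the same setup (the option keys mirror the SQL repro below; the DataFrame df, table name, and target path are hypothetical placeholders):

// assumes a DataFrame `df` with columns id, name, price, ts, dt
df.write.format("hudi").
  option("hoodie.table.name", "hudi_cow_bucket_tbl").                  // hypothetical table name
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.datasource.write.row.writer.enable", "false").
  option("hoodie.datasource.write.recordkey.field", "id,name").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.index.type", "BUCKET").
  option("hoodie.index.bucket.engine", "SIMPLE").
  option("hoodie.bucket.index.num.buckets", "2").
  option("hoodie.bucket.index.hash.field", "id,name").
  mode("append").
  save("/tmp/hudi_cow_bucket_tbl")                                     // hypothetical path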

Describe the problem you faced

When I try to bulk insert into a COW table, I see both parquet and log files in the file system, which is MOR table behavior.

I've checked that the table is of COW type:

cat ./.hoodie/hoodie.properties 
# ...
# hoodie.table.type=COPY_ON_WRITE       <-- COW table
# ...

But the files are not what a COW table should contain:

ll ./dt\=2021-01-05/
# total 456
# drwxr-xr-x 2 d00838679 d00838679   4096 Oct 19 15:33 ./
# drwxrwxr-x 4 d00838679 d00838679   4096 Oct 19 15:32 ../
# -rw-r--r-- 1 d00838679 d00838679 435346 Oct 19 15:32 00000001-4a79-47b3-918c-05f8b90e8b14-0_1-14-12_20241019083242289.parquet               <-- base file
# -rw-r--r-- 1 d00838679 d00838679   3412 Oct 19 15:32 .00000001-4a79-47b3-918c-05f8b90e8b14-0_1-14-12_20241019083242289.parquet.crc
# -rw-r--r-- 1 d00838679 d00838679    978 Oct 19 15:33 .00000001-4a79-47b3-918c-05f8b90e8b14-0_20241019083307134.log.1_0-30-31                <-- log file as for MOR table
# -rw-r--r-- 1 d00838679 d00838679     16 Oct 19 15:33 ..00000001-4a79-47b3-918c-05f8b90e8b14-0_20241019083307134.log.1_0-30-31.crc
# -rw-r--r-- 1 d00838679 d00838679     96 Oct 19 15:32 .hoodie_partition_metadata
# -rw-r--r-- 1 d00838679 d00838679     12 Oct 19 15:32 ..hoodie_partition_metadata.crc

To Reproduce

To reproduce, the existing test Test Bulk Insert Into Bucket Index Table can be modified and used as follows:

test("Test Bulk Insert Into Bucket Index Table") {
  withSQLConf("hoodie.datasource.write.operation" -> "bulk_insert", "hoodie.bulkinsert.shuffle.parallelism" -> "1") {
    withTempDir { tmp =>
      val tableName = generateTableName
      // Create a partitioned table
      spark.sql(
        s"""
            |create table $tableName (
            |  id int,
            |  dt string,
            |  name string,
            |  price double,
            |  ts long
            |) using hudi
            | tblproperties (
            | primaryKey = 'id,name',
            | type = 'cow',
            | preCombineField = 'ts',
            | hoodie.index.type = 'BUCKET',
            | hoodie.index.bucket.engine = 'SIMPLE',
            | hoodie.bucket.index.num.buckets = '2',
            | hoodie.bucket.index.hash.field = 'id,name',
            | hoodie.datasource.write.row.writer.enable = 'false')
            | partitioned by (dt)
            | location '${tmp.getCanonicalPath}'
            """.stripMargin)
      spark.sql(
        s"""
            | insert into $tableName values
            | (5, 'a1,1', 10, 1000, "2021-01-05")
            """.stripMargin)
      spark.sql(
        s"""
            | insert into $tableName values
            | (9, 'a3,3', 30, 3000, "2021-01-05")
         """.stripMargin)
    }
  }
}

Expected behavior

For a COW table, only parquet files should be created.
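
A check along these lines could be appended to the test above to make the expectation explicit (a minimal sketch; the partition path dt=2021-01-05 comes from the repro, everything else is an assumption):

// inside the withTempDir block, after the two inserts:
// the partition directory of a COW table should contain no MOR log files
val partitionDir = new java.io.File(tmp.getCanonicalPath, "dt=2021-01-05")
val logFiles = partitionDir.listFiles().filter(_.getName.contains(".log."))
assert(logFiles.isEmpty, s"COW table should not have log files, found: ${logFiles.mkString(", ")}")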

Environment Description

  • Hudi version : current master

  • Spark version : 3.5

geserdugarov (Contributor, Author) commented:

Currently, for this case BucketIndexBulkInsertPartitioner is used:

public Option<WriteHandleFactory> getWriteHandleFactory(int idx) {
  // doAppend decides per bucket: append to an already-initialized file group (log file),
  // or create a brand-new base file for it
  return doAppend.get(idx)
      ? Option.of(new AppendHandleFactory())
      : Option.of(new SingleFileHandleCreateFactory(FSUtils.createNewFileId(getFileIdPfx(idx), 0), this.preserveHoodieMetadata));
}

The first insert uses SingleFileHandleCreateFactory, but the second insert uses AppendHandleFactory and creates a log file.

I don't understand how bulk insert into a COW table with a simple bucket index should work by design. When we insert data that should update previous data, should we write a new parquet file with only the new data and then call inline compaction (because of the COW table type), or should we merge and write the data into a new parquet file, in which case it is no longer a bulk insert?

geserdugarov changed the title from "[SUPPORT] MOR table behavior for Spark bulk insert to COW" to "[SUPPORT] MOR table behavior for Spark Bulk insert to COW" on Oct 21, 2024
danny0405 (Contributor) commented:

Bulk_insert should only be executed once IMO; for the second write, you should use the upsert operation instead.
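
Following that suggestion against the repro above, a minimal sketch (assuming the Spark SQL writer picks up the session-level hoodie.datasource.write.operation setting, as withSQLConf does in the test):

// initial load only: bulk_insert
spark.sql("set hoodie.datasource.write.operation=bulk_insert")
spark.sql(s"insert into $tableName values (5, 'a1,1', 10, 1000, '2021-01-05')")

// later writes: upsert instead of bulk_insert
spark.sql("set hoodie.datasource.write.operation=upsert")
spark.sql(s"insert into $tableName values (9, 'a3,3', 30, 3000, '2021-01-05')")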
