[SUPPORT] MOR table behavior for Spark Bulk insert to COW #12133

Open
geserdugarov opened this issue Oct 21, 2024 · 2 comments

geserdugarov (Contributor) commented Oct 21, 2024

I've already created issue HUDI-8394, but I want to highlight and discuss this problem here.
I suppose this is a critical issue on the current master when the following are combined (a DataFrame-writer sketch of the same setup follows the list):

  • bulk insert operation,
  • hoodie.datasource.write.row.writer.enable = false,
  • simple bucket index.
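
For reference, a minimal DataFrame-writer sketch of the same setup (the option keys mirror the SQL repro below; the DataFrame df, table name, and target path are hypothetical placeholders):

// assumes a DataFrame `df` with columns id, name, price, ts, dt
df.write.format("hudi").
  option("hoodie.table.name", "hudi_cow_bucket_tbl").                  // hypothetical table name
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.datasource.write.row.writer.enable", "false").
  option("hoodie.datasource.write.recordkey.field", "id,name").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.index.type", "BUCKET").
  option("hoodie.index.bucket.engine", "SIMPLE").
  option("hoodie.bucket.index.num.buckets", "2").
  option("hoodie.bucket.index.hash.field", "id,name").
  mode("append").
  save("/tmp/hudi_cow_bucket_tbl")                                     // hypothetical path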

Describe the problem you faced

When I try to bulk insert into a COW table, I see both parquet and log files in the file system, which is MOR table behavior.

I've checked that the table is of COW type:

cat ./.hoodie/hoodie.properties 
# ...
# hoodie.table.type=COPY_ON_WRITE       <-- COW table
# ...

But the files are not what a COW table should contain:

ll ./dt\=2021-01-05/
# total 456
# drwxr-xr-x 2 d00838679 d00838679   4096 Oct 19 15:33 ./
# drwxrwxr-x 4 d00838679 d00838679   4096 Oct 19 15:32 ../
# -rw-r--r-- 1 d00838679 d00838679 435346 Oct 19 15:32 00000001-4a79-47b3-918c-05f8b90e8b14-0_1-14-12_20241019083242289.parquet               <-- base file
# -rw-r--r-- 1 d00838679 d00838679   3412 Oct 19 15:32 .00000001-4a79-47b3-918c-05f8b90e8b14-0_1-14-12_20241019083242289.parquet.crc
# -rw-r--r-- 1 d00838679 d00838679    978 Oct 19 15:33 .00000001-4a79-47b3-918c-05f8b90e8b14-0_20241019083307134.log.1_0-30-31                <-- log file as for MOR table
# -rw-r--r-- 1 d00838679 d00838679     16 Oct 19 15:33 ..00000001-4a79-47b3-918c-05f8b90e8b14-0_20241019083307134.log.1_0-30-31.crc
# -rw-r--r-- 1 d00838679 d00838679     96 Oct 19 15:32 .hoodie_partition_metadata
# -rw-r--r-- 1 d00838679 d00838679     12 Oct 19 15:32 ..hoodie_partition_metadata.crc

To Reproduce

To reproduce, the existing test Test Bulk Insert Into Bucket Index Table can be modified and used as follows:

test("Test Bulk Insert Into Bucket Index Table") {
  withSQLConf("hoodie.datasource.write.operation" -> "bulk_insert", "hoodie.bulkinsert.shuffle.parallelism" -> "1") {
    withTempDir { tmp =>
      val tableName = generateTableName
      // Create a partitioned table
      spark.sql(
        s"""
            |create table $tableName (
            |  id int,
            |  dt string,
            |  name string,
            |  price double,
            |  ts long
            |) using hudi
            | tblproperties (
            | primaryKey = 'id,name',
            | type = 'cow',
            | preCombineField = 'ts',
            | hoodie.index.type = 'BUCKET',
            | hoodie.index.bucket.engine = 'SIMPLE',
            | hoodie.bucket.index.num.buckets = '2',
            | hoodie.bucket.index.hash.field = 'id,name',
            | hoodie.datasource.write.row.writer.enable = 'false')
            | partitioned by (dt)
            | location '${tmp.getCanonicalPath}'
            """.stripMargin)
      spark.sql(
        s"""
            | insert into $tableName values
            | (5, 'a1,1', 10, 1000, "2021-01-05")
            """.stripMargin)
      spark.sql(
        s"""
            | insert into $tableName values
            | (9, 'a3,3', 30, 3000, "2021-01-05")
         """.stripMargin)
    }
  }
}

Expected behavior

For a COW table, only parquet files should be created.
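
A check along these lines could be appended to the test above to make the expectation explicit (a minimal sketch; the partition path dt=2021-01-05 comes from the repro, everything else is an assumption):

// inside the withTempDir block, after the two inserts:
// the partition directory of a COW table should contain no MOR log files
val partitionDir = new java.io.File(tmp.getCanonicalPath, "dt=2021-01-05")
val logFiles = partitionDir.listFiles().filter(_.getName.contains(".log."))
assert(logFiles.isEmpty, s"COW table should not have log files, found: ${logFiles.mkString(", ")}")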

Environment Description

  • Hudi version : current master

  • Spark version : 3.5

geserdugarov (Contributor, Author) commented:

Currently, for this case BucketIndexBulkInsertPartitioner is used:

public Option<WriteHandleFactory> getWriteHandleFactory(int idx) {
  // doAppend decides per bucket: append to an already-initialized file group (log file),
  // or create a brand-new base file for it
  return doAppend.get(idx)
      ? Option.of(new AppendHandleFactory())
      : Option.of(new SingleFileHandleCreateFactory(FSUtils.createNewFileId(getFileIdPfx(idx), 0), this.preserveHoodieMetadata));
}

The first insert uses SingleFileHandleCreateFactory, but the second insert uses AppendHandleFactory and creates a log file.

I don't understand how bulk insert into a COW table with a simple bucket index should work by design. When we insert data that should update previous data, should we write a new parquet file with only the new data and then call inline compaction (because of the COW table type), or should we merge and write the data into a new parquet file, in which case it is no longer a bulk insert?

geserdugarov changed the title from "[SUPPORT] MOR table behavior for Spark bulk insert to COW" to "[SUPPORT] MOR table behavior for Spark Bulk insert to COW" on Oct 21, 2024
danny0405 (Contributor) commented:

Bulk_insert should only be executed once IMO; for the second write, you should use the upsert operation instead.
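
Following that suggestion against the repro above, a minimal sketch (assuming the Spark SQL writer picks up the session-level hoodie.datasource.write.operation setting, as withSQLConf does in the test):

// initial load only: bulk_insert
spark.sql("set hoodie.datasource.write.operation=bulk_insert")
spark.sql(s"insert into $tableName values (5, 'a1,1', 10, 1000, '2021-01-05')")

// later writes: upsert instead of bulk_insert
spark.sql("set hoodie.datasource.write.operation=upsert")
spark.sql(s"insert into $tableName values (9, 'a3,3', 30, 3000, '2021-01-05')")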
