
187: Adding apache hudi support to dbt #210

Merged
merged 24 commits into from
Nov 19, 2021

Conversation

vingov
Contributor

@vingov vingov commented Aug 30, 2021

resolves #187

Description

Apache Hudi brings ACID transactions, record-level updates/deletes, and change streams to data lakes. Hudi and dbt are both great technologies; this PR integrates Apache Hudi file-format support into dbt so that users can create and model Hudi datasets using dbt.

This PR adds one more file format that supports the incremental merge strategy, so users can now use this feature in all Spark environments, in addition to the Delta format, which works only on the Databricks runtime environment.
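To illustrate the feature (this sketch is not part of the PR's diff; the model, source, and column names are hypothetical, while the config keys follow dbt-spark's existing conventions):

```sql
-- Hypothetical example model; names and values are illustrative only.
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='hudi',
    unique_key='event_id'
) }}

select event_id, event_ts, payload
from {{ source('raw', 'events') }}
{% if is_incremental() %}
  -- only process rows newer than what is already in the target table
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```

On an incremental run, dbt would merge new rows into the existing Hudi table on `event_id` rather than rebuilding it.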

Tested locally:

Found 5 models, 10 tests, 0 snapshots, 0 analyses, 169 macros, 0 operations, 0 seed files, 0 sources, 0 exposures

15:58:41 | Concurrency: 1 threads (target='local')
15:58:41 |
15:58:41 | 1 of 5 START incremental model analytics.hudi_insert_table........... [RUN]
15:59:40 | 1 of 5 OK created incremental model analytics.hudi_insert_table...... [OK in 59.70s]
15:59:40 | 2 of 5 START incremental model analytics.hudi_insert_overwrite_table. [RUN]
16:00:12 | 2 of 5 OK created incremental model analytics.hudi_insert_overwrite_table [OK in 31.27s]
16:00:12 | 3 of 5 START incremental model analytics.hudi_upsert_table........... [RUN]
16:00:32 | 3 of 5 OK created incremental model analytics.hudi_upsert_table...... [OK in 20.90s]
16:00:32 | 4 of 5 START incremental model analytics.hudi_upsert_partitioned_cow_table [RUN]
16:00:54 | 4 of 5 OK created incremental model analytics.hudi_upsert_partitioned_cow_table [OK in 21.59s]
16:00:54 | 5 of 5 START incremental model analytics.hudi_upsert_partitioned_mor_table [RUN]
16:01:15 | 5 of 5 OK created incremental model analytics.hudi_upsert_partitioned_mor_table [OK in 20.76s]
16:01:15 |
16:01:15 | Finished running 5 incremental models in 174.02s.

Completed successfully

Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests.
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label Aug 30, 2021
@atul016

atul016 commented Sep 22, 2021

@vingov What is the progress on this? Are you still working on it?

@vingov
Contributor Author

vingov commented Sep 22, 2021

Yes @atul016, I will fix the integration test in a few days.

The code works on Spark 3 but hits an edge case on Spark 2; the fix should land in the Apache Hudi repo, or possibly be a config change here.

Apart from the integration tests, do you have any other questions or comments about this PR?

@vinothchandar

@atul016 There is a lot of user interest for this. (I am the PMC chair for Apache Hudi). Please let us know how we can help take this forward.

@rubenssoto

@vingov how are you?

Do you have news about this PR...this is a very useful feature :)

@jtcohen6
Contributor

@vingov Thank you for this amazing contribution, and for the detailed testing you're adding along the way! I would love to include this in dbt-spark==v1.0.0, which we'll be releasing in December to coordinate with dbt-core v1.0.0. I'd be happy to work with you to get this over the finish line.

It looks like you're running into a tight spot around the integration tests, related to Spark 2. I'd have no opposition to upgrading the containerized cluster we run in CI, so that it uses Spark 3 instead, but we've struggled with that update in the past (#145).

Also, as a fair warning, we're likely to move integration tests from CircleCI to GitHub Actions (as we've done for other plugins), and to lightly refactor the way we're setting up those integration tests (#227). I don't think that should impact the majority of your changes, I just don't want a big merge conflict to come as a surprise.

@vingov
Contributor Author

vingov commented Oct 15, 2021

@vingov how are you?

Do you have news about this PR...this is a very useful feature :)

Hey @rubenssoto - Sorry, I was on vacation for the last few weeks; I'm back and I'll get this landed soon.

@vingov
Contributor Author

vingov commented Oct 15, 2021

@jtcohen6 - Thanks for the insights. I was about to ask you about updating the CI to use Spark 3; I will dig deeper into the gaps in #145.

Thanks for the heads-up on GitHub Actions.

I will iterate on Spark 3; after my findings, we can work together to get this PR landed. I'm on the dbt Slack as well, so you can reach me there to iterate faster.

@rubenssoto

@vingov don't be sorry, thank you so much for your work :)

@vingov
Contributor Author

vingov commented Oct 25, 2021

@jtcohen6 - Hey, can you please approve the CI workflow so the integration tests run? It's stopped with a message saying that a maintainer needs to approve running workflows.

@jtcohen6
Contributor

@vingov Approved to run unit tests and code checks via GitHub Actions. We're still mid-cutover between CircleCI and GHA.

@vingov
Contributor Author

vingov commented Nov 17, 2021

Hey @jtcohen6 - can you please approve the CI workflow to run the integration tests? I rebased and fixed the integration tests, and ran the CircleCI job locally to test it out as well.

@vingov
Contributor Author

vingov commented Nov 18, 2021

@jtcohen6 - I'm really sorry to bug you again; last time I checked and fixed only the integration-spark-thrift CircleCI tests.

The Databricks tests were not running locally for me, so I could not test them out; I have now fixed that Databricks error as well. Can you please approve the workflow again? Thanks in advance.

Contributor

@jtcohen6 jtcohen6 left a comment


@vingov Sure thing! It looks like the one failing test may be related to lack of hudi support on Databricks. I'd recommend disabling that model for those tests or, if you see fit, setting up the persist_docs test case to run on Apache Spark + Hudi as well.

@@ -0,0 +1,2 @@
{{ config(materialized='table', file_format='hudi') }}
Contributor


Am I right in thinking that this is failing on Databricks because the hudi file format is not available there?

This specific test case (tests.integration.persist_docs) isn't running on Apache Spark right now. You're welcome to either:

  • add a test to test_persist_docs.py, separate from test_delta_comments, with @use_profile("apache_spark"), and configure this model to be enabled only for that test, and disabled when running with a databricks profile
  • disable this model for the time being
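Either option comes down to toggling the model's enabled config per profile. A hedged sketch of what that gate might look like (the target-name check is illustrative, assuming the integration profile is named apache_spark as in the decorator above):

```sql
{{ config(
    materialized='table',
    file_format='hudi',
    -- hypothetical gate: enable only when running against the Apache Spark profile
    enabled=(target.name == 'apache_spark')
) }}
```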

Contributor Author


Yes, you are right. Since there have been many iterations on this PR already, I'll disable the model for now to keep things simple and merge this PR; in the next iteration I'll bring back both of these tests.

@@ -77,6 +81,26 @@ def run_and_test(self):
def test_delta_strategies_databricks_cluster(self):
self.run_and_test()

# Uncomment this hudi integration test after the hudi 0.10.0 release to make it work.
Contributor


Neat! Out of curiosity, what's the change coming in v0.10 that will make this sail smoothly?

Contributor Author


Spark SQL DML support was added to Apache Hudi recently with the 0.9.0 release, but there were a few gaps that were fixed after that version shipped; those fixes are scheduled for the next release in a few weeks.

More specifically, these commits are the ones relevant to making these tests run smoothly.
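For context, a rough sketch of the Spark SQL DML that Hudi's recent releases enable (table, column, and source names are illustrative; the option names follow Hudi's Spark SQL documentation):

```sql
-- Create a Hudi-managed table via Spark SQL (illustrative schema)
CREATE TABLE hudi_events (id BIGINT, ts TIMESTAMP, payload STRING)
USING hudi
TBLPROPERTIES (primaryKey = 'id', preCombineField = 'ts');

-- Upsert staged rows into it
MERGE INTO hudi_events t
USING staged_events s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

It is this MERGE INTO path that dbt's merge incremental strategy relies on.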

@vingov
Contributor Author

vingov commented Nov 18, 2021

@jtcohen6 - Please approve the workflow one last time, thanks!

@vingov
Contributor Author

vingov commented Nov 18, 2021

@jtcohen6 - Finally, all the integration tests passed. I guess it still needs your approval to run the Python 3.8 unit tests.

Contributor

@jtcohen6 jtcohen6 left a comment


@vingov Thank you for the contribution! Very neat to be able to include this in time for v1 :)

@jtcohen6 jtcohen6 merged commit 68a3b5a into dbt-labs:main Nov 19, 2021
leahwicz added a commit that referenced this pull request Dec 3, 2021
* Refactor seed macros, clearer sql param logging (#250)

* Try refactoring seed macros

* Add changelog entry

* 187: Adding apache hudi support to dbt (#210)

* initial working version

* Rebased and resolve all the merge conflicts.

* Rebased and resolved merge conflicts.

* Removed hudi dep jar and used the released version via packages option

* Added insert overwrite unit tests for hudi

* Used unique_key as default value for hudi primaryKey option

* Updated changelog.md with this new update.

* Final round of testing and few minor fixes

* Fixed lint issues

* Fixed the integration tests

* Fixed the circle ci env to add hudi packages

* Updated hudi spark bundle to use scala 2.11

* Fixed Hudi incremental strategy integration tests and other integration tests

* Fixed the hudi hive sync hms integration test issues

* Added sql HMS config to fix the integration tests.

* Added hudi hive sync mode conf to CI

* Set the hms schema verification to false

* Removed the merge update columns hence its not supported.

* Passed the correct hiveconf to the circle ci build script

* Disabled few incremental tests for spark2 and reverted to spark2 config

* Added hudi configs to the circle ci build script

* Commented out the Hudi integration test until we have the hudi 0.10.0 version

* Fixed the macro which checks the table type.

* Disabled this model since hudi is not supported in databricks runtime, will be added later

* Update profile_template.yml for v1 (#247)

* Update profile_template.yml for v1

* PR feedback, fix indentation issues

* It was my intention to remove the square brackets

* Fixup changelog entry

* Merge main, update changelog

* Bump version to 1.0.0rc2 (#259)

* bumpversion 1.0.0rc2

* Update changelog

* Use pytest-dbt-adapter==0.6.0

* Corrected definition for set full_refresh_mode (#262)

* Replaced definition for set full_refresh_mode

* Updated changelog

* Edit changelog

Co-authored-by: Jeremy Cohen <[email protected]>

* `get_response` -> `AdapterResponse` (#265)

* Return AdapterResponse from get_response

* fix flake

Co-authored-by: Jeremy Cohen <[email protected]>
Co-authored-by: Vinoth Govindarajan <[email protected]>
Co-authored-by: Sindre Grindheim <[email protected]>

Successfully merging this pull request may close these issues.

Support for apache Hudi
5 participants