
187: Adding apache hudi support to dbt #210

Merged
merged 24 commits into from
Nov 19, 2021

Conversation

vingov
Contributor

@vingov vingov commented Aug 30, 2021

resolves #187

Description

Apache Hudi brings ACID transactions, record-level updates/deletes, and change streams to data lakes. Hudi and dbt are both great technologies; this PR integrates Apache Hudi file-format support into dbt so that users can create and model Hudi datasets using dbt.

This PR adds one more file format that supports the incremental merge strategy, so users can now use this feature in all Spark environments, in addition to the Delta format, which works only on the Databricks runtime environment.
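To illustrate the feature (this sketch is not part of the PR's diff; the model, source, and column names are hypothetical, while the config keys follow dbt-spark's existing conventions):

```sql
-- Hypothetical example model; names and values are illustrative only.
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    file_format='hudi',
    unique_key='event_id'
) }}

select event_id, event_ts, payload
from {{ source('raw', 'events') }}
{% if is_incremental() %}
  -- only process rows newer than what is already in the target table
  where event_ts > (select max(event_ts) from {{ this }})
{% endif %}
```

On an incremental run, dbt would merge new rows into the existing Hudi table on `event_id` rather than rebuilding it.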

Tested locally:

Found 5 models, 10 tests, 0 snapshots, 0 analyses, 169 macros, 0 operations, 0 seed files, 0 sources, 0 exposures

15:58:41 | Concurrency: 1 threads (target='local')
15:58:41 |
15:58:41 | 1 of 5 START incremental model analytics.hudi_insert_table........... [RUN]
15:59:40 | 1 of 5 OK created incremental model analytics.hudi_insert_table...... [OK in 59.70s]
15:59:40 | 2 of 5 START incremental model analytics.hudi_insert_overwrite_table. [RUN]
16:00:12 | 2 of 5 OK created incremental model analytics.hudi_insert_overwrite_table [OK in 31.27s]
16:00:12 | 3 of 5 START incremental model analytics.hudi_upsert_table........... [RUN]
16:00:32 | 3 of 5 OK created incremental model analytics.hudi_upsert_table...... [OK in 20.90s]
16:00:32 | 4 of 5 START incremental model analytics.hudi_upsert_partitioned_cow_table [RUN]
16:00:54 | 4 of 5 OK created incremental model analytics.hudi_upsert_partitioned_cow_table [OK in 21.59s]
16:00:54 | 5 of 5 START incremental model analytics.hudi_upsert_partitioned_mor_table [RUN]
16:01:15 | 5 of 5 OK created incremental model analytics.hudi_upsert_partitioned_mor_table [OK in 20.76s]
16:01:15 |
16:01:15 | Finished running 5 incremental models in 174.02s.

Completed successfully

Done. PASS=5 WARN=0 ERROR=0 SKIP=0 TOTAL=5

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests.
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label Aug 30, 2021
@atul016

atul016 commented Sep 22, 2021

@vingov What is the progress on this? Are you still working on it?

@vingov
Contributor Author

vingov commented Sep 22, 2021

Yes @atul016, I will fix the integration test in a few days.

The code works on Spark 3 but hits an edge case on Spark 2; the fix should land in the Apache Hudi repo, or possibly be a config change here.

Apart from the integration tests, do you have any other questions or comments about this PR?

@vinothchandar

@atul016 There is a lot of user interest for this. (I am the PMC chair for Apache Hudi). Please let us know how we can help take this forward.

@rubenssoto

@vingov how are you?

Do you have news about this PR...this is a very useful feature :)

@jtcohen6
Contributor

@vingov Thank you for this amazing contribution, and for the detailed testing you're adding along the way! I would love to include this in dbt-spark==v1.0.0, which we'll be releasing in December to coordinate with dbt-core v1.0.0. I'd be happy to work with you to get this over the finish line.

It looks like you're running into a tight spot around the integration tests, related to Spark 2. I'd have no opposition to upgrading the containerized cluster we run in CI, so that it uses Spark 3 instead, but we've struggled with that update in the past (#145).

Also, as a fair warning, we're likely to move integration tests from CircleCI to GitHub Actions (as we've done for other plugins), and to lightly refactor the way we're setting up those integration tests (#227). I don't think that should impact the majority of your changes, I just don't want a big merge conflict to come as a surprise.

@vingov
Contributor Author

vingov commented Oct 15, 2021

@vingov how are you?

Do you have news about this PR...this is a very useful feature :)

Hey @rubenssoto - Sorry, I was on vacation for the last few weeks; I'm back and I'll get this landed soon.

@vingov
Contributor Author

vingov commented Oct 15, 2021

@jtcohen6 - Thanks for the insights. I was about to ask you about updating the CI to use Spark 3; I will dig deeper into the gaps in #145.

Thanks for the heads-up on GitHub Actions.

I will iterate on Spark 3; after my findings, we can work together to get this PR landed. I'm on the dbt Slack as well, so you can reach me there to iterate faster.

@rubenssoto

@vingov don't be sorry, thank you so much for your work :)

@vingov
Contributor Author

vingov commented Oct 25, 2021

@jtcohen6 - Hey, can you please approve the CI workflow so the integration tests run? It's stopped with a message saying that a maintainer needs to approve running workflows.

@jtcohen6
Contributor

@vingov Approved to run unit tests and code checks via GitHub Actions. We're still mid-cutover between CircleCI and GHA.

@vingov
Contributor Author

vingov commented Nov 17, 2021

Hey @jtcohen6 - can you please approve the CI workflow to run the integration tests? I rebased and fixed the integration tests, and ran the CircleCI job locally to test it out as well.

@vingov
Contributor Author

vingov commented Nov 18, 2021

@jtcohen6 - I'm really sorry to bug you again; last time I checked and fixed only the integration-spark-thrift CircleCI tests.

The Databricks tests were not running locally for me, so I could not test them out; I have now fixed that Databricks error as well. Can you please approve the workflow again? Thanks in advance.

Contributor

@jtcohen6 jtcohen6 left a comment


@vingov Sure thing! It looks like the one failing test may be related to lack of hudi support on Databricks. I'd recommend disabling that model for those tests or, if you see fit, setting up the persist_docs test case to run on Apache Spark + Hudi as well.

@@ -0,0 +1,2 @@
{{ config(materialized='table', file_format='hudi') }}
Contributor


Am I right in thinking that this is failing on Databricks because the hudi file format is not available there?

This specific test case (tests.integration.persist_docs) isn't running on Apache Spark right now. You're welcome to either:

  • add a test to test_persist_docs.py, separate from test_delta_comments, with @use_profile("apache_spark"), and configure this model to be enabled only for that test, and disabled when running with a databricks profile
  • disable this model for the time being
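Either option comes down to toggling the model's enabled config per profile. A hedged sketch of what that gate might look like (the target-name check is illustrative, assuming the integration profile is named apache_spark as in the decorator above):

```sql
{{ config(
    materialized='table',
    file_format='hudi',
    -- hypothetical gate: enable only when running against the Apache Spark profile
    enabled=(target.name == 'apache_spark')
) }}
```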

Contributor Author


Yes, you are right. Since there have been many iterations on this PR already, I'll disable the model for now to keep things simple and merge this PR; in the next iteration I'll bring back both of these tests.

@@ -77,6 +81,26 @@ def run_and_test(self):
def test_delta_strategies_databricks_cluster(self):
self.run_and_test()

# Uncomment this hudi integration test after the hudi 0.10.0 release to make it work.
Contributor


Neat! Out of curiosity, what's the change coming in v0.10 that will make this sail smoothly?

Contributor Author


Spark SQL DML support was added to Apache Hudi recently with the 0.9.0 release, but there were a few gaps that were fixed after that version shipped; those fixes are scheduled for the next release in a few weeks.

More specifically, these commits are the ones relevant to making these tests run smoothly.
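For context, a rough sketch of the Spark SQL DML that Hudi's recent releases enable (table, column, and source names are illustrative; the option names follow Hudi's Spark SQL documentation):

```sql
-- Create a Hudi-managed table via Spark SQL (illustrative schema)
CREATE TABLE hudi_events (id BIGINT, ts TIMESTAMP, payload STRING)
USING hudi
TBLPROPERTIES (primaryKey = 'id', preCombineField = 'ts');

-- Upsert staged rows into it
MERGE INTO hudi_events t
USING staged_events s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

It is this MERGE INTO path that dbt's merge incremental strategy relies on.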

@vingov
Contributor Author

vingov commented Nov 18, 2021

@jtcohen6 - Please approve the workflow one last time, thanks!

@vingov
Contributor Author

vingov commented Nov 18, 2021

@jtcohen6 - Finally, all the integration tests passed. I guess it still needs your approval to run the Python 3.8 unit tests.

Contributor

@jtcohen6 jtcohen6 left a comment


@vingov Thank you for the contribution! Very neat to be able to include this in time for v1 :)

@jtcohen6 jtcohen6 merged commit 68a3b5a into dbt-labs:main Nov 19, 2021
leahwicz added a commit that referenced this pull request Dec 3, 2021
* Refactor seed macros, clearer sql param logging (#250)

* Try refactoring seed macros

* Add changelog entry

* 187: Adding apache hudi support to dbt (#210)

* initial working version

* Rebased and resolve all the merge conflicts.

* Rebased and resolved merge conflicts.

* Removed hudi dep jar and used the released version via packages option

* Added insert overwrite unit tests for hudi

* Used unique_key as default value for hudi primaryKey option

* Updated changelog.md with this new update.

* Final round of testing and few minor fixes

* Fixed lint issues

* Fixed the integration tests

* Fixed the circle ci env to add hudi packages

* Updated hudi spark bundle to use scala 2.11

* Fixed Hudi incremental strategy integration tests and other integration tests

* Fixed the hudi hive sync hms integration test issues

* Added sql HMS config to fix the integration tests.

* Added hudi hive sync mode conf to CI

* Set the hms schema verification to false

* Removed the merge update columns hence its not supported.

* Passed the correct hiveconf to the circle ci build script

* Disabled few incremental tests for spark2 and reverted to spark2 config

* Added hudi configs to the circle ci build script

* Commented out the Hudi integration test until we have the hudi 0.10.0 version

* Fixed the macro which checks the table type.

* Disabled this model since hudi is not supported in databricks runtime, will be added later

* Update profile_template.yml for v1 (#247)

* Update profile_template.yml for v1

* PR feedback, fix indentation issues

* It was my intention to remove the square brackets

* Fixup changelog entry

* Merge main, update changelog

* Bump version to 1.0.0rc2 (#259)

* bumpversion 1.0.0rc2

* Update changelog

* Use pytest-dbt-adapter==0.6.0

* Corrected definition for set full_refresh_mode (#262)

* Replaced definition for set full_refresh_mode

* Updated changelog

* Edit changelog

Co-authored-by: Jeremy Cohen <[email protected]>

* `get_response` -> `AdapterResponse` (#265)

* Return AdapterResponse from get_response

* fix flake

Co-authored-by: Jeremy Cohen <[email protected]>
Co-authored-by: Vinoth Govindarajan <[email protected]>
Co-authored-by: Sindre Grindheim <[email protected]>

Successfully merging this pull request may close these issues.

Support for apache Hudi
5 participants