
Add a BigQuery adapter macro to enable usage of CopyJobs #2709

Merged
merged 25 commits into from
Aug 19, 2020

Conversation

@kconvey (Contributor) commented Aug 17, 2020

resolves #2696

Description

Adds a macro copy_table to use BigQuery's CopyJob, which accepts a relation like
{{ copy_table(ref('merged_table')) }} and can be invoked when the materialization is table or incremental (to truncate or append to the destination table).
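Conceptually, the two supported materializations map onto BigQuery copy-job write dispositions: `table` replaces the destination, `incremental` appends to it. A minimal sketch of that mapping (illustrative only, not the PR's actual code; the helper name is hypothetical, but `WRITE_TRUNCATE`/`WRITE_APPEND` are real BigQuery job settings):

```python
def write_disposition_for(materialization: str) -> str:
    """Pick a BigQuery CopyJob write disposition for a dbt materialization.

    Hypothetical helper for illustration; mirrors the behavior described
    in this PR (truncate for 'table', append for 'incremental').
    """
    dispositions = {
        "table": "WRITE_TRUNCATE",      # replace the destination table
        "incremental": "WRITE_APPEND",  # append to the destination table
    }
    try:
        return dispositions[materialization]
    except KeyError:
        raise ValueError(
            "Copy materialization must be 'table' or 'incremental', "
            f"got {materialization!r}"
        )

# With google-cloud-bigquery, the actual copy would then look roughly like:
#   job_config = bigquery.CopyJobConfig(
#       write_disposition=write_disposition_for(materialization))
#   client.copy_table(source_ref, dest_ref, job_config=job_config).result()
```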

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label Aug 17, 2020
@kconvey kconvey changed the title Add copy_job macro Add a BigQuery adapter macro to enable usage of CopyJobs Aug 17, 2020
@beckjake (Contributor) left a comment

Thanks for opening this PR, these changes mostly look good! I see one issue where you accessed something that we should have hidden (sorry) and it doesn't exist anymore. I assume your tests will fail because of it.

I also had a flavor/style-related suggestion around the split between the adapter and the connection manager.

@kconvey (Contributor, Author) commented Aug 17, 2020

@beckjake made an attempt at some integration tests, but I can't say I've got super high confidence they're right (and I don't think I can kick them off)

@beckjake (Contributor) commented
@kconvey They should automatically kick off once you push; no need for us to intervene anymore! It looks like flake8 is upset about some newline/indentation-related triviality.

@beckjake (Contributor) commented
@kconvey I think the tests for your new changes aren't showing up at the bottom because of the CHANGELOG.md conflict. I can fix that up for you via the web interface if you'd like, or you can do it yourself. Either way, it should fix the box at the bottom saying they failed.

@kconvey (Contributor, Author) commented Aug 17, 2020

I've been looking at the tests at the commit level (rather than the bottom of the comments here), and they've been correctly failing, so I've been iterating on them. Planning to merge CHANGELOG.md at the end.

@kconvey (Contributor, Author) commented Aug 17, 2020

@beckjake got the following:

self = <test_bigquery_copy_models.TestBigqueryDatePartitioning testMethod=test__bigquery_copy_table>

    @use_profile('bigquery')
    def test__bigquery_copy_table(self):
        results = self.run_dbt(expect_pass=False)
>       self.assertEqual(len(results), 4)
E       TypeError: object of type 'NoneType' has no len()

test/integration/022_bigquery_test/test_bigquery_copy_models.py:39: TypeError

I'm guessing I should maybe separate the models I expect to pass in one run and the model I expect to fail in another, but should I not be getting a results list here?

@beckjake (Contributor) commented Aug 18, 2020

I'm guessing I should maybe separate the models I expect to pass in one run and the model I expect to fail in another, but should I not be getting a results list here?

I think you'll have to for this test. What's happening is that the NotImplementedException triggers at parse time (so no results), and it happens to be a very special exception that doesn't get re-raised in handling, so run_dbt returns. It ends up looking pretty stupid on the integration test side of things, unfortunately.
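The resulting split could be shaped roughly like this. This is only a sketch: `run_dbt` is stubbed here so the snippet is self-contained (in the real suite it comes from dbt's integration-test base class), and the model names are hypothetical. The stub encodes exactly the behavior described above: a parse-time failure yields no results list.

```python
def run_dbt(models, expect_pass=True):
    """Stand-in stub for dbt's integration-test run_dbt helper.

    A parse-time failure returns None (no results), matching the behavior
    described above; otherwise pretend every model ran successfully.
    """
    if not expect_pass:
        return None
    return ["ok"] * len(models)


def test_copy_models_pass():
    # run only the models expected to succeed, so a results list comes back
    results = run_dbt(["copy_as_table", "copy_as_incremental"])
    assert len(results) == 2


def test_copy_model_fails_at_parse():
    # the NotImplementedException fires at parse time, so there is no
    # results list to inspect -- assert on None instead of len()
    results = run_dbt(["copy_bad_materialization"], expect_pass=False)
    assert results is None
```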

@beckjake (Contributor) commented
I'm looking through this PR and I'm less sure about it.

It seems to me that:

  • copy_table is a macro that takes a Relation as its argument
    • when called, it performs a copy of the original table and moves it into the current table's location
    • it also sets the materialization to copy
  • that materialization type is implemented, and it:
    • runs pre-hooks
    • writes the sql to disk
    • runs post-hooks

I think I understand why this is the path you've gone down (you can't easily pass the source into the materialization), but this will be a weird behavior that doesn't work with the rest of dbt. In particular, I think dbt compile will trigger a copy!

Here's an alternative approach that might be better:

  • remove the copy macro
  • make copy a proper materialization type
  • have the copy materialization extract the ref/source calls from the model's refs and sources lists
  • have it call ref or source on that entry, and pass the result to adapter.copy_table

do something like this (untested, but...):

  {# there should be exactly one ref or exactly one source #}
  {% set destination = this.incorporate(type='table') %}

  {% set dependency_type = none %}
  {% if (model.refs | length) == 1 and (model.sources | length) == 0 %}
    {% set dependency_type = 'ref' %}
  {% elif (model.refs | length) == 0 and (model.sources | length) == 1 %}
    {% set dependency_type = 'source' %}
  {% else %}
    {% set msg %}
        Expected exactly one ref or exactly one source, instead got {{ model.refs | length }} models and {{ model.sources | length }} sources.
    {% endset %}
    {% do exceptions.raise_compiler_error(msg) %}
  {% endif %}

  {% if dependency_type == 'ref' %}
    {% set src = ref(*model.refs[0]) %}
  {% else %}
    {% set src = source(*model.sources[0]) %}
  {% endif %}

  {{ adapter.copy_table(
      src,
      destination,
      config.get('copy_materialization')) }}

  {# Clean up #}
  {{ run_hooks(post_hooks) }}
  {{ adapter.commit() }}

  {{ return({'relations': [destination]}) }}

Then your models, instead of looking like

{{ copy_table(ref('whatever')) }}

Could look like this:

{{ config(materialized='copy') }}
select * from {{ ref('whatever') }}

The select * from would be ignored (I guess you could assert on it if you wanted?), the materialization would just pick up the refs.
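The ref/source selection in that Jinja sketch boils down to a simple exclusive-or check. Restated in plain Python for clarity (not code from the PR; the function name is illustrative):

```python
def pick_dependency(refs, sources):
    """Return ('ref', entry) or ('source', entry) when the model has
    exactly one dependency; raise otherwise, mirroring the Jinja sketch."""
    if len(refs) == 1 and len(sources) == 0:
        return ("ref", refs[0])
    if len(refs) == 0 and len(sources) == 1:
        return ("source", sources[0])
    raise ValueError(
        "Expected exactly one ref or exactly one source, instead got "
        f"{len(refs)} models and {len(sources)} sources."
    )
```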

@kconvey (Contributor, Author) commented Aug 18, 2020

That's a good point, I forgot about dbt compile here. Going to take a stab at doing it this way.

@kconvey (Contributor, Author) commented Aug 19, 2020

Looks to be passing, and can hopefully be merged. Thanks for the suggestions and help with this, @beckjake!

@beckjake (Contributor) left a comment

This looks great, thanks!

@beckjake (Contributor) commented
I'm going to manually resolve these conflicts and merge this, they look minor.

@kconvey (Contributor, Author) commented Aug 19, 2020

I'm assuming one is probably the CHANGELOG.md conflict after #2711.

@beckjake (Contributor) commented
Yup, that and an import in the unit tests. As soon as the bq tests pass, I'm merging this so we can get rc1 out!

@beckjake beckjake merged commit 4273cc9 into dbt-labs:dev/marian-anderson Aug 19, 2020
Successfully merging this pull request may close these issues.

Support CopyJobs in BigQuery via a macro