When on_schema_change is set, pass common columns as dest_columns in incremental merge macros #4144

Closed

ccharlesgb opened this issue Oct 27, 2021 · 10 comments · Fixed by dbt-labs/dbt-bigquery#51, #4170, dbt-labs/dbt-snowflake#38, and dbt-labs/dbt-redshift#37


ccharlesgb commented Oct 27, 2021

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

We have noticed that removing a column from an incremental model that uses insert_overwrite on BigQuery causes dbt to fail.

Expected Behavior

We had initially assumed this was intended behaviour on dbt's part, since schema changes in incremental models were not supported until 0.21. However, after reading the documentation, we thought maybe this was not intentional and that the missing column should just be ignored.

Steps To Reproduce

  1. Create an incremental model with 3 columns:
{{
  config(
    alias = 'incr_test',
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by={'field': 'event_time', 'data_type': 'date'}
  )
}}

SELECT
  *
from (SELECT DATE('2021-01-01') as `event_time`,
      'a' as `col_a`,
      'b' as `col_b`)
  2. Run the model against a BigQuery instance and then remove col_b:
SELECT
  *
from (SELECT DATE('2021-01-01') as `event_time`,
      'a' as `col_a`)
  3. Run the model again and you will see this error:
Completed with 1 error and 0 warnings:

Database Error in model incr_test (incr_test.sql)
  Query error: Unrecognized name: col_b; Did you mean col_a? at [57:33]
  compiled SQL at target/run/auto_trader/models/incr_test.sql

Relevant log output

Completed with 1 error and 0 warnings:

Database Error in model incr_test (incr_test.sql)
  Query error: Unrecognized name: col_b; Did you mean col_a? at [57:33]
  compiled SQL at target/run/auto_trader/models/incr_test.sql


Environment

- OS: Mac OS
- Python: 3.8
- dbt: 0.20.1

What database are you using dbt with?

bigquery

Additional Context

No response

ccharlesgb added the bug and triage labels Oct 27, 2021

github-christophe-oudar commented Oct 27, 2021

I was going to report the same issue; I think it's a typical use case.
I have a similar example with:

SELECT DATE('2021-10-02') as date, 1 as field1

which would then be updated to:

SELECT DATE('2021-10-02') as date, 2 as field2

In my test, both incremental_strategy='merge' and incremental_strategy='insert_overwrite' have the issue.

Indeed, in my case, debug mode produces the following query:

merge into `project`.`dataset`.`table` as DBT_INTERNAL_DEST
        using (
          select * from `project`.`dataset`.`table__dbt_tmp`
        ) as DBT_INTERNAL_SOURCE
        on FALSE



    when not matched then insert
        (`date`, `field1`, `field2`)
    values
        (`date`, `field1`, `field2`)

Obviously the table__dbt_tmp schema (the output of the second query) only contains date and field2, and therefore this leads to an error.
I tracked it down to https://github.com/dbt-labs/dbt-core/blob/main/core/dbt/include/global_project/macros/materializations/common/merge.sql#L106-L109, where it appears that the query bases the inserted columns on the destination table's columns, but it actually needs to base them on the table__dbt_tmp columns.

The resulting when not matched clause would then become:

when not matched then insert
        (`date`, `field2`)
    values
        (`date`, `field2`)

I'm not 100% sure it covers every use case, but since we don't care about field1, I suspect it would work as intended.
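For context, the relevant pattern in that macro looks roughly like this (paraphrased, not the exact source): the insert list is driven entirely by dest_columns, which the materialization populates from the destination table, so a column that no longer exists in the __dbt_tmp source still gets referenced.

```sql
{# Paraphrase of the merge macro's insert clause: dest_columns drives both the
   column list and the values list, so if it comes from the destination table,
   dropped columns still appear and BigQuery raises "Unrecognized name". #}
{%- set dest_cols_csv = get_quoted_csv(dest_columns | map(attribute="name")) -%}

when not matched then insert
    ({{ dest_cols_csv }})
values
    ({{ dest_cols_csv }})
```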

In the meantime, I see 2 workarounds:

  • Select NULL for the columns you're no longer inserting, i.e. in my example:
SELECT DATE('2021-10-02') as date, NULL as field1, 2 as field2
  • Delete the column in the destination table, though that might not be an option: you'll likely want to keep a fallback on field1, or start backfilling field2 while still reading field1, etc.

Regarding that second option, even if you went with it via on_schema_change='sync_all_columns', it looks like there is another issue, as dbt will generate the following query:

 alter table `project`.`dataset`.`table`   
               ,
   
                   drop column field1

where we can see it's the comma that shouldn't be here.
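Presumably the intended statement, without the stray comma, is simply:

```sql
alter table `project`.`dataset`.`table`
    drop column field1
```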

@jtcohen6, that's the issue I was reporting to you.

Obviously it's quite a blocker for me for wider production usage of incremental materializations.
I can help fix it if we can agree on the solution, as both workarounds are pretty unsatisfying.

Let me know if you find something better. Thanks!

@jtcohen6 (Contributor)

@ccharlesgb Thanks for opening, and @github-christophe-oudar, thanks for finding :)

I do think the move here is to use the on_schema_change config, introduced in v0.21. Going forward, that's going to be our official answer for "handle column schema changes in incremental models."

As you saw, there's a syntax error with sync_all_columns if you're only removing columns. That comma shouldn't be there! I just opened an issue for it (#4147), and a quick fix to match (#4147).

jtcohen6 removed the triage label Oct 27, 2021

jtcohen6 commented Oct 27, 2021

So, we can try that again as:

{{
  config(
    alias = 'incr_test',
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by={'field': 'event_time', 'data_type': 'date'},
    on_schema_change='sync_all_columns'
  )
}}

SELECT
  *
from (SELECT DATE('2021-01-01') as `event_time`,
      'a' as `col_a`,
      'b' as `col_b`)

Change that to:

... same config ...

SELECT
  *
from (SELECT DATE('2021-01-01') as `event_time`,
      'a' as `col_a`)

And, with the fix from #4147, e.g. by copy-pasting the fixed version of alter_relation_add_remove_columns into my own project's macros/ folder, it runs perfectly.
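For anyone wanting to apply the same workaround before the next release: a macro defined in your project's macros/ folder with the same name takes precedence over the global one. A rough, illustrative sketch of what such an override can look like (not the exact #4147 patch; the key idea is to emit the separating comma only when there are both columns to add and columns to drop, and the assumption here is that the dispatched default__ variant is the one your adapter ends up using):

```sql
{# macros/alter_relation_add_remove_columns.sql -- illustrative sketch only,
   not the authoritative fix. Only emit the comma between the add and drop
   clauses when both lists are non-empty. #}
{% macro default__alter_relation_add_remove_columns(relation, add_columns, remove_columns) %}

  {% if add_columns is none %}{% set add_columns = [] %}{% endif %}
  {% if remove_columns is none %}{% set remove_columns = [] %}{% endif %}

  {% set sql -%}
    alter table {{ relation }}
      {% for column in add_columns %}
        add column {{ column.name }} {{ column.data_type }}{{ ',' if not loop.last }}
      {% endfor %}
      {{- ',' if add_columns and remove_columns }}
      {% for column in remove_columns %}
        drop column {{ column.name }}{{ ',' if not loop.last }}
      {% endfor %}
  {%- endset %}

  {% do run_query(sql) %}

{% endmacro %}
```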

@github-christophe-oudar

@jtcohen6 Thanks for the quick fix!
I think that most people are likely to use on_schema_change='append_new_columns'.
That's the only (and default) mode we support in one of Teads' internal tools, as dropping a column is pretty risky (if you plan to roll back in case of an issue, for instance).
Let me know if I can help (or review) 👋


jtcohen6 commented Oct 28, 2021

@github-christophe-oudar Ah, okay, I hear you!

To get this working with on_schema_change='append_new_columns', in the case where you've dropped the column from the incremental model query, but want to keep that column in the target table (with null values), we want the merge query to end up with:

when not matched then insert
        (`date`, `field2`)
    values
        (`date`, `field2`)

I confirmed that both BigQuery and Snowflake are smart enough to null out any unspecified columns, for both merge and insert statements.

To that end, I think the fix here would be to adjust the dest_cols passed into the various incremental merge macros. Instead of being every column from the destination table, we'd want it to be the intersection of columns from the source query and destination table. This only works in cases where we've already materialized the source query as a temporary table—the good news is that we always do that for on_schema_change, to compare column schemata.

I think the right place for this code change is in the BigQuery + Snowflake incremental materializations. In cases where we know a temp table exists, because we just used it to process on_schema_change, we could:

  • run adapter.get_columns_in_relation for both relations and take the intersection (even though this is redundant with the work done just before); see the sketch below
  • adjust the process_schema_changes macro to return the new shared column schema, and then set that to dest_columns

Some of these changes will need to happen in this repo, some in other repos (dbt-bigquery, dbt-snowflake). We can keep this issue open for now as a central spot for discussion.
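A minimal sketch of that first option, assuming we're inside the adapter's incremental materialization right after process_schema_changes has run (names like tmp_relation and existing_relation are illustrative; the real materializations may hold these under different variables):

```sql
{# Illustrative only: build dest_columns as the intersection of the tmp table's
   columns and the destination table's columns, then hand that to the merge macro. #}
{% set source_columns = adapter.get_columns_in_relation(tmp_relation) %}
{% set target_columns = adapter.get_columns_in_relation(existing_relation) %}
{% set source_names = source_columns | map(attribute='name') | map('lower') | list %}

{% set dest_columns = [] %}
{% for column in target_columns %}
  {% if (column.name | lower) in source_names %}
    {% do dest_columns.append(column) %}
  {% endif %}
{% endfor %}

{# dest_columns now contains only the shared columns, and is what gets passed
   into the merge / insert_overwrite SQL-generation macros #}
```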

Is that something you might be interested in giving a go? :)

jtcohen6 changed the title from "[Bug] Removing columns from incremental causes Query error in Big Query" to "[SF, BQ] When on_schema_change is set, pass common columns as dest_columns in incremental merge macros" Oct 28, 2021
jtcohen6 added the bigquery, good_first_issue, snowflake, enhancement, and incremental labels and removed the bug label Oct 28, 2021
@ccharlesgb (Author)

@jtcohen6 Thanks for the explanation here. When we looked into this further, we came to the same conclusion about dest_cols.

Would the change to use the intersection of columns also be applied to on_schema_change=ignore? Although we intend to use on_schema_change when we upgrade to 0.21, it also feels like the default behaviour should be able to gracefully handle the removal of a column and have BigQuery replace it with null. If this isn't possible, it would be good for the documentation to be updated regarding ignore.

@github-christophe-oudar

@jtcohen6 Alright, that solution looks great!
Indeed, the intersection is needed for on_schema_change='ignore' as well, since the destination table would be missing any new columns that exist in the temporary table.
It doesn't sound too complicated, especially with the pointers you added, so I can definitely look into it by the end of the week. Thanks!

@ccharlesgb I think that would be the behavior for ignore: it would put NULL values in columns that don't exist in both model schemas, and would likely fail if such a column isn't nullable.

@jtcohen6 (Contributor)

The only limitation for on_schema_change: ignore is around whether the temporary table already exists, such that we can check its schema. In some incremental strategies, the default incremental behavior passes the un-materialized SQL query as merge source. I agree that, wherever possible, it's desirable to support this behavior.

jtcohen6 changed the title from "[SF, BQ] When on_schema_change is set, pass common columns as dest_columns in incremental merge macros" to "When on_schema_change is set, pass common columns as dest_columns in incremental merge macros" Nov 7, 2021

Kayrnt commented Nov 7, 2021

@jtcohen6 all tests added and PRs are green ✅


Kayrnt commented Nov 8, 2021

Happy to have that issue solved!
Thank you @jtcohen6 for moving forward quickly 💪
I'm looking forward to the next release 🙌
