Efficient ways to handle column additions / deletions for incremental models #823

chuangliuconvoy · 2018-07-03T21:24:54Z

My company has some very large tables (billions of rows). Engineers add or delete columns from these tables several times per month. We use incremental dbt models to process data in these redshift tables (named as source tables), and output results to another set of redshift tables (named as target tables), and need full-refresh a table every time a column is added / deleted. Because full-refresh involves billions of rows, I want to know if there is a way to avoid it.

First case is for adding an empty column to a source table. The incremental data model simply reads data from source table and writes to a target table. Instead of doing a full-refresh, can I just add the new column in the target table and refer to the new column in the incremental model ?

Second case is for dropping a column from a source table. The incremental data model simply reads data from source table and writes to a target table. Instead of doing a full refresh, can I just drop the column from the target table, and remove the column reference from the incremental model ?

Thanks.

drewbanin · 2018-07-03T22:11:23Z

Thanks for the writeup @chuangliuconvoy! There's a somewhat related issue over here: #320

I think the big question is "how should an incremental model behave when its schema changes?"

I think there are generally three options:

Do nothing, and ignore any new columns produced by the model (current behavior)
Automatically run a full-refresh (see linked issue above)
Create a new column in the target table, with historical data nulled out.

The current approach, item 1, is passable. The user is encouraged to run a --full-refresh at their discretion to fill in data for the new field(s), but no data will appear in the target table until the --full-refresh is run.

The second approach is pretty untenable for dbt at the moment. dbt is stateless, so there's no good way to identify when a model's SQL has changed between runs.

The third approach is definitely more feasible than the second approach, but has a similar kind of idea. Instead of full-refreshing the model, dbt could add or remove columns to make the target schema match the schema produced by the model SQL. There is some existing code here that's responsible for "expanding" column types to make model SQL and the target table share a compatible schema. In the future, this function could also add new columns and delete extraneous columns as required to make the target schema and the model SQL identical.

The caveat of approach 3 is of course that the columns will exist, but they will not be backfilled with data as with a --full-refresh.

I'm unsure if this proposed implementation is more or less confusing than the current implementation, but I know that folks have a lot of questions about what happens when incremental logic changes, and I wonder if a change like this wouldn't help to mitigate the confusion, with the added benefit of being more performant in the case of issues like @chuangliuconvoy's.

...

@chuangliuconvoy we discussed your question in Slack, but I'll repeat it here for the sake of anyone reading this issue in the future:

I'd avoid mutating the tables that dbt produces manually, as these manual migrations are both error-prone and unversioned. Ideally, dbt would be able to run this sort of "migration" for you as described above, such that the behavior of incremental models will be consistent across dev and prod environments.

chuangliuconvoy · 2018-07-05T18:23:15Z

@drewbanin I went through dbt documentation, and found an interesting function "get_missing_columns" in adapter https://docs.getdbt.com/v0.10/reference#adapter

For the use case of adding a new column to a simple incremental model that just copy data from one table to another, instead of doing a full-refresh, can I do something like following before each incremental model run?

{% for col in adapter.get_missing_columns(my_src_schema, this.name, this.schema, this.name) %}
alter table {{this}} add column "{{col.name}}" {{col.data_type}};
{% endfor %}

drewbanin · 2018-11-15T14:28:41Z

Closing this in favor of #1132, which is more actionable.

drewbanin added the enhancement New feature or request label Jul 3, 2018

drewbanin closed this as completed Nov 15, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Efficient ways to handle column additions / deletions for incremental models #823

Efficient ways to handle column additions / deletions for incremental models #823

chuangliuconvoy commented Jul 3, 2018

drewbanin commented Jul 3, 2018

chuangliuconvoy commented Jul 5, 2018

drewbanin commented Nov 15, 2018

Efficient ways to handle column additions / deletions for incremental models #823

Efficient ways to handle column additions / deletions for incremental models #823

Comments

chuangliuconvoy commented Jul 3, 2018

drewbanin commented Jul 3, 2018

chuangliuconvoy commented Jul 5, 2018

drewbanin commented Nov 15, 2018