Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient ways to handle column additions / deletions for incremental models #823

Closed
chuangliuconvoy opened this issue Jul 3, 2018 · 3 comments
Labels
enhancement New feature or request

Comments

@chuangliuconvoy
Copy link

My company has some very large tables (billions of rows). Engineers add or delete columns from these tables several times per month. We use incremental dbt models to process data in these redshift tables (named as source tables), and output results to another set of redshift tables (named as target tables), and need full-refresh a table every time a column is added / deleted. Because full-refresh involves billions of rows, I want to know if there is a way to avoid it.

First case is for adding an empty column to a source table. The incremental data model simply reads data from source table and writes to a target table. Instead of doing a full-refresh, can I just add the new column in the target table and refer to the new column in the incremental model ?

Second case is for dropping a column from a source table. The incremental data model simply reads data from source table and writes to a target table. Instead of doing a full refresh, can I just drop the column from the target table, and remove the column reference from the incremental model ?

Thanks.

@drewbanin
Copy link
Contributor

Thanks for the writeup @chuangliuconvoy! There's a somewhat related issue over here: #320

I think the big question is "how should an incremental model behave when its schema changes?"

I think there are generally three options:

  1. Do nothing, and ignore any new columns produced by the model (current behavior)
  2. Automatically run a full-refresh (see linked issue above)
  3. Create a new column in the target table, with historical data nulled out.

The current approach, item 1, is passable. The user is encouraged to run a --full-refresh at their discretion to fill in data for the new field(s), but no data will appear in the target table until the --full-refresh is run.

The second approach is pretty untenable for dbt at the moment. dbt is stateless, so there's no good way to identify when a model's SQL has changed between runs.

The third approach is definitely more feasible than the second approach, but has a similar kind of idea. Instead of full-refreshing the model, dbt could add or remove columns to make the target schema match the schema produced by the model SQL. There is some existing code here that's responsible for "expanding" column types to make model SQL and the target table share a compatible schema. In the future, this function could also add new columns and delete extraneous columns as required to make the target schema and the model SQL identical.

The caveat of approach 3 is of course that the columns will exist, but they will not be backfilled with data as with a --full-refresh.

I'm unsure if this proposed implementation is more or less confusing than the current implementation, but I know that folks have a lot of questions about what happens when incremental logic changes, and I wonder if a change like this wouldn't help to mitigate the confusion, with the added benefit of being more performant in the case of issues like @chuangliuconvoy's.

...

@chuangliuconvoy we discussed your question in Slack, but I'll repeat it here for the sake of anyone reading this issue in the future:

I'd avoid mutating the tables that dbt produces manually, as these manual migrations are both error-prone and unversioned. Ideally, dbt would be able to run this sort of "migration" for you as described above, such that the behavior of incremental models will be consistent across dev and prod environments.

@drewbanin drewbanin added the enhancement New feature or request label Jul 3, 2018
@chuangliuconvoy
Copy link
Author

@drewbanin I went through dbt documentation, and found an interesting function "get_missing_columns" in adapter https://docs.getdbt.com/v0.10/reference#adapter

For the use case of adding a new column to a simple incremental model that just copy data from one table to another, instead of doing a full-refresh, can I do something like following before each incremental model run?

{% for col in adapter.get_missing_columns(my_src_schema, this.name, this.schema, this.name) %}
alter table {{this}} add column "{{col.name}}" {{col.data_type}};
{% endfor %}

@drewbanin
Copy link
Contributor

Closing this in favor of #1132, which is more actionable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants