
Support temp tables in Snowflake "table" materializations #2725

Closed
gil-walzer-zocdoc opened this issue Aug 24, 2020 · 10 comments
Labels
enhancement New feature or request

Comments

@gil-walzer-zocdoc

Describe the feature

Please allow us to specify a "temporary" option in the config that will create a temp table materialization.

https://github.com/fishtown-analytics/dbt/blob/dev/marian-anderson/plugins/snowflake/dbt/include/snowflake/macros/materializations/table.sql#L26
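
For illustration, the option could look something like this in a model config. The `temporary` flag is what I'm proposing, not something dbt supports today, and the model/ref names are made up:

```sql
-- models/int_sales_staging.sql (hypothetical model)
-- `temporary=true` is the *proposed* option from this issue, not an existing dbt config
{{ config(materialized='table', temporary=true) }}

select *
from {{ ref('raw_sales') }}
```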

Describe alternatives you've considered

I could create a custom materialization that does what is described, but the change seems sufficiently small and useful to go in DBT's materialization.

Additional context

Snowflake offers two alternatives to permanent tables: temporary tables and transient tables. Temp tables are dropped at the end of the session, while transient tables must be explicitly dropped or they continue to incur storage charges. DBT's table materialization currently forces users to use transient tables, and there should be a way to opt out.

Who will this benefit?

Users who need intermediate tables in their projects but don't want those tables to persist between runs would benefit from having them be temporary. Projects with large datasets and frequent logic changes to their intermediate transforms fit this description.

Are you interested in contributing this feature?

I would be willing to make this change

@gil-walzer-zocdoc gil-walzer-zocdoc added enhancement New feature or request triage labels Aug 24, 2020
@jtcohen6
Contributor

jtcohen6 commented Aug 26, 2020

Hey @gil-walzer-zocdoc, what's your use case for temp tables as a materialization?

dbt opens several different Snowflake connections (sessions) within a given run, even more so when running with multiple threads. Since temp tables are dropped at the end of each session, they don't stick around for downstream models to select from them.

We do support a temporary argument to the create_table_as macro, and we use temp tables as a tool within more complex materializations. E.g. the Snowflake incremental materialization creates a temp table as the first step in an incremental run.
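
For reference, a custom materialization (or an override of the built-in one) could pass that flag through today. A minimal sketch, using the standard materialization context variables (`target_relation`, `sql`) and the existing `create_table_as` and `statement` macros; this is a sketch, not the built-in table materialization:

```sql
-- Inside a custom materialization: build the model's results as a temp table.
{% call statement('main') %}
  {{ create_table_as(True, target_relation, sql) }}
{% endcall %}
```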

If you don't want those models sticking around as tables between runs—to avoid storage costs, I'm guessing?—I'd advise:

  • Materializing them as ephemeral models or views instead. (This wouldn't work as well for large datasets.)
  • Adding an on-run-end hook that executes a series of drop table if exists statements.
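
For the hook route, a minimal sketch in dbt_project.yml (the table names below are made up; `target.schema` resolves to the schema dbt built into):

```yaml
# dbt_project.yml -- clean up intermediate tables at the end of every run
# (table names here are hypothetical)
on-run-end:
  - "drop table if exists {{ target.schema }}.int_sales_staging"
  - "drop table if exists {{ target.schema }}.int_orders_staging"
```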

@jtcohen6 jtcohen6 removed the triage label Aug 26, 2020
@gil-walzer-zocdoc
Author

Ah, OK. I didn't realize dbt actually opens several Snowflake connections; that certainly dashes my hopes.

I described our use case in the description: we're operating on a large dataset, but the domain is still pretty new to our business, so our transforms are still going through a lot of logic changes. We wanted to store intermediate transforms as temp tables rather than CTEs, but we also were running into issues where the transient table would be created with a set of columns that wouldn't match the logic in a subsequent run.

I should mention that we're leveraging a Snowflake database clone to run these transforms on while we develop. I came up with a custom method to set up clones and point Snowflake towards them via the profiles.yml file, but I'd be interested to hear if you or your team have anything in mind to natively support using Snowflake clones in DBT.
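
Roughly, the "pointing towards them" part is just a profiles.yml target whose database is the clone, something like this sketch (profile, account, and database names are made up):

```yaml
# profiles.yml -- a dev target that reads from and writes to a cloned database
# (all names here are hypothetical)
my_profile:
  target: dev_clone
  outputs:
    dev_clone:
      type: snowflake
      account: my_account
      user: my_user
      authenticator: externalbrowser
      role: transformer
      database: analytics_dev_clone   # the zero-copy clone
      warehouse: transforming
      schema: dbt_dev
      threads: 4
```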

@jtcohen6
Contributor

Right on, that makes sense.

we also were running into issues where the transient table would be created with a set of columns that wouldn't match the logic in a subsequent run.

I don't think I understand this piece. In subsequent runs, does the transient intermediate table encounter a mismatch with data changes in upstream sources, or with logic changes in downstream models? Wouldn't you be rerunning the intermediate model (and recreating the transient table) if it's within the modified section of the DAG?

Zero-copy cloning is a powerful Snowflake feature. Since it's available as DDL statements, we've found it to play nicely with existing dbt constructs (hooks and operations), and we haven't felt the need to wrap it more natively in an adapter method or materialization. Claire has a great discourse post about quickly spinning up a dev environment via ZCCs.
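
As a sketch of the operations route (macro and database names are made up; this is just plain Snowflake DDL wrapped in a macro, not an official dbt feature):

```sql
-- macros/clone_database.sql (hypothetical macro and database names)
-- Zero-copy clone a database; runnable via `dbt run-operation`
{% macro clone_database(source_db, target_db) %}
  {% set sql %}
    create or replace database {{ target_db }} clone {{ source_db }};
  {% endset %}
  {% do run_query(sql) %}
  {{ log("Cloned " ~ source_db ~ " into " ~ target_db, info=True) }}
{% endmacro %}
```

Invoked with something like: dbt run-operation clone_database --args '{source_db: analytics, target_db: analytics_dev}'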

@gil-walzer-zocdoc
Author

The issue was actually just for a single model which had a data change between two runs, not for data changes between models. We were using an incremental materialization, which was possibly incorrect.

What I was seeing was this (and remember, this is only about a single model): on the first run of our DAG (on a fresh database), DBT created transient tables with the columns specified in the model. On a subsequent run, the output columns of the model changed, and the materialization prepared a new temp table (suffixed with _dbt_tmp) with the correct columns, but did not replace the transient table before attempting the insert.

If you believe we're misusing the incremental materialization and would be better served with an alternative, please let us know.

@jtcohen6
Contributor

Got it! I don't think you're misusing the incremental materialization, though you should be aware that the inability to capture column additions/deletions/changes is a significant limitation. There's a long-lived issue with proposed remedies: #1132. Generally, models that frequently update their columns are poor candidates for incremental builds.
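
In the meantime, the usual workaround when an incremental model's columns change is to rebuild just that model with a full refresh, e.g. (model name made up):

```
dbt run --full-refresh -m my_incremental_model
```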

I'm going to close this issue, in favor of the bigger-picture discussion, since there isn't a specific code change we'll be making to support temp tables as a materialization.

@gil-walzer-zocdoc
Author

Sounds good. Thanks for the discussion.

@gil-walzer-zocdoc
Author

Hi @jtcohen6, we at Zocdoc have started looking into supporting concurrent DBT runs without threading (as separate processes). I wanted to double-check that your point above is still accurate: that dbt's Snowflake adapter opens and closes sessions between models. Is that still true?

@jtcohen6
Contributor

Yes, that's still true. This rough mapping holds: one model = one materialization = one Snowflake session/connection (possibly more if additional queries are run before/after)

@jim256

jim256 commented Jun 21, 2022

Hey, @jtcohen6.

We're migrating from traditional SQL transformations, where we used temp tables because the datasets were large. I'd love to stick with that approach instead of using CTEs, so I'm thinking transient tables. I also find temp tables more readable, easier to debug, more modular, etc.

The downside with anything but CTEs in dbt is naming conflicts. We have 100+ transformations that run every hour, and it's likely that two separate transformations would each create a "sales" temp table that doesn't use the same logic. Since temp tables are scoped to their session, we didn't have to worry about the temp table names of one transformation conflicting with those of another. I was hoping temp tables in dbt would do the trick, but it sounds like that's out.

Any recommendations on how to avoid naming conflicts without giving every temp table an unwieldy name (like [this_transformation_name]_sales)?

@KeeonTabrizi

Personally, I would love to see support for temporary tables, along with the ability to run a model on a single thread to avoid the issues with the multiple sessions DBT creates.
