
[CT-1487] [Feature] Allow DBT Models to Reference Created Temporary Tables prior to final result set & materialization. #6234

Closed
KeeonTabrizi opened this issue Nov 10, 2022 · 15 comments
Labels
enhancement (New feature or request) · stale (Issues that have gone stale)

Comments

@KeeonTabrizi

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

This feature request is to support using temporary tables (within a single session) instead of CTE references in the execution of a single dbt model.

To be clear, as I've seen this confusion arise: this is not a request to materialize a temporary object, but rather to utilize a temporary object in the materialization itself.

I believe it would require a model-specific override to limit execution to within 1 thread / session (vs. the global thread default in dbt_project.yml). Additionally, the compilation / execution order would first require the execution of the temporary objects, followed by the remaining query (which could actually include a CTE). For the examples below I will assume a Snowflake DB.

Currently, dbt compiles a model like the one below:

dbt_model.sql

{{ config(
    alias="your_table",
    materialized="table"
) }}

WITH my_data AS (
      SELECT ...
      FROM {{ ref('my_other_dbt_model') }}
)

SELECT ...
FROM my_data

The run target may look something like:

CREATE OR REPLACE TABLE db.schema.your_table AS (
WITH my_data AS (
      SELECT ...
      FROM db.schema.my_other_dbt_model
)

SELECT ...
FROM my_data

);

A simple version of this query that utilizes a temporary table could look as follows:

dbt_model_temp_table.sql

{{ config(
    alias="your_table",
    materialized="table"
) }}

CREATE TEMPORARY TABLE my_data AS
      SELECT ...
      FROM {{ ref('my_other_dbt_model') }}
;

SELECT ...
FROM my_data

The run target will currently generate invalid SQL as follows:

CREATE OR REPLACE TABLE db.schema.your_table AS (
CREATE TEMPORARY TABLE my_data AS
      SELECT ...
      FROM db.schema.my_other_dbt_model
;

SELECT ...
FROM my_data

);

The compiled version should first execute any create statements and then wrap the final query which generates the result set to materialize:

CREATE TEMPORARY TABLE my_data AS
      SELECT ...
      FROM db.schema.my_other_dbt_model
;

CREATE OR REPLACE TABLE db.schema.your_table AS ( -- what dbt would materialize
  SELECT ...
  FROM my_data
);

What I believe is needed is an additional config argument on the model to indicate that temporary tables are used. This would force a single session/thread for the model's execution and re-order its execution: first running the temporary-object statements in the session, then generating the final result set for dbt to materialize.
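
For illustration, the earlier dbt_model_temp_table.sql might gain a hypothetical config like this. Note that use_temp_tables and single_session are made-up option names to sketch the idea, not existing dbt config arguments:

-- NOTE: use_temp_tables and single_session are hypothetical options, not real dbt config
{{ config(
    alias="your_table",
    materialized="table",
    use_temp_tables=true,
    single_session=true
) }}

CREATE TEMPORARY TABLE my_data AS
      SELECT ...
      FROM {{ ref('my_other_dbt_model') }}
;

SELECT ...
FROM my_data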

Describe alternatives you've considered

Converting everything to CTEs or creating individual upstream models is the only option today. However, for the development / iteration cycles around complex queries and transformations where temporary tables are used, I feel strongly that having the option to materialize those queries directly through dbt models would be a win for developers.

Who will this benefit?

This will benefit anyone developing complex queries, as well as those already using temporary tables, which can improve debugging and the speed of query development since results remain available to an open session.

It allows a developer to move a query > model > production more quickly, and provides that option versus being forced to convert a query into CTEs and/or break up a query into component models with dependencies.

While this may theoretically be better data-engineering practice, it shouldn't be forced by a lack of support for temporary tables.

See this comment which also describes some of the benefit: #2725 (comment)

Are you interested in contributing this feature?

Happy to ideate

Anything else?

Tangentially related issues & PRs:

@KeeonTabrizi KeeonTabrizi added the enhancement and triage labels Nov 10, 2022
@github-actions github-actions bot changed the title [Feature] Allow DBT Models to Reference Created Temporary Tables prior to final result set & materialization. [CT-1487] [Feature] Allow DBT Models to Reference Created Temporary Tables prior to final result set & materialization. Nov 10, 2022
@dbeatty10
Contributor

@KeeonTabrizi thanks for a thorough write-up and links to tangential issues and PRs!

We are always interested in discussing opportunities to improve debugging / speed of query development.

Question:

  • Is this feature request primarily intended for use during iterative development? Or is it intended for deployment to production as well?

@KeeonTabrizi
Author

@dbeatty10 thanks for the comment. It's definitely intended for production use.

The iterative development cycle can yield something production-ready through these temporary objects, and it's helpful for that to be an option. The alternative is to force translation of these objects back into CTEs, which adds more overhead to the process and ultimately lengthens the time to get something out the door when it was otherwise ready to ship.

@dbeatty10
Contributor

Gotcha @KeeonTabrizi

What are you trying to optimize for in the iterative development cycle?

  1. Having all the logic you are currently working on within the same file
  2. Re-use within multiple downstream models
  3. Persisting a large data set to save time of re-materializing it during each dev re-build/re-run cycle
  4. Not persisting certain data sets beyond the end of a build (presumably to save storage costs)
  5. Something else?

Options

As you're probably well aware, there are a variety of options for developing dbt models, including:

  • table materialization (permanent or transient)
  • ephemeral materialization
  • view materialization
  • explicit CTE within one of the above

Let's explore what each of those are optimized for:

|  | table | ephemeral | view | explicit CTE |
| --- | --- | --- | --- | --- |
| A) Having all the logic you are currently working on within the same file |  |  |  | ✅ |
| B) Re-use within multiple downstream models | ✅ | ✅ | ✅ |  |
| C) Persisting a large data set to save time of re-materializing it during each dev re-build/re-run cycle | ✅ |  |  | 💡 |
| D) Not persisting certain data sets beyond the end of a build (presumably to save storage costs) | 💡 | ✅ | ✅ | ✅ |

We can see in the chart above that A) and B) are generally opposites, as are C) and D). 'Tis the nature of trade-offs!

Table materialization

Optimized for B) and C)

💡 But we can unlock D) by mimicking the way Snowflake temporary tables behave in practice!

You'd just add drop table if exists statements as on-run-end hooks if you don't want the table to stick around beyond a full dbt run/build.

The lifespan of these "temp" tables would be limited to a dbt "session" (spanning Snowflake connections) rather than just a single Snowflake connection session.

Ephemeral and view materializations

Optimized for B) and D)

Explicit CTE

Optimized for A) and D).

💡 Although they are not optimized for C), adding a limit X during development can reduce the size of the data set and speed up cycle times.
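
A minimal sketch of that pattern, assuming a dev target named 'dev' (the target name and row count are arbitrary):

with my_data as (
    select ...
    from {{ ref('my_other_dbt_model') }}
    {% if target.name == 'dev' %}
    limit 1000  -- dev-only: arbitrary row cap to speed up iteration
    {% endif %}
)

select ...
from my_data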

Example

"Temp" tables within a dbt "session"

my_data.sql

{{ config(
    materialized="table"
) }}

select ...
from {{ ref('my_other_dbt_model') }}

your_table.sql

{{ config(
    materialized="table"
) }}

select ...
from {{ ref('my_data') }}

dbt_project.yml

...
# on-run-end:
#   - "drop table if exists {{ ref('my_data') }};"

☝️ Uncomment the above once you want it to act like a "temp" table in production. Leaving it commented will allow it to persist across sessions during development.

Usage

Rebuilding just the downstream table (leaving the "temp" table as-is):

dbt build -s your_table

Rebuilding both the "temp" table and its downstream consumer:

dbt build -s my_data your_table

@KeeonTabrizi
Author

@dbeatty10 many thanks for the detailed response. I will make sure to review it in detail and respond; the week is busy with holidays, but please give me a bit and I will make sure to provide responses.

@KeeonTabrizi
Author

@dbeatty10 hope my responses help. I'd also be happy to set up a quick 15-minute chat if that helps.

What are you trying to optimize for in the iterative development cycle?

  1. Having all the logic you are currently working on within the same file
  2. Re-use within multiple downstream models
  3. Persisting a large data set to save time of re-materializing it during each dev re-build/re-run cycle
  4. Not persisting certain data sets beyond the end of a build (presumably to save storage costs)
  5. Something else?

With regard to the scenarios you described above, #1 and #3 are relevant; #2 and #4 are not part of my considerations.

Overall, I do not believe "mimicking the way Snowflake temporary tables behave in practice" is practical for the development cycle I'm thinking of.

Consider a situation where one needs to build a model that likely requires drafting a complex query (many dependencies, many transformations) with a high computation cost and long run-time. Let's also add the real-world constraint that time is limited and it's not feasible to build the best, most streamlined dbt pipeline to achieve the model.

Often, when faced with such a task, it is unknown what the ideal composite pieces of data are to achieve the end result. It is a series of trial and error, and it takes multiple iterations to understand how to break the data up into logical components/transformations. Temporary tables (instead of CTEs) allow one to keep data hot/available within a session so you can iterate on the query quickly. The end result is really no different from using CTEs; most complex queries could also start with a series of CTEs, and dbt has no issues with that.

I don't think this is a question of best practices in analytics engineering; it's about supporting native functions of a database in the generation of data. In some databases like Redshift, if your sort keys etc. are not well matched to how the data is used, CTEs can cause major slowdowns, while simply declaring temp tables equivalent to the CTEs and asking Redshift to ANALYZE the temporary objects before downstream use can significantly outperform the equivalent CTE-based query.
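
As a rough sketch of that Redshift pattern (table and schema names are illustrative, and this would run outside of dbt):

create temporary table my_data as
select ...
from db.schema.my_other_dbt_model;

-- refresh planner statistics so downstream joins/scans are well-optimized
analyze my_data;

select ...
from my_data;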

@dbeatty10
Contributor

Although not at all what we would recommend, the thing you described is actually possible via pre-hooks.

In your original description, you already explained how this doesn't work as-is:
your_table.sql

{{ config(
    materialized="table"
) }}

CREATE TEMPORARY TABLE my_data AS
      SELECT ...
      FROM {{ ref('my_other_dbt_model') }}
;

SELECT ...
FROM my_data

But if you just add a pre_hook configuration and shuffle the order of content slightly as seen below, it will work.

⚠️ Note: We'd consider using pre-hooks in this way an anti-pattern and not what we'd recommend!⚠️

Instead, we'd recommend putting that selection logic in a CTE, view, ephemeral, or table materialization. If you need to drop tables to save storage costs, then we'd recommend using an on-run-end hook as described in the example above.

your_table.sql

-- ⚠️ Using pre-hooks in this way is an anti-pattern! ⚠️
{{ config(
    materialized="table",
    pre_hook="

CREATE TEMPORARY TABLE my_data AS
      SELECT ...
      FROM {{ ref('my_other_dbt_model') }}"
) }}

SELECT ...
-- depends_on: {{ ref('my_other_dbt_model') }}
FROM my_data

See here for an explanation of why the dependency might need to be forced via -- depends_on: {{ ref('my_other_dbt_model') }}

Smaller data sets

For scenarios with a lot of trial & error iteration on a large data set, we'd recommend limiting the data like this, with a limit clause, or with Snowflake's tablesample functionality. Then you can rip it out once you're done iterating (or leave it in for future developers within a target.name == '...' filter).
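
For instance, a dev-only Snowflake sample might look like this (a sketch; the 10-percent sample rate and target name are arbitrary):

select ...
from {{ ref('my_other_dbt_model') }}
{% if target.name == 'dev' %}
tablesample (10)  -- dev-only: sample roughly 10% of rows
{% endif %}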

Hot data sets

But what about the goal of keeping a data set hot/available so you can iterate on a transformation query quickly?

A regular table materialization is the exemplary way to persist the result of a transformation within a dbt session or across dbt sessions.

But let's consider using a pre_hook to define a temp table instead. In this case, the data will disappear between iterations, and it will be re-built every time you run dbt. If the temp table takes a non-trivial amount of time to build, those time losses will add up over many iterations. With the table materialization, however, it would only incur a one-time cost until you re-run that particular upstream node. So using inline temp tables in a dbt model for the purpose of rapid development really has me scratching my head. 🤔

The only explanation I can think of:

  • Maybe you are doing all your development cycles using temp tables outside of dbt and then copy-pasting it into a dbt project once you are done? And then you're trying to make the minimum number of changes to the SQL after pasting?

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale label Mar 13, 2023
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.


@github-actions github-actions bot closed this as not planned Mar 20, 2023
@KeeonTabrizi
Author

Oh no closed!

At the end of the day, I believe leaving this unsupported is a bad outcome.

If one can build a model with a CTE you should be able to build it with a temp table 1:1.

If you need to get a model out the door and you are spending too much time waiting for results from your CTE, you shouldn't be forced to break up the model early in its development into materialized precursors, nor should you be forced to convert your temp-table work at the very end just because temp tables are not supported. There are real business constraints and timelines that not having this flexibility can penalize.

I understand your argument with respect to best practices, but ultimately it's simply a lack of support for a normal database mechanism in the name of a best practice, so I have a fundamental disagreement. I understand there's a whole other can of technical worms with respect to sessions/threading that gets introduced, but that is a different conversation.

@dongchris

Would love this feature as well! :)

@KeeonTabrizi
Author

@dbeatty10 can we open this back up?

@jtcohen6
Contributor

jtcohen6 commented Mar 24, 2023

@dongchris @KeeonTabrizi I think @dbeatty10 did a nice job above laying out the rationale for why dbt both cannot and does not support this, for reasons of technical limitation (how dbt manages database connections) and best practice (code & workflow in dev should closely mirror the code that will actually be running in prod), respectively.

In the meantime, if you really want this support, I'd recommend you try out Doug's suggestion above, where you pass your "CTE" into a pre-hook that creates it as a temp table. You could even macro-tize this logic, with create if not exists, so as to reuse it across your models, and avoid rebuilding if it already exists in your current session. Again, we won't be adding this into dbt-core natively, for all the reasons Doug articulated in his comments above.
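
A minimal sketch of that macro approach, assuming Snowflake's support for create table ... if not exists in a CTAS (the macro name, file path, and usage are illustrative, not an existing dbt API):

macros/create_temp_if_not_exists.sql

{% macro create_temp_if_not_exists(name, sql) %}
{# hypothetical helper: skips the rebuild if the temp table already exists in this session #}
create temporary table if not exists {{ name }} as (
    {{ sql }}
)
{% endmacro %}

A model would then call it from its pre-hook (hooks are rendered, so the macro call works inside the string):

{{ config(
    materialized="table",
    pre_hook="{{ create_temp_if_not_exists('my_data', 'select ... from ' ~ ref('my_other_dbt_model')) }}"
) }}

select ...
-- depends_on: {{ ref('my_other_dbt_model') }}
from my_data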

@dongchris

Thanks @jtcohen6 for the explanation. If we use the pre-hook approach, what happens when there are other pre-hooks already configured? Would adding it in the config this way override the existing ones?

@KeeonTabrizi
Author

@jtcohen6 -

@dbeatty10 did indeed do a great job re: the rationale, but I don't believe it was really about the technical reasons this is not supported. Perhaps that was implied, but the discussion and rationale were squarely on the analytics-engineering side.
