
[CT-1487] [Feature] Allow DBT Models to Reference Created Temporary Tables prior to final result set & materialization. #6234

Closed
KeeonTabrizi opened this issue Nov 10, 2022 · 15 comments
Labels
enhancement (New feature or request) · stale (Issues that have gone stale)

Comments

@KeeonTabrizi

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

This feature request is to support using temporary tables (within a single session) instead of CTE references in the execution of a single dbt model.

To be clear, as I've seen this confusion arise: this is not a request to materialize a temporary object, but rather to utilize a temporary object in the materialization itself.

I believe it would require a model-specific override to limit execution to within 1 thread / session (vs. the global thread default in dbt_project.yml). Additionally, the compilation / execution order would first require the execution of the temporary objects, followed by the remaining query (which could actually include a CTE). For the examples below I will assume a Snowflake DB.

Currently, dbt compiles a model like the one below:

dbt_model.sql

{{ config(
    alias="your_table",
    materialized="table"
) }}

WITH my_data AS (
      SELECT ...
      FROM {{ ref('my_other_dbt_model') }}
)

SELECT ...
FROM my_data

The run target may look something like:

CREATE OR REPLACE TABLE db.schema.your_table AS (
WITH my_data AS (
      SELECT ...
      FROM db.schema.my_other_dbt_model
)

SELECT ...
FROM my_data

);

A simple version of this query that utilizes a temporary table could look as follows:

dbt_model_temp_table.sql

{{ config(
    alias="your_table",
    materialized="table"
) }}

CREATE TEMPORARY TABLE my_data AS
      SELECT ...
      FROM {{ ref('my_other_dbt_model') }}
;

SELECT ...
FROM my_data

The run target will currently generate invalid SQL as follows:

CREATE OR REPLACE TABLE db.schema.your_table AS (
CREATE TEMPORARY TABLE my_data AS
      SELECT ...
      FROM db.schema.my_other_dbt_model
;

SELECT ...
FROM my_data

);

The compiled version should first execute any create statements and then wrap the final query which generates the result set to materialize:

CREATE TEMPORARY TABLE my_data AS
      SELECT ...
      FROM db.schema.my_other_dbt_model
;

CREATE OR REPLACE TABLE db.schema.your_table AS ( -- what dbt would materialize
  SELECT ...
  FROM my_data
);

What I believe is needed is an additional config argument on the model to indicate that temporary tables are used. This would force a single session/thread for the model's execution and re-order its execution: first running the temporary-object statements in the session, then generating the final result set for dbt to materialize.
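
For illustration, the earlier dbt_model_temp_table.sql might gain a hypothetical config like this. Note that use_temp_tables and single_session are made-up option names to sketch the idea, not existing dbt config arguments:

-- NOTE: use_temp_tables and single_session are hypothetical options, not real dbt config
{{ config(
    alias="your_table",
    materialized="table",
    use_temp_tables=true,
    single_session=true
) }}

CREATE TEMPORARY TABLE my_data AS
      SELECT ...
      FROM {{ ref('my_other_dbt_model') }}
;

SELECT ...
FROM my_data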

Describe alternatives you've considered

Converting everything to CTEs or creating individual upstream models is the only option today. However, for the development / iteration cycles around complex queries and transformations where temporary tables are used, I feel strongly that having the option to materialize those queries directly through dbt models would be a win for developers.

Who will this benefit?

This will benefit anyone developing complex queries, as well as those already using temporary tables, which can improve debugging and the speed of query development since results remain available to an open session.

It allows a developer to move a query > model > production more quickly, and provides that option versus being forced to convert a query into CTEs and/or break up a query into component models with dependencies.

While this may theoretically be better data-engineering practice, it shouldn't be forced by a lack of support for temporary tables.

See this comment which also describes some of the benefit: #2725 (comment)

Are you interested in contributing this feature?

Happy to ideate

Anything else?

Tangentially related issues & PRs:

@KeeonTabrizi KeeonTabrizi added the enhancement and triage labels Nov 10, 2022
@github-actions github-actions bot changed the title [Feature] Allow DBT Models to Reference Created Temporary Tables prior to final result set & materialization. [CT-1487] [Feature] Allow DBT Models to Reference Created Temporary Tables prior to final result set & materialization. Nov 10, 2022
@dbeatty10
Contributor

@KeeonTabrizi thanks for a thorough write-up and links to tangential issues and PRs!

We are always interested in discussing opportunities to improve debugging / speed of query development.

Question:

  • Is this feature request primarily intended for use during iterative development? Or is it intended for deployment to production as well?

@KeeonTabrizi
Author

@dbeatty10 thanks for the comment. It's definitely intended for production use.

The iterative development cycle can yield something production-ready through these temporary objects, and it's helpful for that to be an option. The alternative is to force translation of these objects back into CTEs, which adds more overhead to the process and ultimately lengthens the time to get something out the door when it was otherwise ready to ship.

@dbeatty10
Contributor

Gotcha @KeeonTabrizi

What are you trying to optimize for in the iterative development cycle?

  1. Having all the logic you are currently working on within the same file
  2. Re-use within multiple downstream models
  3. Persisting a large data set to save time of re-materializing it during each dev re-build/re-run cycle
  4. Not persisting certain data sets beyond the end of a build (presumably to save storage costs)
  5. Something else?

Options

As you're probably well aware, there are a variety of options for developing dbt models, including:

  • table materialization (permanent or transient)
  • ephemeral materialization
  • view materialization
  • explicit CTE within one of the above

Let's explore what each of those are optimized for:

|  | table | ephemeral | view | explicit CTE |
| --- | --- | --- | --- | --- |
| A) Having all the logic you are currently working on within the same file |  |  |  | ✅ |
| B) Re-use within multiple downstream models | ✅ | ✅ | ✅ |  |
| C) Persisting a large data set to save time of re-materializing it during each dev re-build/re-run cycle | ✅ |  |  | 💡 |
| D) Not persisting certain data sets beyond the end of a build (presumably to save storage costs) | 💡 | ✅ | ✅ | ✅ |

We can see in the chart above that A) and B) are generally opposites, as are C) and D). 'Tis the nature of trade-offs!

Table materialization

Optimized for B) and C)

💡 But we can unlock D) by mimicking the way Snowflake temporary tables behave in practice!

You'd just add drop table if exists statements as on-run-end hooks if you don't want the table to stick around beyond a full dbt run/build.

The lifespan of these "temp" tables would be limited to a dbt "session" (spanning Snowflake connections) rather than just a single Snowflake connection session.

Ephemeral and view materializations

Optimized for B) and D)

Explicit CTE

Optimized for A) and D).

💡 Although they are not optimized for C), adding a limit X during development can reduce the size of the data set and speed up cycle times.
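
A minimal sketch of that pattern, assuming a dev target named 'dev' (the target name and row count are arbitrary):

with my_data as (
    select ...
    from {{ ref('my_other_dbt_model') }}
    {% if target.name == 'dev' %}
    limit 1000  -- dev-only: arbitrary row cap to speed up iteration
    {% endif %}
)

select ...
from my_data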

Example

"Temp" tables within a dbt "session"

my_data.sql

{{ config(
    materialized="table"
) }}

select ...
from {{ ref('my_other_dbt_model') }}

your_table.sql

{{ config(
    materialized="table"
) }}

select ...
from {{ ref('my_data') }}

dbt_project.yml

...
# on-run-end:
#   - "drop table if exists {{ ref('my_data') }};"

☝️ Uncomment the above once you want it to act like a "temp" table in production. Leaving it commented will allow it to persist across sessions during development.

Usage

Rebuilding just the downstream table (leaving the "temp" table as-is):

dbt build -s your_table

Rebuilding both the "temp" table and its downstream consumer:

dbt build -s my_data your_table

@KeeonTabrizi
Author

@dbeatty10 many thanks for the detailed response. I will make sure to review it in detail and respond; the week is busy with holidays, but please give me a bit and I will make sure to provide responses.

@KeeonTabrizi
Author

@dbeatty10 hope my responses help. I'd also be happy to set up a quick 15-minute chat if that helps.

What are you trying to optimize for in the iterative development cycle?

  1. Having all the logic you are currently working on within the same file
  2. Re-use within multiple downstream models
  3. Persisting a large data set to save time of re-materializing it during each dev re-build/re-run cycle
  4. Not persisting certain data sets beyond the end of a build (presumably to save storage costs)
  5. Something else?

With regard to the scenarios you described above, #1 and #3 are relevant; #2 and #4 are not part of my considerations.

Overall, I do not believe "mimicking the way Snowflake temporary tables behave in practice" is practical for the development cycle I'm thinking of.

Consider a situation where one needs to build a model that likely requires drafting a complex query (many dependencies, many transformations) with a high computation cost and long run-time. Let's also add the real-world constraint that time is limited and it's not feasible to build the best, most streamlined dbt pipeline to achieve the model.

Often, when faced with such a task, it is unknown what the ideal composite pieces of data are to achieve the end result. It is a series of trial and error, and it takes multiple iterations to understand how to break the data up into logical components/transformations. Temporary tables (instead of CTEs) allow one to keep data hot/available within a session so you can iterate on the query quickly. The end result is really no different from using CTEs; most complex queries could also start with a series of CTEs, and dbt has no issues with that.

I don't think this is a question of best practices in analytics engineering; it's about supporting native functions of a database in the generation of data. In some databases like Redshift, if your sort keys etc. are not well matched to how the data is used, CTEs can cause major slowdowns, while simply declaring temp tables equivalent to the CTEs and asking Redshift to ANALYZE the temporary objects before downstream use can significantly outperform the equivalent CTE-based query.
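
As a rough sketch of that Redshift pattern (table and schema names are illustrative, and this would run outside of dbt):

create temporary table my_data as
select ...
from db.schema.my_other_dbt_model;

-- refresh planner statistics so downstream joins/scans are well-optimized
analyze my_data;

select ...
from my_data;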

@dbeatty10
Contributor

Although not at all what we would recommend, the thing you described is actually possible via pre-hooks.

In your original description, you already explained how this doesn't work as-is:
your_table.sql

{{ config(
    materialized="table"
) }}

CREATE TEMPORARY TABLE my_data AS
      SELECT ...
      FROM {{ ref('my_other_dbt_model') }}
;

SELECT ...
FROM my_data

But if you just add a pre_hook configuration and shuffle the order of content slightly as seen below, it will work.

⚠️ Note: We'd consider using pre-hooks in this way an anti-pattern and not what we'd recommend!⚠️

Instead, we'd recommend putting that selection logic in a CTE, view, ephemeral, or table materialization. If you need to drop tables to save storage costs, then we'd recommend using an on-run-end hook as described in the example above.

your_table.sql

-- ⚠️ Using pre-hooks in this way is an anti-pattern! ⚠️
{{ config(
    materialized="table",
    pre_hook="

CREATE TEMPORARY TABLE my_data AS
      SELECT ...
      FROM {{ ref('my_other_dbt_model') }}"
) }}

SELECT ...
-- depends_on: {{ ref('my_other_dbt_model') }}
FROM my_data

See here for an explanation of why the dependency might need to be forced via -- depends_on: {{ ref('my_other_dbt_model') }}

Smaller data sets

For scenarios with a lot of trial & error iteration on a large data set, we'd recommend limiting the data like this, with a limit clause, or with Snowflake's tablesample functionality. Then you can rip it out once you're done iterating (or leave it in for future developers within a target.name == '...' filter).
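
For instance, a dev-only Snowflake sample might look like this (a sketch; the 10-percent sample rate and target name are arbitrary):

select ...
from {{ ref('my_other_dbt_model') }}
{% if target.name == 'dev' %}
tablesample (10)  -- dev-only: sample roughly 10% of rows
{% endif %}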

Hot data sets

But what about the goal of keeping a data set hot/available so you can iterate on a transformation query quickly?

A regular table materialization is the exemplary way to persist the result of a transformation within a dbt session or across dbt sessions.

But let's consider using a pre_hook to define a temp table instead. In this case, the data will disappear between iterations, and it will be re-built every time you run dbt. If the temp table takes a non-trivial amount of time to build, those time losses will add up over many iterations. With the table materialization, however, it would only incur a one-time cost until you re-run that particular upstream node. So using inline temp tables in a dbt model for the purpose of rapid development really has me scratching my head. 🤔

The only explanation I can think of:

  • Maybe you are doing all your development cycles using temp tables outside of dbt and then copy-pasting it into a dbt project once you are done? And then you're trying to make the minimum number of changes to the SQL after pasting?

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale label Mar 13, 2023
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.


@github-actions github-actions bot closed this as not planned Mar 20, 2023
@KeeonTabrizi
Author

Oh no closed!

At the end of the day, I believe leaving this unsupported is a bad outcome.

If one can build a model with a CTE you should be able to build it with a temp table 1:1.

If you need to get a model out the door and you are spending too much time waiting for results from your CTE, you shouldn't be forced to break up the model early in its development into materialized precursors, nor should you be forced to convert your temp-table work at the very end just because temp tables are not supported. There are real business constraints and timelines that not having this flexibility can penalize.

I understand your argument with respect to best practices, but ultimately it's simply a lack of support for a normal database mechanism in the name of a best practice, so I have a fundamental disagreement. I understand there's a whole other can of technical worms with respect to sessions/threading that gets introduced, but that is a different conversation.

@dongchris

Would love this feature as well! :)

@KeeonTabrizi
Author

@dbeatty10 can we open this back up?

@jtcohen6
Contributor

jtcohen6 commented Mar 24, 2023

@dongchris @KeeonTabrizi I think @dbeatty10 did a nice job above laying out the rationale for why dbt both cannot and does not support this, for reasons of technical limitation (how dbt manages database connections) and best practice (code & workflow in dev should closely mirror the code that will actually be running in prod), respectively.

In the meantime, if you really want this support, I'd recommend you try out Doug's suggestion above, where you pass your "CTE" into a pre-hook that creates it as a temp table. You could even macro-tize this logic, with create if not exists, so as to reuse it across your models, and avoid rebuilding if it already exists in your current session. Again, we won't be adding this into dbt-core natively, for all the reasons Doug articulated in his comments above.
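
A minimal sketch of that macro approach, assuming Snowflake's support for create table ... if not exists in a CTAS (the macro name, file path, and usage are illustrative, not an existing dbt API):

macros/create_temp_if_not_exists.sql

{% macro create_temp_if_not_exists(name, sql) %}
{# hypothetical helper: skips the rebuild if the temp table already exists in this session #}
create temporary table if not exists {{ name }} as (
    {{ sql }}
)
{% endmacro %}

A model would then call it from its pre-hook (hooks are rendered, so the macro call works inside the string):

{{ config(
    materialized="table",
    pre_hook="{{ create_temp_if_not_exists('my_data', 'select ... from ' ~ ref('my_other_dbt_model')) }}"
) }}

select ...
-- depends_on: {{ ref('my_other_dbt_model') }}
from my_data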

@dongchris

Thanks @jtcohen6 for the explanation. If we use the pre-hook approach, what happens when there are other pre-hooks already configured? Would adding it in the config this way override the existing ones?

@KeeonTabrizi
Author

@jtcohen6 -

@dbeatty10 did indeed do a great job re: the rationale, but I don't believe it was really about the technical reasons this is not supported. Perhaps that was implied, but the discussion and rationale were squarely on the analytics-engineering side.
