
[CT-1507] [Regression] v1.3: views instead of temp tables cause performance issue #323

Closed
asouletdebrugiere opened this issue Nov 15, 2022 · 14 comments · Fixed by #389
Labels
bug Something isn't working


@asouletdebrugiere

Is this a regression in a recent version of dbt-snowflake?

  • I believe this is a regression in dbt-snowflake functionality
  • I have searched the existing issues, and I could not find an existing issue for this regression

Current Behavior

A change introduced in dbt 1.3.0 uses a view instead of a temporary table for incremental models when the merge incremental strategy is used.
I noticed this change because one of my hourly jobs now takes 30 minutes vs 3 minutes previously.
I know that using the delete+insert strategy would bring temp tables back, but we ran some tests and the performance of our jobs is better with merge.

Expected/Previous Behavior

Would it be possible to go back to temporary tables instead of views? Or at least to add a parameter that allows temporary tables when the incremental strategy is merge?

Steps To Reproduce

  1. Install dbt-core and dbt-snowflake v1.3
  2. Launch an incremental model with the merge incremental_strategy
  3. See that the query launched in Snowflake uses create or replace view vs create or replace temporary table in v<=1.2
  4. Compare the query time of the merge between the view and the temporary table
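
For context, the intermediate DDL differs roughly as follows. This is a minimal sketch with placeholder schema and model names; the __dbt_tmp suffix is what dbt uses for these intermediate relations:

-- v1.3: the intermediate relation is built as a view
create or replace view analytics.my_model__dbt_tmp as (
    select * from analytics.my_upstream_model -- compiled model SQL goes here
);

-- v1.2 and earlier: the intermediate relation is built as a temporary table
create or replace temporary table analytics.my_model__dbt_tmp as (
    select * from analytics.my_upstream_model -- compiled model SQL goes here
);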

Relevant log output

No response

Environment

- OS: Irrelevant
- Python: 3.9
- dbt-core (working version): <=1.2
- dbt-snowflake (working version): <=1.2
- dbt-core (regression version): 1.3
- dbt-snowflake (regression version): 1.3

Additional Context

No response

@asouletdebrugiere asouletdebrugiere added bug Something isn't working triage labels Nov 15, 2022
@github-actions github-actions bot changed the title [Regression] v1.3: views instead of temp tables cause performance issue [CT-1507] [Regression] v1.3: views instead of temp tables cause performance issue Nov 15, 2022
@jtcohen6
Contributor

@asouletdebrugiere This is surprising to hear! The entire justification for #93, which switched out temp tables for views, was to improve performance.

You can switch it back by reimplementing the macro named dbt_snowflake_get_tmp_relation_type in your own project. The purpose of this macro is to determine which type of intermediate relation to use:

{% macro dbt_snowflake_get_tmp_relation_type(strategy, unique_key, language) %}
    -- always table
    {{ return('table') }}
{% endmacro %}
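
Saving that as a .sql file in your project's macros/ directory is enough for it to take precedence over the adapter's built-in version of the same name.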

If you confirm that switching back to temp tables speeds up your incremental runs significantly, then we should preserve both behaviors, and make this an opt-in config instead of an across-the-board default. I'd also like to figure out why this is faster for some tables, and slower for others.

@asouletdebrugiere
Copy link
Author

Thanks @jtcohen6 for your quick answer! Overriding the dbt_snowflake_get_tmp_relation_type macro is exactly what I did to fix my issue, and my query time came back to 3 minutes vs 30 minutes during the issue.
So I agree this should be opt-in. Do you want me to make the change?
Unfortunately I can't share the tables I'm processing because they contain sensitive data, but I'd be happy to give you more info if needed.

@NiallRees
Contributor

NiallRees commented Nov 17, 2022

@asouletdebrugiere It would be very helpful to see the query profiles for:

Tmp view:

  • The slow 30-minute query that performs the merge and select from the view

Tmp table:

  • The query that produces the tmp table
  • The query that performs the merge and select from the table

@asouletdebrugiere
Author

Hi Niall,

Here they are:

Tmp view:

  • The slow 30-minute query that performs the merge and select from the view:
    [screenshot: view merge query profile]

Tmp table:

  • The query that produces the tmp table:
    [screenshot: temp table create query profile]

  • The query that performs the merge and select from the table:
    [screenshot: temp table merge query profile]

You can see that the view is causing a massive spill to disk on a Medium warehouse.

@NiallRees
Contributor

NiallRees commented Nov 21, 2022

Thanks @asouletdebrugiere - what's interesting is that the first query is scanning 11 billion rows from the target table on the final merge, whereas the merge from the tmp table is only scanning 520 million. It seems like the merge from the tmp table creates a join filter that is pushed down onto the target table scan, while the merge from the view does not, and that missing filter is what causes the massive disk spillage in the join step.

Join filter (from screenshot 3):
[screenshot: join filter pushed down onto target table scan]

No join filter (from screenshot 1):
[screenshot: no join filter on target table scan]

Would you be able to send screenshots of:

  1. From the first screenshot:
    a. The information in the right-hand section when highlighting this node: [screenshot of node]
    b. The information in the right-hand section when highlighting this node: [screenshot of node]
  2. From the third screenshot:
    a. The information in the right-hand section when highlighting this node: [screenshot of node]
    b. The information in the right-hand section when highlighting this node: [screenshot of node]
    c. The information in the right-hand section when highlighting this node: [screenshot of node]

@asouletdebrugiere
Author

Yes, sure:

  1. From the first screenshot:

a. [screenshot: 1a]
b. [screenshot: 1b]

Full "Equality Join Condition" is based on a surrogate key we created for the table :

(DBT_INTERNAL_DEST.EXTRACTION_ID =
MD5(
(IFNULL(TO_CHAR(TRY_TO_NUMBER(SPLIT_PART(DOCUMENT_ORCHESTRATOR.EXTERNAL_ID, '_', 1))), '')) || '-' ||
(IFNULL(TO_CHAR(IFF(DOCUMENT_ORCHESTRATOR.REGION = 'eu-west-1', 1, IFF(DOCUMENT_ORCHESTRATOR.REGION = 'us-east-1', 2, null))), '')) || '-' ||
(IFNULL(DOCUMENT_ORCHESTRATOR_DOC_CHECK_MEDIAS_VAL_DOC_FIELDS.FIELD_EXTRACTION_SOURCE, '')) || '-' ||
(IFNULL(DOCUMENT_ORCHESTRATOR_DOC_CHECK_MEDIAS_VAL_DOC_FIELDS.FIELD_NAME, '')) || '-' ||
(IFNULL(DOCUMENT_ORCHESTRATOR_DOC_CHECK_MEDIAS_VAL_DOC_FIELDS.VALUE, '')) || '-' ||
(IFNULL(IFNULL(NULLIF(DOCUMENT_ORCHESTRATOR_DOC_CHECK_MEDIAS.DOCUMENT_TYPE__SIDE, ''), DOCUMENT_ORCHESTRATOR_DOC_CHECK_MEDIAS.MEDIA__MEDIA_TYPE), '')) || '-' ||
(TO_CHAR(DOCUMENT_ORCHESTRATOR.TIMESTAMP))
)
)

  2. From the third screenshot:

a. [screenshot: 2a]
b. [screenshot: 2b]
c. [screenshot: 2c]

@jtcohen6
Contributor

jtcohen6 commented Dec 1, 2022

It sounds like there's merit in making this behavior opt-in, since there are cases where it's better and cases where it's worse.

Questions for us are:

  • What to name the config: tmp_relation_type: table|view?
  • What the default should be: keep 'view' (the change in v1.3), or revert to 'table' (v1.2 and previous)?

In the meantime, users can opt back into the previous temp table behavior by overriding the dbt_snowflake_get_tmp_relation_type macro, as in my comment above.
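
For illustration only, since the name is still up for discussion, the opt-in could eventually look like this in a model config:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    tmp_relation_type='table'
) }}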

@jtcohen6 jtcohen6 removed the triage label Dec 1, 2022
@JeremyNadal33320

Hi there,
Happy to find this. I have the exact same issue: as @jtcohen6 described, the new version is faster for 90% of my incremental models. Unfortunately, it slows down the remaining 10% by a factor of 11-13.

Therefore, I feel the default behavior should be reverted to 'table' to avoid unexplained long runs until the matter is better understood. Also, the naming sounds right.

As usual, you all rock, thanks for the time!

@colin-rogers-dbt
Contributor

My inclination would be to say the default should be whichever option is more performant in the majority of cases, and to better document the cases where this default can be problematic.

@Fleid
Contributor

Fleid commented Dec 7, 2022

OK, so I think we're good to move forward with adding a model-level config for incremental models using the merge strategy, to switch the temporary relation type from view (the default since 1.3) back to table.

  • For the naming, I can think of these options:
    • tmp_relation_type: because it's the name of the variable internally - but I actually don't like it for that very reason
    • incremental_tmp_relation_type: which opens the door for re-using it for other incremental strategies
    • merge_tmp_relation_type: which clarifies our intention, but we would need to enforce the distinction in the logic gate (sketched below, after this list):
      • FROM: strategy in ('default', 'append', 'merge')
      • TO: (strategy in ('default', 'append') or (strategy == 'merge' and merge_tmp_relation_type == 'view'))
  • We keep the default for incremental/merge as view. In the large majority of cases, this is the most performant option. Plus, hopefully the performance hit on the remaining cases is transient and will be solved by Snowflake.
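
As a rough Jinja sketch of that TO gate, assuming the config is read via config.get with a default of 'view' (hypothetical, not final code):

{%- set merge_tmp_relation_type = config.get('merge_tmp_relation_type', 'view') -%}
{%- if strategy in ('default', 'append')
    or (strategy == 'merge' and merge_tmp_relation_type == 'view') -%}
    {{ return('view') }}
{%- endif -%}
{# the existing unique_key fallback logic would follow unchanged #}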

Any preference @jtcohen6? I think I like merge_tmp_relation_type better.

@Fleid
Contributor

Fleid commented Dec 7, 2022

May be relevant to this conversation: dbt-labs/dbt-core#5236

@jtcohen6
Contributor

jtcohen6 commented Jan 5, 2023

Any preference @jtcohen6 ? I think I like merge_tmp_relation_type better.

Sorry I missed this last month — agree with your proposed name!

@dataders
Contributor

dataders commented Jan 6, 2023

Just sharing some thoughts from an optimizer guru at Snowflake, from an internal thread last month.

Screenshots are useful, but still a bit difficult to diagnose performance issues through. That said, the slow plan has the 11B row data set on the left (build) side of the hash-join, which is clearly suboptimal. I expect that the cardinality estimates used by the compiler were off by quite a wide margin, and that is probably due to the fact that this equi-join condition used this complex expression based on computed values instead of a column from the right (probe) side:

(DBT_INTERNAL_DEST.EXTRACTION_ID = MD5(
    (IFNULL(TO_CHAR(
        TRY_TO_NUMBER(SPLIT_PART(DOCUMENT_ORCHESTRATOR.EXTERNAL_ID, '_', 1))), ''
            )) || '-' ||
    (IFNULL(TO_CHAR(
        IFF(DOCUMENT_ORCHESTRATOR.REGION = 'eu-west-1', 1, IFF(DOCUMENT_ORCHESTRATOR.REGION = 'us-east-1', 2, null))), ''
            )) || '-' ||
    (IFNULL(
        DOCUMENT_ORCHESTRATOR_DOC_CHECK_MEDIAS_VAL_DOC_FIELDS.FIELD_EXTRACTION_SOURCE, ''
            )) || '-' ||
    (IFNULL(
        DOCUMENT_ORCHESTRATOR_DOC_CHECK_MEDIAS_VAL_DOC_FIELDS.FIELD_NAME, ''
            )) || '-' ||
    (IFNULL(
        DOCUMENT_ORCHESTRATOR_DOC_CHECK_MEDIAS_VAL_DOC_FIELDS.VALUE, ''
            )) || '-' ||
    (IFNULL(IFNULL(
        NULLIF(
            DOCUMENT_ORCHESTRATOR_DOC_CHECK_MEDIAS.DOCUMENT_TYPE__SIDE, ''
        ), DOCUMENT_ORCHESTRATOR_DOC_CHECK_MEDIAS.MEDIA__MEDIA_TYPE), ''
        )) || '-' ||
    (TO_CHAR(
        DOCUMENT_ORCHESTRATOR.TIMESTAMP)
        )
)

Even though this same join expression was probably (?) used in the faster query, in that execution there were only 2 tables (instead of 4), so there were only 2 permutations for join order, and I suspect that it got a bit “lucky” even with the poor cardinality estimates and picked the optimal plan (which puts the 11B row table on the probe side, not the build side). That is my best guess based on the info I am seeing in the DBT Labs report writeup. Does that help?

@dbeatty10
Contributor

As @McKnight-42 and I have been iterating on the implementation of this feature, a question came up about how we should handle the merge_tmp_relation_type config for strategies other than the default strategy of merge.

Consider the append strategy:

  • How should it behave if the user specifies merge_tmp_relation_type=table and strategy=append?
    • we currently would use a temporary view

Three options

We have three main options:

  1. Raise an exception that the merge_tmp_relation_type configuration doesn't apply to this strategy
  2. (Silently?) ignore the merge_tmp_relation_type configuration
  3. Expand the scope so that an incremental_tmp_relation_type config also applies to the rest of the incremental strategies (but only when provided)

Trade-offs

Primary trade-off of each option (in corresponding order):

  1. Could be frustrating to the user if they can control the type of temporary object for only the merge strategy but not the others.
  2. Could be confusing if they specify table but it uses a view anyway.
  3. A user could override to use a temporary object in a way that is less efficient than the alternative (i.e., configuring table in a case where view would have been faster). Also, the discerning eye will see that language != sql will still use table no matter what the user says.

Proposal

So I'd like to propose that we actually do the third option! This would expand the scope slightly to cover other incremental strategies also.

One advantage is that writing the documentation for the third option would be the simplest. It would also be the easiest to understand from a user perspective:

  • when this option is provided (for SQL models), the adapter will do whatever you say
  • otherwise, the adapter will do whatever it thinks best

Trying on Option Three

Suppose the new config name is tmp_relation_type. Then the pseudo logic would be:

if language == "sql" and tmp_relation_type == "table":
    return "table"
elif language == "sql" and tmp_relation_type == "view":
    return "view"
else:
    # fall through to whatever the current logic returns
    ...

As a user, I prefer the name tmp_relation_type over incremental_tmp_relation_type since it is shorter and not redundant with the materialization type of incremental.

Expanding this out to include all the current logic (and simplifying the language a teensy bit):

if language == "python":
    return "table"
elif tmp_relation_type == "table":
    return "table"
elif tmp_relation_type == "view":
    return "view"
elif strategy in ('default', 'merge', 'append'):
    return "view"
elif unique_key is None:
    return "view"
else:
    return "table"

Note that delete+insert is the only strategy not explicitly referenced by name; it will use view unless a unique_key is provided.
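
Translated into the macro itself, option three could look roughly like the sketch below. It assumes the config is read with config.get and keeps the current decision tree as the fallback; consider it pseudocode made concrete rather than the final implementation.

{% macro dbt_snowflake_get_tmp_relation_type(strategy, unique_key, language) %}
    {%- set tmp_relation_type = config.get('tmp_relation_type') -%}
    {%- if language == 'python' -%}
        {# Python models always need a table to materialize their DataFrame #}
        {{ return('table') }}
    {%- elif tmp_relation_type == 'table' -%}
        {{ return('table') }}
    {%- elif tmp_relation_type == 'view' -%}
        {{ return('view') }}
    {%- elif strategy in ('default', 'merge', 'append') -%}
        {{ return('view') }}
    {%- elif unique_key is none -%}
        {{ return('view') }}
    {%- else -%}
        {{ return('table') }}
    {%- endif -%}
{% endmacro %}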

Current Logic in PR #389

If we keep the original decision to have a merge_tmp_relation_type config that only applies to the merge strategy, then the logic would look essentially like this:

# TODO:
# Either raise an error if `merge_tmp_relation_type` is configured for a strategy other than `default` or `merge`
# or ignore altogether

if language == "python":
    return "table"
elif strategy in ('default', 'merge') and merge_tmp_relation_type == "table":
    return "table"
elif strategy in ('default', 'merge') and merge_tmp_relation_type == "view":
    return "view"
elif strategy in ('default', 'merge', 'append'):
    return "view"
elif unique_key is None:
    return "view"
else:
    return "table"

Ultimately, the two variations aren't substantially different from each other. However, if we ever want to come back and allow strategies other than merge to configure their temp object type, the implementation in code will get uglier, and the documentation will be more complicated.

If we allow all incremental strategies to accept this config in one fell swoop right now, then we'll offer a balance of rational defaults and configurability for the intermediate temp objects used by incremental models.
