Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1507] create config for incremental models to use temp tables or views to fix regression in perf #389

Merged
merged 13 commits into from
Jan 17, 2023

Conversation

McKnight-42
Copy link
Contributor

@McKnight-42 McKnight-42 commented Jan 11, 2023

resolves #323

Description

create a model level config for incremental models to allow choice between using views or temp tables during merge strategy to help with performance issues in some cases.

co-author: @dbeatty10

Checklist

@McKnight-42 McKnight-42 added the Skip Changelog Skips GHA to check for changelog file label Jan 11, 2023
@McKnight-42 McKnight-42 self-assigned this Jan 11, 2023
@cla-bot cla-bot bot added the cla:yes label Jan 11, 2023
@McKnight-42 McKnight-42 requested review from Fleid and VersusFacit and removed request for nathaniel-may January 12, 2023 15:21
@McKnight-42 McKnight-42 removed the Skip Changelog Skips GHA to check for changelog file label Jan 12, 2023
@McKnight-42
Copy link
Contributor Author

Want to flag this interesting suggestion @dbeatty10 shared with me here #323 (comment) as part of the review discussion. @Fleid any thoughts on this?

Copy link
Contributor

@colin-rogers-dbt colin-rogers-dbt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggested an alternate, maybe cleaner conditional setup but otherwise LGTM

{{ return('view') }}
{% else %} {#-- play it safe -- #}
{{ return('table') }}
{% if language == "sql" and strategy in ("default", "merge") and merge_tmp_relation_type == "table" %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we simplify and do the if language == "sql" condition once?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like this maybe? not tested or anything...

Suggested change
{% if language == "sql" and strategy in ("default", "merge") and merge_tmp_relation_type == "table" %}
{% if language == "sql"%}
{% if strategy in ("default", "merge")%}
{% if merge_tmp_relation_type == "table" %}
{{ return("table") }}
{% else %}
{{ return("view") }}
{% elif strategy == "append" %}
{{ return("view") }}
{% elif unique_key is none %}
{{ return("view") }}
{% else %}
{{ return("table") }}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@colin-rogers-dbt What do you think about alternative pseudo code for doing the language condition only once?

It's largely the same as what you proposed. The main difference is that it inverts the sql condition so that it doesn't need any nested conditionals.

if language != "sql":
    return "table"
else if strategy in ('default', 'merge') and merge_tmp_relation_type == "table":
    return "table"
else if strategy in ('default', 'merge') and merge_tmp_relation_type == "view":
    return "view"
else if strategy in ('default', 'merge', 'append'):
    return "view"
else if unique_key is none:
    return "view"
else:
    return "table"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The != logic is definitely much cleaner

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if merge_tmp_relation_type == "table" would we ever return "view"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe so this is actually the crux of the change we are making, trying to offer older method of using table via a config option so as long as they are saying its a merge method and specify their tmp_relation type I think we are saying they are actively choosing then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I think we can actually simplify this further:

if language != "sql":
    return "table"
else if strategy in ('append'):
    return "view"
else if merge_tmp_relation_type == "table":
    return "table"
else if merge_tmp_relation_type == "view":
    return "view"
else if unique_key is none:
    return "view"
else:
    return "table"

Copy link
Contributor Author

@McKnight-42 McKnight-42 Jan 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think that would make the fact that merge_tmp_relation_type only technically applies to the merge/default strategy as they are the same in snowflake? I think we wanted to break it out to more readable so users could figure out their configuration needs if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@colin-rogers-dbt Love the way you are trying to make the logic as simple as possible! We just need to make sure to cover each of the moving pieces.

My understanding is that dbt-snowflake supports three different incremental strategies:

  1. merge (which is the default)
  2. append
  3. delete+insert

Something to point out: there's the largely implicit case when strategy == "delete+insert" to consider. In the current version of the PR, it comes into play when all other strategies have already been considered and unique_key is none. If your suggestion is implemented as-is, then the configuration of merge_tmp_relation_type would affect merge and delete+insert, but it wouldn't affect append.

Leaning into your idea, I suspect it would be like:

if language != "sql":
    return "table"
else if strategy in "append":
    return "view"
else if strategy == "delete+insert" and unique_key is none:
    return "view"
else if merge_tmp_relation_type == "table":
    return "table"
else if merge_tmp_relation_type == "view":
    return "view"
else:
    return "table"

Or a more future-proof version presuming some other strategy might be added at a later date:

if language != "sql":
    return "table"
else if strategy in "append":
    return "view"
else if strategy == "delete+insert" and unique_key is none:
    return "view"
else if strategy in ("default", "merge") and merge_tmp_relation_type == "table":
    return "table"
else if strategy in ("default", "merge") and merge_tmp_relation_type == "view":
    return "view"
else:
    return "table"

Side note: here is a proposal to make both the logic and the documentation as simple as possible. It would need buy-in from @Fleid. That proposal is non-blocking for this PR though.

{{ return('table') }}
{% if language != "sql" %}
{{ return("table") }}
{% elif strategy in ('default', 'merge') and merge_tmp_relation_type == "table"%}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jinja formatting nit to add missing whitespace:

Suggested change
{% elif strategy in ('default', 'merge') and merge_tmp_relation_type == "table"%}
{% elif strategy in ('default', 'merge') and merge_tmp_relation_type == "table" %}

@@ -10,10 +10,18 @@
for faster overall incremental processing.
#} */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@McKnight-42 we should update the comments in L3-L11 to align with our changes.

Maybe something generally like the following? There's lots of different ways we could write the comments here. A key thing to communicate is why different strategies use either a view or a temporary table. It would be nice if we make it explicit rather than having to do detective work, especially in the case where unique_key is none.

(I didn't bother to limit the line lengths below while we are iterating on the exact wording, but we'll want to add appropriate new lines in the final commit.)

  /* {#
       High-level principles:
       If we are running multiple statements (DELETE + INSERT),
       and we want to guarantee identical inputs to both statements, then
       we _must_ first save the model query results as a temporary table (which presumably comes with a performance cost).
       If we are running a single statement (MERGE or INSERT alone),
       we _may_ save the model query definition as a view instead,
       for (presumably) faster overall incremental processing.

       Low-level specifics:
       Languages other than SQL (like Python) will use a temporary table.
       With the `default` strategy of `merge`, the user may choose between a temporary table and view (defaulting to view).
       The `append` strategy can use a view because it will run a single INSERT statement.
       When the `unique_key` is none, then we can use a view because it will run a single INSERT statement.
       Otherwise, play it safe by using a temporary table.
  #} */

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great! will push up the change to the wording asap

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome @McKnight-42 !

Could you choose a maximum line length for the comments and then apply it? It looks like it was using a max line length of 70 previously. Using either 80 or 88 seems reasonable also.

{{ return("view") }}
{% elif strategy in ('default', 'merge', 'append') %}
{{ return("view") }}
{% elif unique_key is none %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conversation here makes me think we should make this update in order to future-proof this logic:

Suggested change
{% elif unique_key is none %}
{% elif strategy == 'delete+insert' and unique_key is none %}

Otherwise, if some new strategy is added that uses multiple statements in its implementation (like a delete step and an insert step), it might get handled incorrectly accidentally when unique_key is none.

@McKnight-42 McKnight-42 merged commit 45948bd into main Jan 17, 2023
@McKnight-42 McKnight-42 deleted the mcknight/ct-1507 branch January 17, 2023 21:33
colin-rogers-dbt pushed a commit that referenced this pull request Jan 19, 2023
…views to fix regression in perf (#389)

* init push for pr to work on regression in temp tables vs views for dbt-snowflake

* adding additional table selection logic, functionality works locally based off snowflake query history

* extend condition logic to smaller more readable elif's take into account some lost table options

* add changie

* breaking out language selection for more readability, reorganizing conditional flow

* typo fix

* spacing fix

* change messaging for incremental model strategy selection

* add direct reference to delete+insert

* changing comment formatting and structure
colin-rogers-dbt added a commit that referenced this pull request Jan 19, 2023
* Mp/reduce ci daily run intervals (#392)

* Only run at 7:00 am est (before folks get into the office)

* Split jobs and adjust timestamps.

* Trim comments.

* Need both workflows on each job.

Co-authored-by: Mila Page <[email protected]>

* [CT-1507] create config for incremental models to use temp tables or views to fix regression in perf (#389)

* init push for pr to work on regression in temp tables vs views for dbt-snowflake

* adding additional table selection logic, functionality works locally based off snowflake query history

* extend condition logic to smaller more readable elif's take into account some lost table options

* add changie

* breaking out language selection for more readability, reorganizing conditional flow

* typo fix

* spacing fix

* change messaging for incremental model strategy selection

* add direct reference to delete+insert

* changing comment formatting and structure

* Convert incremental on_schema_change tests (#398)

* Convert incremental on_schema_change tests

* Switch to dbt-core main

* [CT-1507] part 2: Expand configuration to all strategies (#403)

* init push for pr to work on regression in temp tables vs views for dbt-snowflake

* adding additional table selection logic, functionality works locally based off snowflake query history

* extend condition logic to smaller more readable elif's take into account some lost table options

* add changie

* breaking out language selection for more readability, reorganizing conditional flow

* typo fix

* spacing fix

* change messaging for incremental model strategy selection

* add direct reference to delete+insert

* changing comment formatting and structure

* expand tmp_relation_type to be configurable for all incremental options

* re organize conditionl logic, add leading condtional for exceptions

* typo

* seperate if statements

* fix exception blocking

* seperate execptions condtional into two small condtionals to prevent a always case

* change conditional logic based on language

* change delete+insert exception logic, and messages for both exceptions

* change condition for more jinja language

* remove default value assigned to tmp_relation_type, change delete+insert excpetion to only kick off if view is assigned and not unique_key not none

* reword python exception

* minor comment change, revert some condtional logic (need to deal with none) value

* re add assignment of config

* comment changes

* put back low level part about append strategy

* minor change to workign for excpetion

* add changelog entry

* fix previous changelog entry

* change change log entry, remove some config from integration test

* remove unnded comma (#405)

Co-authored-by: Mila Page <[email protected]>
Co-authored-by: Mila Page <[email protected]>
Co-authored-by: Matthew McKnight <[email protected]>
Co-authored-by: Jeremy Cohen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-1507] [Regression] v1.3: views instead of temp tables cause performance issue
3 participants