Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1467] [CT-1466] [Regression] v1.3: views instead of temp tables cause parallel sessions interfere: possible data leak and failure #306

Closed
2 tasks done
mzayer-coveo opened this issue Nov 2, 2022 · 14 comments · Fixed by #379
Assignees
Labels
bug Something isn't working

Comments

@mzayer-coveo
Copy link

mzayer-coveo commented Nov 2, 2022

Is this a regression in a recent version of dbt-core?

  • I believe this is a regression in dbt-core functionality
  • I have searched the existing issues, and I could not find an existing issue for this regression

Current Behavior

We are using Prefect to run several instances of the same dbt-snowflake project with different variables in parallel. Almost randomly, the dbt task fails with this error:
SQL compilation error: Object 'XYZ__DBT_TMP' does not exist or not authorized.
Why?
The way dbt v <= 1.2 works is that it first creates a temporary table based on the dbt model like:
Create or replace temp table XYZ__dbt_tmp
And then merges this temporary table into the target table.
Temporary tables in snowflake are confined to the specific session in which they are created. If session 1 creates a temp table XYZ, session 2 can’t see temp table XYZ, so theoretically, infinite temp tables with the same name can coexist in parallel sessions without interfering with each other. When the session ends, the temporary table gets destroyed.
However, dbt v1.3 has replaced temporary tables with views. So it first creates a view like:
Create or replace view XYZ__dbt_tmp
And then merges the view into the target table, and finally drops the view like:
Drop view if exists XYZ__dbt_tmp
Views ARE NOT confined to sessions, so several parallel sessions trying to create views with the same name interfere with each other. This means that session 1 does create or replace view XYZ__dbt_tmp, session 2 does create or replace view XYZ__dbt_tmp, session 1 finishes the merge using the view created by session 2 and does a drop view if exists XYZ__dbt_tmp, and when session 1 wants to merge the XYZ__dbt_tmp view, it doesn’t exist anymore, hence we get the error
SQL compilation error: Object 'XYZ__DBT_TMP' does not exist or not authorized.
We use session variables to limit sessions to specific datasets. This behavior also causes data leak.

Expected/Previous Behavior

Expected behavior is for the views to show the same behavior as temp tables (being confined to a session). For example, suffixing the view name with the session id to differentiate between the view in different sessions.

Steps To Reproduce

This is not easy to reproduce, but if you trigger several runs of the same dbt project, and the timing is right (drop view in session 1 happens before merge in session 2), you can recreate it.

Relevant log output

No response

Environment

- OS: Irrelevant
- Python: 3.9
- dbt (working version): <=1.2
- dbt (regression version): 1.3

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

@mzayer-coveo mzayer-coveo added bug Something isn't working triage labels Nov 2, 2022
@github-actions github-actions bot changed the title [Regression] v1.3: views instead of temp tables cause parallel sessions interfere: possible data leak and failure [CT-1466] [Regression] v1.3: views instead of temp tables cause parallel sessions interfere: possible data leak and failure Nov 2, 2022
@jtcohen6
Copy link
Contributor

jtcohen6 commented Nov 2, 2022

Thanks for opening @mzayer-coveo. Leaving some quick notes below:

  • Moving this to dbt-snowflake, since it's related to the code there
  • Caused by the change in Remove use of temp table for Snowflake incremental merge #93, with the goal of significantly speeding up merge incremental models
  • It's officially unsupported / undocumented behavior to run the same dbt model multiple times concurrently

@jtcohen6 jtcohen6 transferred this issue from dbt-labs/dbt-core Nov 2, 2022
@github-actions github-actions bot changed the title [CT-1466] [Regression] v1.3: views instead of temp tables cause parallel sessions interfere: possible data leak and failure [CT-1467] [CT-1466] [Regression] v1.3: views instead of temp tables cause parallel sessions interfere: possible data leak and failure Nov 2, 2022
@Fleid
Copy link
Contributor

Fleid commented Nov 4, 2022

Hi @mzayer-coveo, could you please tell me more about your intended use case?

I can't figure out why you would want to run the same project multiple times in parallel?
Are you trying to multi-thread your data flow at the dbt level? Like processing one partition of data on each parallel run?

@Fleid Fleid self-assigned this Nov 4, 2022
@mzayer-coveo
Copy link
Author

mzayer-coveo commented Nov 4, 2022

Hi @mzayer-coveo, could you please tell me more about your intended use case?

I can't figure out why you would want to run the same project multiple times in parallel? Are you trying to multi-thread your data flow at the dbt level? Like processing one partition of data on each parallel run?

We have a multi-tenant architecture where each table contains the data of multiple tenants. To avoid accessing/mixing data of more than 1 tenant, we have views that do select * from multi_tenant_table where tenant_id = $tenant_id_variable. And subsequent models use that filtered view in their ref{}, instead of the table. So each table has its own filtered view, and the view is the only way to access it.
Then, in the dbt model, we have a sql_header that does set tenant_id_variable = '" ~ var('tenant_dbt_variable') ~ "';
In this way, we can invoke several instances of the same projects using dbt run --vars '{tenant_dbt_variable: tenant_1}'; ``dbt run --vars '{tenant_dbt_variable: tenant_2}'` etc, while remaining confident that even if sql queries are wrong, it's not possible to mix multiple tenants data. This design relies on sessions being independent which was satisfied by the use of temporary tables and snowflake session variables (both of which are bound to a session).

@Fleid
Copy link
Contributor

Fleid commented Nov 15, 2022

That makes sense. It still goes against the expectation that a single model should not be run multiple times concurrently. But it feels like a valid use case.

Anyway, if you don't have hard performance constraints - the reason why we switched methods in 1.3 - have you tried using the delete+insert strategy to force the incremental macro to use temporary tables instead of views?

@jtcohen6
Copy link
Contributor

One quick update here: It sounds like, for some users/tables, the switch to views over temp tables is actually a performance regression (#323). If that's the case, we may wish to make this an opt-in configuration instead. In the meantime, it's possible to switch back to using temp tables in your own project by reimplementing one macro:

{% macro dbt_snowflake_get_tmp_relation_type(strategy, unique_key, language) %}
    -- always table
    {{ return('table') }}
{% endmacro %}

@mzayer-coveo
Copy link
Author

Thanks for your replies. Both options are possible workarounds that we have considered. But we preferred to stick with V1.2 for now.
About JCohen's comment on performance, the use of temp tables (vs views) will also be an improvement in performance if several processes try to update the same table in snowflake because Snowflake locks an entire table in update/merge statements. By doing the processing (which is more time-consuming) in the temp table, we can compute several update processes on the same table in parallel and only the merge statements will be in trouble because of locking, but merges are relatively fast.

@jamesohare84
Copy link

jamesohare84 commented Dec 16, 2022

Hi

We think this caused a serious failure in our overnight run

We had this error on 53 different models in one dbt run

View 'DATABASE.SCHEMA.MODEL_NAME__DBT_TMP' does not exist or not authorized.

We are struggling to make sense of what actually happened. There were no concurrent processes running and the model in the error ran successfully. We are implementing the above macro now to prevent it happening again (we were using 1.3.1)

@jeremyyeo
Copy link
Contributor

Just to add to the parallel execution of models... it could occur when users have say 2 jobs that so happen to include the same incremental model:

  • Run every hour.
  • Run every 6 hours.

So the hourly job will encounter this issue every 6 hours (well depending on which job kicks off first and starts creating the temp view first / drops the temp view first).

@margushein
Copy link

margushein commented Jan 6, 2023

dbt v <= 1.2
Create or replace temporary table XYZ__dbt_tmp

dbt v >= 1.3
Create or replace view XYZ__dbt_tmp

to make parallel running possible, temporary view should be created instead of normal view
Create or replace temporary view XYZ__dbt_tmp

Possible solution:
in adapters.sql
{% macro snowflake__create_table_as(temporary, relation, compiled_code, language='sql') -%}
{% macro snowflake__create_view_as(relation, sql) -%}

to add temporary parameter and code logic into macro: snowflake__create_view_as(...) similarly how it is in snowflake__create_table_as(...)

like this:
{% macro snowflake__create_view_as(temporary,relation, sql) -%}

many cases there are multiple incremental merge processes executed in parallel like, and in dbt v >= 1.3 it is not possible anymore:
select a, b from myDB1.myschema.table_a -> myDB.myschema.daily_kpi
select a, b from myDB2.myschema.table_b -> myDB.myschema.daily_kpy
select a, b from myDB3.myschema.table_c -> myDB.myschema.daily_kpy
select a, b from myDB4.myschema.table_d -> myDB.myschema.daily_kpy

@jtcohen6
Copy link
Contributor

jtcohen6 commented Jan 6, 2023

Temporary views?? On Snowflake?? Apparently so: https://community.snowflake.com/s/article/How-To-Use-A-Temporary-View-In-Snowflake

(As noted in that article, the word temporary does not appear anywhere here, as of current writing: https://docs.snowflake.com/en/sql-reference/sql/create-view.html)

@VersusFacit
Copy link
Contributor

VersusFacit commented Jan 8, 2023

I wrote a bash script that triggers this behavior ✨ reliably ✨ ! I love an excuse to demo Bash's bizarre built in parallelized process management capabilities 👩‍🔬

  • this isn't guaranteed to always work, but on five out of five test invocations, this triggered the bug on my local machine
  • 1 thread set in my profile
  • orders refers to our trusty Jaffle shop project's orders
    • just set it to an incremental model with default behavior.

image

#!/bin/bash -x

# run job in the "background" where all such processes execute in parallel
dbt run -s orders &
# capture the process id for stopping it: this is to force the regression
job_one=$!

# let the job run until it hits the table creation step but before the drop table step, number depends on your machine
sleep 8
kill -STOP $job_one

# don't progress the script _too_ fast, again variable to your system
sleep 4

# let it spin
dbt run -s orders

# resume the original job, it'll fail some percentage of the time
kill -CONT $job_one

===

And with that, a fix is up!

@margushein
Copy link

@VersusFacit , true, you can not run it like that in parallel.

I have multitenant stage setup (DB1, DB2...) and running in parallel commands like this:

dbt run -s order -target DB1
dbt run -s order -target DB2

  1. merge from "select a, b from myDB1.myschema.stg_orders" INTO myDB.myschema.orders
  2. merge from "select a, b from myDB2.myschema.stg_orders" INTO myDB.myschema.orders

DBT creates Temp table (and view) in myDB, and not in myDB1 or myDB2.

It is a breaking change if table was created as temporary and now view is not created as temporary.

@VersusFacit
Copy link
Contributor

VersusFacit commented Jan 9, 2023

Hey @margushein 👋 thanks for the comment. Let's see if I got everything you wanted me to.

you can not run it like that in parallel

Just to be clear, the intent of my provided script is to document a way to trigger the underlying regression reliably since this is a super nasty bug to catch organically. That way, I could be confident adding the temporary predicate to view creation did fix the problem work (in my testing, it does!)

Re: "parallel." 🤔 Correct me if I'm wrong: I believe you use that term to describe simultaneous dbt invocations across different source databases that target the same relation. Where my test case differs is in execution: I set two parallel executions of the same model in motion. Everything just happens to live in the same database! As it turns out, my use case, like yours, will force two processes to depend on the same _tmp view.

Here's another single-tenant parallelized dbt run example that produces the error less reliably.

for i in {1..50}
do
  dbt run -s orders & # & means send the command to the "background" where processes execute in parallel
done

My hunch is that single/multi-tenant architecture is not the root of the problem. You can trigger the behavior on either a single or multi tenant architecture. It's just extremely unlikely (and ill-advised) someone would run into this in a single tenant setup [I myself had to add sleep commands to force the regression artificially].

So, while I'm not running this "in parallel" database-wise, I am running my test cases "in parallel" process-wise. Right now, I'm confident that's sufficient for triggering this regression. For what it's worth, I did some pretty long-winded testing on before-after adding temporary over in the PR linked in this issue. 🚀


To summarize:

  • In my examples, two or more dbt run -s order & (myDB.mySchema.orders) invocations will erroneously work on the same myDB.mySchema.orders__dbt_tmp.
  • In your example, dbt run -s order -target DB1 and dbt run -s order -target DB2 will erroneously work on the same myDB.mySchema.orders__dbt_tmp

The regression is caused by two different processes depending on the same myDB.mySchema.orders__dbt_tmp view. One process drops the view. The other process tries to drop the same view and fails.

🙏 Hence, I'm pretty sure I've covered your use case in my testing/solution.

❓ Does that make sense to you? Any lingering concerns?

@margushein
Copy link

@VersusFacit , all looks good!

Checked the pull request also. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
7 participants