Snowflake: Show terse object running for all models when only calling single model #2673

Closed
4 tasks
brittianwarner opened this issue Jul 30, 2020 · 6 comments
Labels
bug Something isn't working wontfix Not a bug or out of scope for dbt-core

Comments

@brittianwarner

brittianwarner commented Jul 30, 2020

Describe the bug

When triggering only one model via --model <my_model_name>, dbt appears to run "show terse objects" for all models. I would expect it to run only for the specified model and its dependencies.

Steps To Reproduce

Trigger DBT Run for one model in a project with multiple models/packages

Expected behavior

I would expect show terse object to only occur for the selected model and all dependencies

Screenshots and log output

dbt run --profiles-dir . --vars '{"src_schema": "64", "target_schema": "64"}' --model linkedin

image

System information

Which database are you using dbt with?

  • [ ] postgres
  • [ ] redshift
  • [ ] bigquery
  • [x] snowflake
  • [ ] other (specify: ____________)

The output of dbt --version:

0.17.0

The operating system you're using:

linux

The output of python --version:

3.8

Additional context

All of my code is still working but it seems like a waste to run this query for all models when only a single model is specified

@brittianwarner brittianwarner added bug Something isn't working triage labels Jul 30, 2020
@brittianwarner brittianwarner changed the title Snowflake: Show terse schema running for all models when only calling one model Snowflake: Show terse schema running for all models when only calling single model Jul 30, 2020
@jtcohen6 jtcohen6 added wontfix Not a bug or out of scope for dbt-core and removed triage labels Jul 31, 2020
@jtcohen6
Contributor

Hey @brittianwarner, this is very much our intended behavior. I'll do my best to explain below. That said, I'm happy to keep the conversation going if you have other ideas.

dbt: At the beginning of each run, dbt caches information from the database to know about all the objects that exist in any of the databases + schemas where dbt plans to create objects. This is the most efficient and straightforward way to grab the information once, at the start of the run, rather than (as very old versions of dbt used to) querying metadata tables before every single model run. There isn't really a difference when you're only running one model, but when you're running many, it's huge.

Snowflake: Earlier this year, we changed from querying the information_schema to using show terse objects because it performs significantly better (#2174). Running show queries does not require a warehouse, so it does not queue up with other queries. If we were to try to filter via show objects like, using a case-insensitive pattern match, our understanding (from talking to Snowflake) is that this is no more performant than simply showing all objects; it limits the output, but Snowflake still has to scan all metadata records while pattern-matching. We would gain nothing performance-wise at the potential cost of excluding relevant metadata.
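
As a rough illustration of the caching step described above, here is a minimal Python sketch. It is not dbt's actual implementation: the `Node` type, the `cache_queries` helper, and the example model names are all hypothetical, and the exact statement dbt issues is an assumption. The point is only that the metadata queries are derived from every schema the project knows about, not from the models selected on the command line.

```python
from typing import Iterable, List, NamedTuple


class Node(NamedTuple):
    """Hypothetical stand-in for a model or source defined in the project."""
    name: str
    database: str
    schema: str


def cache_queries(nodes: Iterable[Node]) -> List[str]:
    """Return one metadata query per distinct (database, schema) pair."""
    seen = set()
    queries = []
    for node in nodes:
        # Compare case-insensitively so the same schema is only listed once.
        key = (node.database.lower(), node.schema.lower())
        if key in seen:
            continue
        seen.add(key)
        queries.append(f"show terse objects in {node.database}.{node.schema}")
    return queries


# Hypothetical project: four nodes spread over three schemas.
project = [
    Node("linkedin_ads", "edw_eng", "linkedin_64"),
    Node("linkedin_campaigns", "edw_eng", "linkedin_64"),
    Node("adwords_ads", "edw_eng", "adwords_64"),
    Node("hubspot_contacts", "edw_eng", "hubspot_64"),
]

for query in cache_queries(project):
    print(query)
```

In this sketch, running only the linkedin models would still produce queries for the adwords and hubspot schemas, which matches the behavior reported in this issue.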

@brittianwarner
Author

brittianwarner commented Jul 31, 2020

@jtcohen6, thanks for the awesome explanation. Makes sense. The biggest reason I brought this up is that we are passing variables to tell the model which schema to look at. In the example screenshot above, you can see two queries that failed because those schemas don't exist. Although this doesn't really break anything, we are technically running queries that fail each time we execute dbt. Given this context, I will leave it up to your team to decide whether this is a big deal or not.

@jtcohen6
Contributor

Got it, and appreciate the context. If you're materializing models in custom schemas, dbt should be trying to create those custom schemas if they don't already exist—is that different from what's happening on your end?

@brittianwarner brittianwarner changed the title Snowflake: Show terse schema running for all models when only calling single model Snowflake: Show terse object running for all models when only calling single model Jul 31, 2020
@brittianwarner
Author

brittianwarner commented Jul 31, 2020

The schema variables are passed so that the package/model knows which source and target schema to use for a specific query. In our case (using the screenshot above), we pass a number for these variables (ex: --vars '{"src_schema": "64", "target_schema": "64"}'). When we kick off a specific model where we know the schema exists (ex: linkedin), dbt runs 'show terse objects' for all models, even though the only model for which schema 64 exists is linkedin.

@brittianwarner
Author

@jtcohen6 Hope you had a good weekend. Just following up on this and making sure my last message made sense? Let me know if you need more info on my end.

@jtcohen6
Contributor

jtcohen6 commented Aug 4, 2020

Hey @brittianwarner, I think that makes sense. If I follow correctly, you have a single var called src_schema that is used to define several different sources; those sources may or may not exist for a given value of src_schema.

In another recent issue, I wrote a little about the opinionated principles and expectations that underlie dbt's relationship with the database. When dbt compiles a project, it expects all the resources defined in that project to have a sensible working relationship with the state of the database. It expects source schemas to already exist, and to have permissions to grab metadata about them, in the same way that it expects to have permissions to create models it knows about (and the schemas for those models). These abstractions, and their baked-in assumptions, get us quite far 98% of the time.

dbt does not try to cleverly account for source schemas that may or may not exist. Last Friday, Claire published a Discourse post that addresses this problem, with the intended audience of package creators, who may wish to write code in expectation of these edge cases.

All of that said, if I were you, I'd think about:

  1. Creating placeholder schemas in your database so that, for any given value (xx) of src_schema, there is always edw_eng.adwords_xx, edw_eng.hubspot_xx, edw_eng.linkedin_xx. This is the simplest, by far, but it may be controversial.
  2. Disabling certain source schemas when you know they don't exist, based on the value of src_schema, via a dynamic Jinja expression. This makes sense if it is much more common for a schema to be present than missing, and if you know exactly when it's the latter.
version: 2

sources:
    - name: adwords_
      schema: "{{ 'adwords_' ~ var('src_schema') }}"
      enabled: "{{ var('src_schema') not in ('12', '25', ...) }}"   # known subset missing adwords data
      database: edw_eng
      tables:
          - name: table1
            description: "abc123"
          - name: table2
            description: "def456"
  3. Using YML anchors (discourse) to define all possible source schemas at once, with as little duplication of code as possible. This makes sense if you don't have a ton of potential values of the src_schema var.
version: 2

sources:
    - name: adwords_01
      schema: adwords_01
      database: edw_eng
      tables: &adwordstables
          - name: table1
            description: "abc123"
          - name: table2
            description: "def456"

    - name: adwords_02
      schema: adwords_02
      tables: *adwordstables

      # skip adwords_03 because it doesn't exist!

    - name: adwords_04
      schema: adwords_04
      tables: *adwordstables
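
For the first option above (placeholder schemas), a minimal Python sketch of generating the statements might look like the following. The `placeholder_schema_ddl` helper and the `SOURCE_PREFIXES` list are hypothetical names introduced for illustration; the database and schema prefixes are taken from the examples in this thread, and the sketch assumes the role running the statements is allowed to create schemas in edw_eng.

```python
from typing import List

DATABASE = "edw_eng"
# Assumption: these are the only source families that share src_schema.
SOURCE_PREFIXES = ["adwords", "hubspot", "linkedin"]


def placeholder_schema_ddl(src_schema: str) -> List[str]:
    """Build one idempotent CREATE SCHEMA statement per source family."""
    return [
        f"create schema if not exists {DATABASE}.{prefix}_{src_schema};"
        for prefix in SOURCE_PREFIXES
    ]


for statement in placeholder_schema_ddl("64"):
    print(statement)
```

Because each statement uses `if not exists`, running the output repeatedly should leave existing schemas untouched while ensuring the metadata queries at the start of a run have something to look at.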
