
Use show xxx queries instead of information schema on Snowflake #1999

Closed
wants to merge 1 commit into from


drewbanin
Contributor

@drewbanin drewbanin commented Dec 11, 2019

Work in progress. This branch uses queries like show xxx in yyy instead of hitting the information schema. The hope is that this approach is more performant than the existing one.

TODO:

  • Verify that this is more performant than the information schema alternatives
  • Verify the logic used to get columns in tables
  • Handle quoting correctly
  • Handle the case when these queries return 10k records (some values are probably not returned...)
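
The last TODO item could be guarded with a simple check. This is a minimal sketch, not code from this branch: it assumes that show results are capped at 10,000 rows (as the item suggests), and the `fallback` callable is a hypothetical stand-in for the existing information schema query.

```python
# Assumption: a SHOW command returns at most this many rows.
SHOW_ROW_LIMIT = 10_000


def show_results_maybe_truncated(rows, limit=SHOW_ROW_LIMIT):
    """Return True when a SHOW result may have been cut off at the row limit."""
    return len(rows) >= limit


def list_relations(rows, fallback):
    """Use the SHOW rows unless they hit the limit; otherwise call the fallback.

    `fallback` is a hypothetical zero-argument callable, e.g. one that runs
    the information schema query instead.
    """
    if show_results_maybe_truncated(rows):
        return fallback()
    return rows
```

Hitting the limit exactly does not prove that rows are missing, so this check errs on the side of falling back.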

@pedromachados

pedromachados commented Dec 11, 2019

@drewbanin I ran a quick test against a project that takes a long time to start. Running a specific model selector that takes about 80 sec with dbt 0.15, this new version took 40 sec.

This project uses 9 custom schemas. Since the introspection queries are run sequentially, the time-to-first-model is proportional to the number of schemas in the project.

What I noticed is that the show x commands run very fast but select x from table(result_scan(last_query_id())) takes multiple seconds.

Can you use describe x and retrieve the results without calling result_scan? I used that approach in a macro and it worked OK. The describe object commands are now documented.

@drewbanin
Contributor Author

Awesome, thanks for giving it a spin @pedromachados!

What I noticed is that the show x commands run very fast but select x from table(result_scan(last_query_id())) takes multiple seconds.

I noticed this too! This happens because Snowflake does not allow us to submit multiple statements in the same query. Accordingly, there's some code in dbt that splits the show x; select ... from result_scan() statement into two separate queries. It's unclear to me why splitting the two queries up appears to be meaningfully slower than running them both in a single statement via the web console, but I definitely did observe this too.

So, one other option here is to just run show x, then post-process the results in Python instead of in SQL. That will save on the roundtrip and shouldn't be too hard to implement, but it is kind of a PITA....
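
A minimal sketch of what that Python post-processing might look like, assuming the rows come back as tuples alongside a list of column names (as a DB-API cursor would provide). The function names are hypothetical, and the column names used below are an assumption about the output of show terse objects, not something confirmed in this thread.

```python
def rows_to_dicts(column_names, rows):
    """Pair each row tuple with lower-cased column names."""
    keys = [name.lower() for name in column_names]
    return [dict(zip(keys, row)) for row in rows]


def relations_from_show(column_names, rows):
    """Keep only the fields needed to identify each relation.

    Assumes the result has database_name, schema_name, name and kind columns.
    """
    wanted = ("database_name", "schema_name", "name", "kind")
    return [
        {key: record[key] for key in wanted}
        for record in rows_to_dicts(column_names, rows)
    ]


# With a DB-API cursor, the inputs could be gathered roughly like this:
#   cursor.execute("show terse objects in schema my_db.my_schema")
#   column_names = [col[0] for col in cursor.description]
#   relations = relations_from_show(column_names, cursor.fetchall())
```

This replaces the second result_scan query with work done on the client, so only one statement is sent per schema.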

I'm happy to sub out show columns in table for describe table, but I did hit a weird permissions issue that I wanted to check out further before proceeding. It was unclear to me if I was doing something silly, or if desc ... required different permissions than show x in y.

Couple other things here:

  • if you only ran one model, then it's kind of foolish for dbt to find the tables in all 9 of your custom schemas. Avoiding these queries altogether is always going to result in the best possible performance here
  • how much of your updated 40s build time is attributable to these queries? One way to check this is to inspect the logs/dbt.log file and add up all the timing information for these introspective queries. I guess I'm curious if some large proportion of the runtime is attributable to running model code, or if most of it was spent hitting the information schema here.
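
Adding up the timing information could be scripted along these lines. This is a rough sketch that assumes the log contains lines of the form "SQL status: ... in 0.25 seconds"; the exact format of logs/dbt.log may differ between versions, so the pattern is an assumption to adjust. It also totals every matching line, so narrowing the sum to the introspective queries still means checking which statements the matched lines belong to.

```python
import re

# Assumed log line shape: "... SQL status: SUCCESS 1 in 0.25 seconds"
TIMING = re.compile(r"SQL status: .* in (\d+(?:\.\d+)?) seconds")


def total_query_seconds(lines):
    """Sum the reported durations from every line that matches TIMING."""
    total = 0.0
    for line in lines:
        match = TIMING.search(line)
        if match:
            total += float(match.group(1))
    return total


# Example usage against a log file:
#   with open("logs/dbt.log") as handle:
#       print(total_query_seconds(handle))
```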

@jtalmi

jtalmi commented Dec 11, 2019

fyi, i just learned that roles need monitor privileges to run desc on a warehouse (usage + operate are not sufficient), whereas the actual error message is "insufficient permissions to operate on X". desc could have quirky privilege requirements

@nehiljain

I am not sure if SHOW commands are any more performant than directly hitting the information schema. We have a ticket open with Snowflake to look into why it takes 10 mins to run a show grants query. It potentially uses the metadata layer as well.

It would be great to have some performance testing done on this approach to show whether it is better than using the information schema.

@pedromachados

@drewbanin I did another run. From start to finish, it took about 60 sec:

  • 15 sec. are spent before the introspection queries start
  • The introspection queries take 19 sec
  • Building models and running post hooks takes 24 sec.

10 schemas are inspected (8 custom + seed + default). I agree that dbt could be smarter about analyzing only schemas involved in a given run.
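
Limiting introspection to the schemas involved in a run might look like the sketch below. It is illustrative only: the dictionaries with "database" and "schema" keys are a hypothetical stand-in for the selected models, not dbt's internal representation.

```python
def schemas_to_inspect(selected_models):
    """Return the distinct (database, schema) pairs used by the selected models."""
    return sorted({(model["database"], model["schema"]) for model in selected_models})


# Hypothetical selection: three models living in two schemas.
selected = [
    {"name": "orders", "database": "analytics", "schema": "marts"},
    {"name": "customers", "database": "analytics", "schema": "marts"},
    {"name": "stg_orders", "database": "analytics", "schema": "staging"},
]
pairs = schemas_to_inspect(selected)  # two pairs instead of one per project schema
```

With a single selected model, this yields one schema to inspect rather than all 10.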

The queries that use result_scan take about 1 second each. The show terse objects in x ones usually take about 0.2-0.3 sec with the longest one taking 0.65 sec.

@drewbanin
Contributor Author

@jtalmi you said

fyi, i just learned that roles need monitor privileges to run desc on a warehouse (usage + operate are not sufficient), whereas the actual error message is "insufficient permissions to operate on X". desc could have quirky privilege requirements

I did notice some weird quirks around describe <object> regarding permissions. The describe table|view docs aren't clear about this, and the describe schema statement doesn't appear to be documented at all. I do think that if we make a change here, show <objects> in <object> is going to be our best bet.

@nehiljain were you able to run dbt with this branch? The information schema queries on the Snowflake cluster that I have access to are relatively fast, so I'm not able to tell if show <objects> is significantly faster than selecting from the information schema.

@drewbanin
Contributor Author

These queries don't appear to be meaningfully faster than hitting the information schema in practice :/

Closing this out, but happy to reopen if anyone finds that they're getting better performance with this approach.

@drewbanin drewbanin closed this Jan 30, 2020
@kwigley kwigley deleted the feature/snowflake-show-queries branch February 5, 2021 17:17

4 participants