Replace information_schema queries with faster alternatives on Snowflake #1877

Closed
pedromachados opened this issue Oct 29, 2019 · 17 comments
Labels
enhancement New feature or request snowflake

Comments

@pedromachados

Describe the feature

When dbt starts, it runs a query to the information_schema for every schema in the project. This happens even if the run involves a single model (single schema).

Each of these queries is taking anywhere from 4-20 seconds, presumably depending on how much load the overall Snowflake system has across accounts.

These queries seem to be running on the main thread and are therefore sequential. We have a project with 9 schemas with a time-to-first-model of close to 90 seconds. As you can imagine, this is a huge productivity drag.

We are contacting Snowflake about speeding up information_schema queries but this could also be improved if dbt ran these queries in multiple threads and if it only ran queries for the schemas involved in the given run.

Also, I believe the show tables or show views commands could be used in this particular case instead of queries to the information schema; these take on the order of 100-200 ms.
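As a rough sketch, the schema-scoped commands would look something like this (database and schema names taken from the log below):

    show tables in schema aibi_analytics_db.dbt_pedro_sol_matching;
    show views in schema aibi_analytics_db.dbt_pedro_sol_matching;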

Below is one of these queries which took over 12 seconds:

2019-10-29 12:00:19,554 (MainThread): Acquiring new snowflake connection "list_relations_without_caching".
2019-10-29 12:00:19,554 (MainThread): Re-using an available connection from the pool.
2019-10-29 12:00:19,554 (MainThread): Using snowflake connection "list_relations_without_caching".
2019-10-29 12:00:19,554 (MainThread): On list_relations_without_caching: BEGIN
2019-10-29 12:00:20,197 (MainThread): SQL status: SUCCESS 1 in 0.64 seconds
2019-10-29 12:00:20,197 (MainThread): Using snowflake connection "list_relations_without_caching".
2019-10-29 12:00:20,197 (MainThread): On list_relations_without_caching: select
      table_catalog as database,
      table_name as name,
      table_schema as schema,
      case when table_type = 'BASE TABLE' then 'table'
           when table_type = 'VIEW' then 'view'
           when table_type = 'MATERIALIZED VIEW' then 'materializedview'
           else table_type
      end as table_type
    from aibi_analytics_db.information_schema.tables
    where table_schema ilike 'dbt_pedro_sol_matching'
      and table_catalog ilike 'aibi_analytics_db'
2019-10-29 12:00:32,862 (MainThread): SQL status: SUCCESS 19 in 12.66 seconds

Describe alternatives you've considered

I inquired whether a macro could be used to override the information schema queries but was told it's not possible.

Additional context

Snowflake

Who will this benefit?

This will speed up time-to-first-model for Snowflake projects with multiple schemas.

@pedromachados pedromachados added enhancement New feature or request triage labels Oct 29, 2019
@pedromachados pedromachados changed the title Parallelize information_schema queries on Snowflake Reduce time-to-first-model on Snowflake Nov 1, 2019
@christopher-tin

+1 to this issue as it's a huge scalability concern for us as well. If we can modify this model metadata scan process to use the thread count in our profiles.yml file, it would be a great improvement.

@drewbanin
Contributor

Hey @pedromachados - I'm not sure that we'll want to start with parallelizing these queries - I'd be much more in favor of using show schemas, show tables, etc etc in lieu of information_schema queries! Even if we did parallelize these, if one of them takes 20 seconds to complete, that's still too slow for us to work with IMO.

I think we discussed this on Slack, but there are some real challenges we'd need to account for in using show ... instead of select ... from information_schema.<table>.

For one, the show ... queries only return a maximum of something like 10k objects. If we tried to run show tables in database ..., there's a super real chance that we'd hit this maximum in even moderately sized warehouses! So, we'd need to use show tables in schema <database>.<schema> which is also challenging because we'd need to quote these identifiers exactly correctly. This is super doable for dbt, but quoting on Snowflake is always a big pain!

For two, show columns returns different data than the results returned from the information schema. This might be tractable for us, but it's a big change for us to make!

I'm super keen to make this change - going to queue it up for a future release.
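To make the first point concrete, the two forms in question look roughly like this (names reused from the log above; the quoted spelling is only illustrative):

    -- database-wide: subject to the ~10k object cap described above
    show tables in database aibi_analytics_db;
    -- schema-scoped: avoids the cap, but identifiers must be quoted exactly as stored
    show tables in schema "AIBI_ANALYTICS_DB"."dbt_pedro_sol_matching";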

@drewbanin drewbanin added snowflake and removed triage labels Nov 1, 2019
@drewbanin drewbanin added this to the 0.15.2 milestone Nov 1, 2019
@pedromachados
Author

@drewbanin Thanks for looking into this.

What if you run show schemas in database <database> first and then do the case-insensitive search in python to find the correct capitalization of the schema name? Then you can use it to run show tables in schema <database>.<schema> with the correct capitalization.
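Roughly:

    show schemas in database aibi_analytics_db;
    -- match 'dbt_pedro_sol_matching' case-insensitively against the "name" column in python,
    -- then re-use the exact stored spelling (the upper-cased form below is just an example):
    show tables in schema aibi_analytics_db."DBT_PEDRO_SOL_MATCHING";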

If you give me some pointers on where to go in the code I could take a stab at this and create a PR over the next couple of weeks.

@pedromachados
Author

@drewbanin
I reached out to Snowflake about the information_schema query performance and the differences between show tables and the results of the information_schema query.
They are looking into the first item but for the second one they suggested we use describe table.
I see that this does provide more specific information on datatypes compared to show columns.
I realize this is different from the schema queries we've been discussing in this issue, but I wonder if it could be a viable way to replace information schema queries completely.
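For example (the table name here is hypothetical):

    describe table aibi_analytics_db.dbt_pedro_sol_matching.my_model;

This returns one row per column with its name, type, nullability, default, and so on.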

@drewbanin
Contributor

Thanks @pedromachados! I didn't know about describe table ... on Snowflake. That does sound like a good approach for inspecting the columns in a given table. It looks like there are analogous describe database and describe schema queries that could also be tractable here. It doesn't look like describe database and describe schema are currently documented though :/

The thing I like about describe instead of show is that there doesn't appear to be a 10k limit on the returned resultset. Are you able to inquire with your Snowflake rep if these describe ... statements are fully supported at present?
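If they behave the way the names suggest, a sketch would be:

    describe schema aibi_analytics_db.dbt_pedro_sol_matching;  -- objects in the schema
    describe database aibi_analytics_db;                       -- schemas in the database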

@pedromachados
Author

Hi @drewbanin

Here is what they told me:

  • The Desc Table command is supported by us and there are no plans to discontinue it.
  • There is no documented limit on desc table, since a single table is unlikely to produce 10k rows of output.
  • The RESULT_SCAN function can be used in combination with the DESC TABLE command to get its output in table form, against which we can run our own queries (see the sketch below).
  • Like the SHOW command, the DESC command is also limited to 10k rows of output. This may be an issue if a schema has more than 10k tables in it.
  • One engineer suggested the GET_DDL command to get the column details of the tables [perhaps in combination with show columns].
  • The information_schema views are the only ones with no limit on how many rows they can return.

Based on this, when inspecting schemas, how do you feel about running one of the fast commands first and, if you get 10k rows back, falling back to a query against the information_schema?
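For reference, the RESULT_SCAN pattern they describe would look roughly like this (table name hypothetical):

    desc table aibi_analytics_db.dbt_pedro_sol_matching.my_model;
    select "name", "type", "null?"
    from table(result_scan(last_query_id()))
    where "kind" = 'COLUMN';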

@drewbanin drewbanin changed the title Reduce time-to-first-model on Snowflake Replace information_schema queries with faster alternatives on Snowflake Nov 25, 2019
@nehiljain

nehiljain commented Nov 27, 2019

@drewbanin is it possible to use an equality predicate instead of ilike as an interim improvement, short of moving off information_schema entirely?

@drewbanin
Contributor

@nehiljain I did some timing work on this front a while back and I actually didn't notice any appreciable timing difference between = and ilike. The issue didn't appear to be an unrestrictive predicate, but instead, a fundamental slowness with queries against the information schema. There's no reason why queries against the information_schema.schemata table should take > 60 seconds (there should be one record in this table per schema) and yet, we sometimes see that happening in practice.

@jtalmi

jtalmi commented Dec 3, 2019

When you execute "dbt run", doesn't it run the same information_schema query for every model?

e.g.:

select
  column_name,
  data_type,
  character_maximum_length,
  numeric_precision,
  numeric_scale
from DB_APPS.information_schema.columns
where table_name ilike 'model1'
  and table_schema ilike 'test_schema'
  and table_catalog ilike 'database'
order by ordinal_position

These queries are taking minutes for us. Why not just pull information_schema.columns into memory once and filter it using python? Or even batch it if it's a very large table.

Alternatively, what if dbt built a model at the beginning of the DAG that selected information schema columns like so: "select * from information_schema.columns"? If dbt gave us an option in project.yml to specify the source table for the information schema, we could create a table model pointing to it and then dbt could run using that.
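Something along these lines, as a hypothetical model file that materializes the view once per run:

    {{ config(materialized='table') }}

    select *
    from {{ target.database }}.information_schema.columns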

@nehiljain

@drewbanin I just had a call with Snowflake engineers, and they told me that the metadata engine does not respect SQL query order of execution, so ilike matters a lot. As the size of the metadata layer grows, it takes longer to apply the filter before returning the data. select * from DB_APPS.information_schema.columns is some kind of view on top of their datastore. I think materializing the information schema before we run all our queries, and then using that, is a good solution here.

@drewbanin
Contributor

Thanks for the info @nehiljain. Are you able to run some benchmarking queries? I'd be curious how the performance of the information_schema queries that dbt runs changes if you replace ilike with =. This would certainly be a simpler change than using show objects in ... everywhere!

This is an important thing to check, as Snowflake have made similar claims in the past, but I didn't notice any huge performance difference between like and = in practice the last time I checked! This was months ago, so unsure if that's changed recently.
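For example, both forms of the relations query from earlier in the thread (keeping in mind that unquoted identifiers are stored upper-cased, so the equality literals have to match exactly):

    -- current form
    select table_name, table_type
    from aibi_analytics_db.information_schema.tables
    where table_schema ilike 'dbt_pedro_sol_matching'
      and table_catalog ilike 'aibi_analytics_db';

    -- equality form
    select table_name, table_type
    from aibi_analytics_db.information_schema.tables
    where table_schema = 'DBT_PEDRO_SOL_MATCHING'
      and table_catalog = 'AIBI_ANALYTICS_DB';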

@drewbanin
Contributor

hey @jtalmi - different information_schema requests are handled differently. dbt generally needs to know about:

  1. the tables and views that exist in the database
  2. for certain models (i.e. incremental models), the columns in those tables

dbt fetches information about tables and views on a schema basis. If dbt needs to know if a table called analytics.my_table exists, dbt will fetch all of the tables in the analytics schema, then cache those results in memory. Since dbt exists to create tables and views, you can imagine that there's a lot of tricky logic required to keep this cache in sync with reality!

For incremental models, dbt needs to know the columns present in the destination incremental table. dbt uses this information to build an insert statement that maps the columns in the model SQL query to the columns in the physical table in the database. We could certainly cache these results in memory as well, but in practice, this doesn't work so well. When we've tried to do things like this in the past, we've run into opaque Snowflake errors indicating that the query has returned too much data (see: #1127).
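As an illustration only (the column names and temp-relation name are made up), the kind of column-mapped statement described above looks something like:

    insert into analytics.my_table (id, updated_at, status)
    select id, updated_at, status
    from analytics.my_table__dbt_tmp;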

So, Snowflake really puts us in a pretty rough spot here. If we try to fetch data more aggressively and cache the results, we might see Information schema query returned too much data. Please repeat query with more selective predicates. If we are instead more conservative with individual queries to the information schema, then we run into the issues you're seeing here, @jtalmi.

I like your idea of copying the information schema at startup. I just fear that this will fail for some Snowflake users in opaque ways with basically no recourse. I'll also say: sometimes dbt needs to run SQL to build tables, then check the types of the columns in that new table. Some sort of global cache that's populated at startup would unfortunately not help us there.

I really do appreciate your input here! If you have any other thoughts/questions/comments, I'd love to hear them!

@pedromachados
Author

@drewbanin do the describe commands still look like a good way to go? I think the only issue is dealing with the 10k record limit we might run into on large schemas.

If this is the only obstacle, do you think a workaround could be to run describe first and if we get 10k records (or an error saying the query returned too many records), then run the information_schema query?
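Sketching that fallback with names reused from earlier in the thread:

    -- fast path
    show tables in schema aibi_analytics_db.dbt_pedro_sol_matching;
    -- fallback, only if the fast path returns exactly 10k rows (i.e. probably truncated)
    select table_name, table_type
    from aibi_analytics_db.information_schema.tables
    where table_schema = 'DBT_PEDRO_SOL_MATCHING';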

@drewbanin
Contributor

drewbanin commented Dec 11, 2019

Hey y'all, are any of you able to check out this (very developmental) branch? #1999

I gave this a spin on our own Snowflake instance and the results weren't super inspiring (it shaved 2s off a 12s runtime, which isn't nothing!) but I'm also not seeing multiple-minute delays for the information schema queries here either.

Please check out the description in the PR for information about the caveats of this branch. It will not work if your databases or schema names require quotes, or if you have > 10,000 tables in a single schema.

If you're able to run this locally for your dbt projects, please tell me about any performance benefit you observe in the PR comments! Let me know if you have any questions, or if you need a hand with running dbt from this branch.

@drewbanin
Contributor

Some updates:

I'm going to close this issue out in favor of more discrete, actionable issues, but I want to thank everyone for contributing on this thread! Happy to continue the discussion below if anyone has further thoughts :)

@matt-winkler
Contributor

@jtcohen6 I noticed that dbt docs generate still hits the information schema vs. using describe table on Snowflake. Is there a difference in the data we're able to return from the info schema that's essential for producing correct docs, or could we theoretically speed up docs generation in a similar manner?

@jtcohen6
Contributor

jtcohen6 commented Sep 5, 2022

@matt-winkler We can hit the information schema once per schema/database, rather than running a separate describe query for every single table that needs documenting. We could run those describe queries in parallel (up to the number of --threads), but I'd be surprised if that's faster overall than running one query per schema/database (also in parallel). Still, could be worth benchmarking, if you're up for it!
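For comparison, the once-per-schema version of the column query quoted earlier in the thread would look roughly like this:

    select table_name, column_name, data_type,
           character_maximum_length, numeric_precision, numeric_scale
    from DB_APPS.information_schema.columns
    where table_schema ilike 'test_schema'
      and table_catalog ilike 'database'
    order by table_name, ordinal_position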

Longer-term, I want to see us get away from massive batch queries run during the docs generate command, in favor of running metadata queries for dbt objects right after they've been materialized, and then exposing that metadata via structured logging. Then, coupled with a service to ingest and aggregate metadata from those logs, the docs will be "always ready," and in near-real time. We're still a little ways away from that future, but we're taking concrete steps to get there, and that's where I want to keep our focus.
