Replace information_schema queries with faster alternatives on Snowflake #1877

Closed
pedromachados opened this issue Oct 29, 2019 · 17 comments
Labels
enhancement New feature or request snowflake

Comments

@pedromachados

Describe the feature

When dbt starts, it runs a query to the information_schema for every schema in the project. This happens even if the run involves a single model (single schema).

Each of these queries is taking anywhere from 4-20 seconds, presumably depending on how much load the overall Snowflake system has across accounts.

These queries seem to be running on the main thread and are therefore sequential. We have a project with 9 schemas with a time-to-first-model of close to 90 seconds. As you can imagine, this is a huge productivity drag.

We are contacting Snowflake about speeding up information_schema queries but this could also be improved if dbt ran these queries in multiple threads and if it only ran queries for the schemas involved in the given run.

Also, I believe the show tables or show views commands could be used in this particular case instead of queries to the information schema; these take on the order of 100-200 ms.
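As a rough sketch, the schema-scoped commands would look something like this (database and schema names taken from the log below):

    show tables in schema aibi_analytics_db.dbt_pedro_sol_matching;
    show views in schema aibi_analytics_db.dbt_pedro_sol_matching;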

Below is one of these queries which took over 12 seconds:

2019-10-29 12:00:19,554 (MainThread): Acquiring new snowflake connection "list_relations_without_caching".
2019-10-29 12:00:19,554 (MainThread): Re-using an available connection from the pool.
2019-10-29 12:00:19,554 (MainThread): Using snowflake connection "list_relations_without_caching".
2019-10-29 12:00:19,554 (MainThread): On list_relations_without_caching: BEGIN
2019-10-29 12:00:20,197 (MainThread): SQL status: SUCCESS 1 in 0.64 seconds
2019-10-29 12:00:20,197 (MainThread): Using snowflake connection "list_relations_without_caching".
2019-10-29 12:00:20,197 (MainThread): On list_relations_without_caching: select
      table_catalog as database,
      table_name as name,
      table_schema as schema,
      case when table_type = 'BASE TABLE' then 'table'
           when table_type = 'VIEW' then 'view'
           when table_type = 'MATERIALIZED VIEW' then 'materializedview'
           else table_type
      end as table_type
    from aibi_analytics_db.information_schema.tables
    where table_schema ilike 'dbt_pedro_sol_matching'
      and table_catalog ilike 'aibi_analytics_db'
2019-10-29 12:00:32,862 (MainThread): SQL status: SUCCESS 19 in 12.66 seconds

Describe alternatives you've considered

I inquired whether a macro could be used to override the information schema queries but was told it's not possible.

Additional context

Snowflake

Who will this benefit?

This will speed up time-to-first-model for Snowflake projects with multiple schemas.

@pedromachados pedromachados added enhancement New feature or request triage labels Oct 29, 2019
@pedromachados pedromachados changed the title Parallelize information_schema queries on Snowflake Reduce time-to-first-model on Snowflake Nov 1, 2019
@christopher-tin

+1 to this issue as it's a huge scalability concern for us as well. If we can modify this model metadata scan process to use the thread count in our profiles.yml file, it would be a great improvement.

@drewbanin
Contributor

Hey @pedromachados - I'm not sure that we'll want to start with parallelizing these queries - I'd be much more in favor of using show schemas, show tables, etc etc in lieu of information_schema queries! Even if we did parallelize these, if one of them takes 20 seconds to complete, that's still too slow for us to work with IMO.

I think we discussed this on Slack, but there are some real challenges we'd need to account for in using show ... instead of select ... from information_schema.<table>.

For one, the show ... queries only return a maximum of something like 10k objects. If we tried to run show tables in database ..., there's a super real chance that we'd hit this maximum in even moderately sized warehouses! So, we'd need to use show tables in schema <database>.<schema> which is also challenging because we'd need to quote these identifiers exactly correctly. This is super doable for dbt, but quoting on Snowflake is always a big pain!

For two, show columns returns different data than the results returned from the information schema. This might be tractable for us, but it's a big change for us to make!

I'm super keen to make this change - going to queue it up for a future release.
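To make the first point concrete, the two forms in question look roughly like this (names reused from the log above; the quoted spelling is only illustrative):

    -- database-wide: subject to the ~10k object cap described above
    show tables in database aibi_analytics_db;
    -- schema-scoped: avoids the cap, but identifiers must be quoted exactly as stored
    show tables in schema "AIBI_ANALYTICS_DB"."dbt_pedro_sol_matching";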

@drewbanin drewbanin added snowflake and removed triage labels Nov 1, 2019
@drewbanin drewbanin added this to the 0.15.2 milestone Nov 1, 2019
@pedromachados
Author

@drewbanin Thanks for looking into this.

What if you run show schemas in database <database> first and then do the case-insensitive search in python to find the correct capitalization of the schema name? Then you can use it to run show tables in schema <database>.<schema> with the correct capitalization.
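Roughly:

    show schemas in database aibi_analytics_db;
    -- match 'dbt_pedro_sol_matching' case-insensitively against the "name" column in python,
    -- then re-use the exact stored spelling (the upper-cased form below is just an example):
    show tables in schema aibi_analytics_db."DBT_PEDRO_SOL_MATCHING";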

If you give me some pointers on where to go in the code I could take a stab at this and create a PR over the next couple of weeks.

@pedromachados
Author

@drewbanin
I reached out to Snowflake about the information_schema query performance and the differences between show tables and the results of the information_schema query.
They are looking into the first item but for the second one they suggested we use describe table.
I see that this does provide more specific information on datatypes compared to show columns.
I realize this is different from the schema queries we've been discussing in this issue, but I wonder if it could be a viable way to replace information schema queries completely.
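For example (the table name here is hypothetical):

    describe table aibi_analytics_db.dbt_pedro_sol_matching.my_model;

This returns one row per column with its name, type, nullability, default, and so on.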

@drewbanin
Contributor

Thanks @pedromachados! I didn't know about describe table ... on Snowflake. That does sound like a good approach for inspecting the columns in a given table. It looks like there are analogous describe database and describe schema queries that could also be tractable here. It doesn't look like describe database and describe schema are currently documented though :/

The thing I like about describe instead of show is that there doesn't appear to be a 10k limit on the returned resultset. Are you able to inquire with your Snowflake rep if these describe ... statements are fully supported at present?
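If they behave the way the names suggest, a sketch would be:

    describe schema aibi_analytics_db.dbt_pedro_sol_matching;  -- objects in the schema
    describe database aibi_analytics_db;                       -- schemas in the database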

@pedromachados
Author

Hi @drewbanin

Here is what they told me:

  • The Desc Table command is supported by us and there are no plans to discontinue it.
  • There is no documented limit on desc table, since a single table is unlikely to produce 10k rows of output.
  • The RESULT_SCAN function can be used in combination with the DESC TABLE command to get its output in table form, against which we can run our own queries (see the sketch below).
  • Like the SHOW command, the DESC command is also limited to 10k rows of output. This may be an issue if a schema has more than 10k tables in it.
  • One engineer suggested the GET_DDL command to get the column details of the tables [perhaps in combination with show columns].
  • The information_schema views are the only ones with no limit on how many rows they can return.

Based on this, when inspecting schemas, how do you feel about running one of the fast commands first and, if you get 10k rows back, falling back to a query against the information_schema?
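For reference, the RESULT_SCAN pattern they describe would look roughly like this (table name hypothetical):

    desc table aibi_analytics_db.dbt_pedro_sol_matching.my_model;
    select "name", "type", "null?"
    from table(result_scan(last_query_id()))
    where "kind" = 'COLUMN';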

@drewbanin drewbanin changed the title Reduce time-to-first-model on Snowflake Replace information_schema queries with faster alternatives on Snowflake Nov 25, 2019
@nehiljain

nehiljain commented Nov 27, 2019

@drewbanin is it possible to use an equality predicate instead of ilike as an interim improvement, short of moving off information_schema entirely?

@drewbanin
Contributor

@nehiljain I did some timing work on this front a while back and I actually didn't notice any appreciable timing difference between = and ilike. The issue didn't appear to be an unrestrictive predicate, but instead, a fundamental slowness with queries against the information schema. There's no reason why queries against the information_schema.schemata table should take > 60 seconds (there should be one record in this table per schema) and yet, we sometimes see that happening in practice.

@jtalmi

jtalmi commented Dec 3, 2019

When you execute "dbt run", doesn't it run the same information_schema query for every model?

e.g.:

select
  column_name,
  data_type,
  character_maximum_length,
  numeric_precision,
  numeric_scale
from DB_APPS.information_schema.columns
where table_name ilike 'model1'
  and table_schema ilike 'test_schema'
  and table_catalog ilike 'database'
order by ordinal_position

These queries are taking minutes for us. Why not just pull information_schema.columns into memory once and filter it using python? Or even batch it if it's a very large table.

Alternatively, what if dbt built a model at the beginning of the DAG that selected information schema columns like so: "select * from information_schema.columns"? If dbt gave us an option in project.yml to specify the source table for the information schema, we could create a table model pointing to it and then dbt could run using that.
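Something along these lines, as a hypothetical model file that materializes the view once per run:

    {{ config(materialized='table') }}

    select *
    from {{ target.database }}.information_schema.columns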

@nehiljain

@drewbanin I just had a call with Snowflake engineers, and they told me that the metadata engine does not respect SQL query order of execution, so ilike matters a lot. As the size of the metadata layer grows, it takes longer to apply the filter before returning the data. select * from DB_APPS.information_schema.columns is some kind of view on top of their datastore. I think materializing the information schema before we run all our queries, and then using that, is a good solution here.

@drewbanin
Contributor

Thanks for the info @nehiljain. Are you able to run some benchmarking queries? I'd be curious how the performance of the information_schema queries that dbt runs changes if you replace ilike with =. This would certainly be a simpler change than using show objects in ... everywhere!

This is an important thing to check, as Snowflake have made similar claims in the past, but I didn't notice any huge performance difference between like and = in practice the last time I checked! This was months ago, so unsure if that's changed recently.
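For example, both forms of the relations query from earlier in the thread (keeping in mind that unquoted identifiers are stored upper-cased, so the equality literals have to match exactly):

    -- current form
    select table_name, table_type
    from aibi_analytics_db.information_schema.tables
    where table_schema ilike 'dbt_pedro_sol_matching'
      and table_catalog ilike 'aibi_analytics_db';

    -- equality form
    select table_name, table_type
    from aibi_analytics_db.information_schema.tables
    where table_schema = 'DBT_PEDRO_SOL_MATCHING'
      and table_catalog = 'AIBI_ANALYTICS_DB';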

@drewbanin
Contributor

hey @jtalmi - different information_schema requests are handled differently. dbt generally needs to know about:

  1. the tables and views that exist in the database
  2. for certain models (i.e. incremental models), the columns in those tables

dbt fetches information about tables and views on a schema basis. If dbt needs to know if a table called analytics.my_table exists, dbt will fetch all of the tables in the analytics schema, then cache those results in memory. Since dbt exists to create tables and views, you can imagine that there's a lot of tricky logic required to keep this cache in sync with reality!

For incremental models, dbt needs to know the columns present in the destination incremental table. dbt uses this information to build an insert statement that maps the columns in the model SQL query to the columns in the physical table in the database. We could certainly cache these results in memory as well, but in practice, this doesn't work so well. When we've tried to do things like this in the past, we've run into opaque Snowflake errors indicating that the query has returned too much data (see: #1127).
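As an illustration only (the column names and temp-relation name are made up), the kind of column-mapped statement described above looks something like:

    insert into analytics.my_table (id, updated_at, status)
    select id, updated_at, status
    from analytics.my_table__dbt_tmp;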

So, Snowflake really puts us in a pretty rough spot here. If we try to fetch data more aggressively and cache the results, we might see Information schema query returned too much data. Please repeat query with more selective predicates. If we are instead more conservative with individual queries to the information schema, then we run into the issues you're seeing here, @jtalmi.

I like your idea of copying the information schema at startup. I just fear that this will fail for some Snowflake users in opaque ways with basically no recourse. I'll also say: sometimes dbt needs to run SQL to build tables, then check the types of the columns in that new table. Some sort of global cache that's populated at startup would unfortunately not help us there.

I really do appreciate your input here! If you have any other thoughts/questions/comments, I'd love to hear them!

@pedromachados
Author

@drewbanin do the describe commands still look like a good way to go? I think the only issue is dealing with the 10k record limit we might run into on large schemas.

If this is the only obstacle, do you think a workaround could be to run describe first and if we get 10k records (or an error saying the query returned too many records), then run the information_schema query?
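Sketching that fallback with names reused from earlier in the thread:

    -- fast path
    show tables in schema aibi_analytics_db.dbt_pedro_sol_matching;
    -- fallback, only if the fast path returns exactly 10k rows (i.e. probably truncated)
    select table_name, table_type
    from aibi_analytics_db.information_schema.tables
    where table_schema = 'DBT_PEDRO_SOL_MATCHING';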

@drewbanin
Contributor

drewbanin commented Dec 11, 2019

Hey y'all, are any of you able to check out this (very developmental) branch? #1999

I gave this a spin on our own Snowflake instance and the results weren't super inspiring (it shaved 2s off a 12s runtime, which isn't nothing!) but I'm also not seeing multiple-minute delays for the information schema queries here either.

Please check out the description in the PR for information about the caveats of this branch. It will not work if your databases or schema names require quotes, or if you have > 10,000 tables in a single schema.

If you're able to run this locally for your dbt projects, please tell me about any performance benefit you observe in the PR comments! Let me know if you have any questions, or if you need a hand with running dbt from this branch.

@drewbanin
Contributor

Some updates:

I'm going to close this issue out in favor of more discrete, actionable issues, but I want to thank everyone for contributing on this thread! Happy to continue the discussion below if anyone has further thoughts :)

@matt-winkler
Contributor

@jtcohen6 I noticed that dbt docs generate still hits the information schema vs. using describe table on Snowflake. Is there a difference in the data we're able to return from the info schema that's essential for producing correct docs, or could we theoretically speed up docs generation in a similar manner?

@jtcohen6
Contributor

jtcohen6 commented Sep 5, 2022

@matt-winkler We can hit the information schema once per schema/database, rather than running a separate describe query for every single table that needs documenting. We could run those describe queries in parallel (up to the number of --threads), but I'd be surprised if that's faster overall than running one query per schema/database (also in parallel). Still, could be worth benchmarking, if you're up for it!
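For comparison, the once-per-schema version of the column query quoted earlier in the thread would look roughly like this:

    select table_name, column_name, data_type,
           character_maximum_length, numeric_precision, numeric_scale
    from DB_APPS.information_schema.columns
    where table_schema ilike 'test_schema'
      and table_catalog ilike 'database'
    order by table_name, ordinal_position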

Longer-term, I want to see us get away from massive batch queries run during the docs generate command, in favor of running metadata queries for dbt objects right after they've been materialized, and then exposing that metadata via structured logging. Then, coupled with a service to ingest and aggregate metadata from those logs, the docs will be "always ready," and in near-real time. We're still a little ways away from that future, but we're taking concrete steps to get there, and that's where I want to keep our focus.
