
Convert all embedded adapter SQL into macros (#1204) #1207

Merged

Conversation

@beckjake (Contributor) commented Jan 3, 2019

Note: this is based off the feature/dynamic-database-config branch from #1202
Resolves #1204

This PR converts all the SQL code embedded in various adapters into macros, assuming I didn't miss anything.

Most of this PR was a straightforward conversion from adapter methods to macros, plus handling the weird edge cases around behavioral differences between macros and executing raw SQL. For example, the get-columns-in-table query previously bypassed agate and sidestepped an irritating issue with numerics.

One thing this exposed is that it would be nice if the adapter_macro macro knew about inheritance! Having to do this on Redshift feels pretty bad:

{% macro redshift__do_thing(arg) %}
  {{ return(postgres__do_thing(arg)) }}
{% endmacro %}

The biggest novelty in this PR is almost certainly the creation and use of an 'internal manifest' which I described in its own section below.

There is some fairly convoluted handling and passing-through of model_name/connection_name (and connections themselves) to allow macros to call adapter methods that are themselves implemented by macros, in a sane way that doesn't drop temp tables and still releases connections. It might be nice to have a global (there, I said it!) 'current connection' stored in thread-local storage, or something, instead of the model_name parameter and get_connection. As it stands, it feels like manually managing memory in C.
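
As a minimal sketch of that thread-local idea (purely illustrative; none of these names are dbt's actual API):

import threading

_local = threading.local()

def set_current_connection(connection):
    # each thread tracks its own 'current' connection, so nested
    # macro -> adapter -> macro calls can find the right connection
    # without threading a model_name parameter through every call site
    _local.connection = connection

def get_current_connection():
    # returns None if this thread hasn't acquired a connection yet
    return getattr(_local, 'connection', None)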

This PR also makes a verify_database adapter method available to postgres/redshift macros, so their implementations of required macros can enforce those adapters' specific constraints (no cross-database queries are allowed).

Internal Manifest

The internal manifest is a manifest containing only the macros from dbt's 'global_project' plus any loaded adapters' macros; it gets stashed on adapters the first time they try to access it. Usually that means just one adapter, but with integration tests or when using dbt as a library, not necessarily!

The primary reason for the internal manifest to exist is that you can't parse anything that uses adapter functions until you have loaded the macros underlying those adapter functions. So any reasonable bootstrapping process will parse "internal" stuff first, then make it available when parsing user-land code.
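
In rough pseudocode, that ordering might look like this (the loader names are illustrative, not dbt's real functions):

def bootstrap(adapter, global_project, plugin_projects, user_project):
    # parse only the macro-bearing 'internal' projects first: dbt's
    # global_project plus each adapter plugin's project
    internal = load_macros_only([global_project] + list(plugin_projects))
    # stash the result on the adapter so its methods can execute macros
    adapter.set_internal_manifest(internal)
    # only now parse user-land code; models and macros that call
    # adapter methods will find the underlying macros already loaded
    return load_full_manifest(user_project, macros=internal.macros)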

Once you've already done that, there are some extra benefits you can get "for free": All adapter-level operations use the internal manifest, which avoids having to parse any external stuff, avoids accidentally overwriting anything important, and means we don't have to search in each possible namespace individually.

The ugliest part is probably the fact that the internal manifest has to be lazy-loaded to avoid adapter creation recursing forever.
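
A hedged sketch of that lazy load (load_internal_manifest stands in for the real loader):

class BaseAdapter:
    def __init__(self, config):
        self.config = config
        self._internal_manifest = None

    @property
    def internal_manifest(self):
        # built on first access rather than in __init__: building the
        # manifest loads macros, loading macros requires an adapter,
        # so an eager load in the constructor would recurse forever
        if self._internal_manifest is None:
            # load_internal_manifest is a stand-in for the real loader
            self._internal_manifest = load_internal_manifest(self.config)
        return self._internal_manifest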

Manifest construction changes

As part of the above change, I added a tiny bit of logic to avoid double-searching those macros. It also made sense to move manifest construction out of the compiler, and dependency project loading into the GraphLoader itself. I think it came out pretty nicely! Note that we will still double-parse the internal projects.

@beckjake force-pushed the feature/macros-for-everything branch from c0322f8 to 115e3d5 on January 3, 2019
@beckjake changed the title from "Convert all embedded adapter SQL into macros" to "Convert all embedded adapter SQL into macros (#1204)" on Jan 3, 2019
@drewbanin (Contributor) left a comment:

The changes in here are super exciting! Let's have a chat about the SQL we use for pg/Redshift in #1202, but that notwithstanding, this is looking pretty good imo

{% macro default__check_schema_exists(database, schema) -%}
{% call statement('check_schema_exists', fetch_result=True, auto_begin=False) -%}
select count(*)
from {{ information_schema_name(database) }}.schemata
Contributor:

I mentioned this in #1202, but these information_schema queries are worth discussing. Just flagging this so we don't miss it.

Contributor (Author):

I believe I've resolved these issues by implementing postgres/redshift versions that query pg_*. The catalog still uses information_schema, but it always has - is that something to change? The permissions implications aren't totally clear to me on that front.

Contributor:

huh, that's interesting. I think we do get more info from the information_schema tables, so I think it's probably right to continue using them here.

The catalog is solely used for docs (correct me if I'm wrong), so I guess that's ok. If a user doesn't have read access to a table, then it just won't appear in the documentation. That's not the most ideal scenario, and it's not something I think we've done intentionally, but it's also pretty reasonable IMO.

The bigger problem with information_schema queries is that dbt might fail to see a table, then try to create it rather than, e.g., dropping it first. The user will get an error message either way, but the drop error will be an obvious permissions-related error, whereas the "this already exists" error will look like a cache inconsistency problem.

Contributor (Author):

There's always this Stack Exchange post, but I would really rather not use it: https://dba.stackexchange.com/a/195721

@beckjake force-pushed the feature/macros-for-everything branch 2 times, most recently from d3f510f to def23d8 on January 4, 2019
RENAME_RELATION_MACRO_NAME = 'rename_relation'
TRUNCATE_RELATION_MACRO_NAME = 'truncate_relation'
DROP_RELATION_MACRO_NAME = 'drop_relation'
ALTER_COLUMN_TYPE_MACRO_NAME = 'alter_column_type'
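
Each of these constants names the macro that an adapter method now dispatches to instead of running embedded SQL. Roughly, and hedging on the exact signature, the dispatch looks like:

def drop_relation(self, relation, model_name=None):
    # execute the 'drop_relation' macro from the internal manifest in
    # place of embedded SQL; an adapter customizes the behavior by
    # defining its own <adapter>__drop_relation macro
    self.execute_macro(
        DROP_RELATION_MACRO_NAME,
        kwargs={'relation': relation},
        connection_name=model_name,
    )
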
Member:

It occurs to me that these would be a great starting point for building slightly lower-level, "unit" tests that could be re-used across adapters. We're missing a standard create_relation_as macro. If we had that, you could imagine a test that does, in order:

check if schema exists > drop schema > create schema > create relation as > list relations > rename relation > list relations > list columns in relation > alter column type > list columns in relation > drop relation

If that works, then you can be somewhat confident that you are ready to start building materialization logic. What do you all think? I can open an issue.
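
As a sketch of what that cross-adapter test could look like (create_relation_as and make_relation are hypothetical helpers, not a real dbt testing API):

def test_adapter_contract(adapter, database, schema):
    # the sequence proposed above, end to end
    if adapter.check_schema_exists(database, schema):
        adapter.drop_schema(database, schema)
    adapter.create_schema(database, schema)
    # create_relation_as is the missing macro noted above
    relation = create_relation_as(adapter, database, schema, 'probe',
                                  'select 1 as id')
    assert relation in adapter.list_relations_without_caching(database, schema)
    renamed = make_relation(database, schema, 'probe_renamed')
    adapter.rename_relation(relation, renamed)
    assert renamed in adapter.list_relations_without_caching(database, schema)
    columns = adapter.get_columns_in_relation(renamed)
    adapter.alter_column_type(renamed, columns[0].name, 'varchar(64)')
    assert adapter.get_columns_in_relation(renamed)  # list columns again
    adapter.drop_relation(renamed)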

Contributor (Author):

I love this idea. I wish I had it right this moment for my presto adapter.

@beckjake force-pushed the feature/dynamic-database-config branch from 44fa678 to dadab35 on January 8, 2019
@beckjake force-pushed the feature/macros-for-everything branch from def23d8 to 4458388 on January 8, 2019
Adapters now store an internal manifest that only has the dbt internal projects
Adapters use that manifest if none is provided to execute_manifest
The internal manifest is lazy-loaded to avoid recursion issues
Moved declared plugin paths down one level
Connection management changes to accommodate calling macro -> adapter -> macro
Split up precision and scale when describing number columns so agate doesn't eat commas
Manifest building now happens in the RunManager instead of the compiler

Now macros:
  create/drop schema
  get_columns_in_relation
  alter column type
  rename/drop/truncate
  list_schemas/check_schema_exists
  list_relations_without_caching
@beckjake force-pushed the feature/macros-for-everything branch from 4458388 to 2516676 on January 15, 2019
@@ -13,7 +13,7 @@

{% macro cluster_by(raw_cluster_by) %}
{%- if raw_cluster_by is not none -%}
cluster by
cluster by
Contributor:

Was the space here significant? I can see this writing out cluster byFIELD_NAME based on the whitespace trimming in this macro. Haven't tested yet, though...

Contributor:

Hmmmm actually, I just think the indentation is funky. Please ignore.

@cmcarthur (Member) commented:

@beckjake is the internal manifest here primarily for a performance benefit, or because of some other underlying issue? It sounds like a mix:

avoids accidentally overwriting anything important, and means we don't have to search in each possible namespace individually.

Each adapter gets its own package name, so it seems like there should not be any issues with macros overwriting one another between plugins... am I missing something?

Was searching multiple namespaces for the right macro taking up a lot of time?

The extra statefulness adds an extra setup step to using dbt as a library or in an alternative-entrypoint context. And that overhead primarily addresses issues with integration tests? Anyway, I'd like to strip it out if it's not giving us some clear benefit.

I do really like the idea of pre-compiling packages into a manifest. That would save us some parsing & compilation work down the line. But let's not do that yet.

@beckjake (Contributor, Author) commented Jan 15, 2019:

The reason for the internal manifest is that we need to be able to do adapter things without necessarily having a project available if we want to run all our SQL via macros. These functions have to be available while parsing models/macros that use them.

Each adapter gets its own package name, so it seems like there should not be any issues with macros overwriting one another between plugins... am I missing something?

Nope. But you have two choices when looking up a macro: you can provide a package name, or you can provide no package name and all packages will be searched. In the former case, you have to know what the package name for your adapter is, search that, then fall back to the dbt package. Ugly! In the latter case, a project-level macro named postgres__do_thing() could override a default implementation, which also seems bad!
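
To illustrate the two strategies (find_macro_by_name here is illustrative, not the real Manifest API):

def find_macro_with_package(manifest, name, adapter_package):
    # choice 1: know your adapter's package name, search it, then fall
    # back to the 'dbt' package. Ugly, as noted above.
    for package in (adapter_package, 'dbt'):
        macro = manifest.find_macro_by_name(name, package=package)
        if macro is not None:
            return macro
    return None

def find_macro_any_package(manifest, name):
    # choice 2: search every package. Now a user project that defines
    # postgres__do_thing() silently overrides the builtin.
    return manifest.find_macro_by_name(name, package=None)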

Was searching multiple namespaces for the right macro taking up a lot of time?

It was more of a complexity and risk mitigation issue than a time issue. But it's really just a side-effect/bonus of solving the actual chicken and egg issue with parsing/list_relations interactions.

The extra statefulness adds an extra setup step to using dbt as a library or in an alternative-entrypoint context. And that overhead primarily addresses issues with integration tests? Anyway, I'd like to strip it out if it's not giving us some clear benefit.

Does it? If you don't use anything that requires a macro execution you won't incur the cost, and if you do use anything that requires a macro execution you will need to pay the cost anyway. Is there an alternative suggestion that solves this general issue?

def list_relations_without_caching(self, database, schema,
                                   model_name=None):
    assert database is not None
    assert schema is not None
Member:

A couple of these snuck in -- the plan is/was to remove these, right?

@cmcarthur (Member) commented:

@beckjake I think I understand now... the primary reason for the internal manifest to exist is that you can't parse anything that uses adapter functions until you have loaded the macros underlying those adapter functions. So any reasonable bootstrapping process will parse "internal" stuff first, then make it available when parsing user-land code.

And, if we are doing the work to add the internal manifest, we might as well use it for the secondary benefits.

Does that sound right to you? Sorry that I missed that on the first pass -- the description mentioned a lot of those secondary benefits, but it wasn't clear that this was a total necessity on first reading / without some deeper thought.

@beckjake (Contributor, Author) commented:

@cmcarthur exactly! Sorry, I guess I have been living in my own little already-solved world since I implemented this. I'll update the PR description so it's clearer.

@beckjake force-pushed the feature/macros-for-everything branch 8 times, most recently from d673fa6 to 287fca3 on January 16, 2019
Postgres/Redshift permissions issues:
 - use pg_views/pg_tables to list relations
 - use pg_namespace to list schemas and check schema existence
Catalog:
 - Redshift: fix error, add database check
 - Snowflake: Use the information_schema_name macro
Snowflake/default: Quote the database properly in information_schema_name
@beckjake force-pushed the feature/macros-for-everything branch from 287fca3 to 42deae8 on January 16, 2019
@beckjake force-pushed the feature/macros-for-everything branch from 42deae8 to 1596174 on January 16, 2019
@beckjake merged commit c80792d into feature/dynamic-database-config on Jan 16, 2019
@beckjake deleted the feature/macros-for-everything branch on January 16, 2019