Postgres: ability to create indexes #3106
Conversation
Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
Nice start here @arzavj, and thanks for taking so much into consideration for the first go!
Let's figure out the right handoff points between `create_indexes` and the materialization code. I don't think it makes sense to pass back a semicolon-delimited string and append it to `post_hooks`. I'm drawing inspiration from `persist_docs`, which feels like clear, compelling, and quite similar functionality.
Once we're settled on the syntax, we'll definitely need to add tests for the various permutations.
- Also note that I'm assuming snakecased column and table names.
I think you're good! The model name, if quoted and weirdly cased, may "just work" because of how dbt prints `{{ relation }}`. If the columns are weirdly quoted/cased, the user will have to account for that when defining the `indexes` config.
@jtcohen6 please take a look at the updated PR based on the comments above. Some open questions I have:
Nice work @arzavj, this is really coming along!
- You mentioned "avoid any operation unless the model both has an index defined and is running on an adapter with an implementation of `create_indexes`". Is the approach I've taken correct? How does one check to see whether an adapter has implemented `get_create_index_sql`?
If an adapter has implemented `get_create_index_sql`, dbt will use its version; otherwise, dbt will use `default__get_create_index_sql`.
As a feature, indexes are unique to Postgres among the "core four" databases, but they're pretty common out in the world of databases. That's why I think we should implement `create_indexes` in such a way that:
- By default, it does nothing
- At the same time, any database-agnostic plumbing is reusable (= included in the default implementation)
- Any PostgreSQL-specific components are wrapped in the `postgres__` implementation
The best way to accomplish this, to my mind, was two dispatch macros:
- `create_indexes`: Called by the materialization. Loops over indexes, calls `get_create_index_sql` for each. If SQL is returned, call `run_query()` for it; if nothing is returned, do nothing. All of this can be part of `default__create_indexes`; it's unlikely someone would need to change/override this, but you never know.
- `get_create_index_sql`: Called by `create_indexes`, one index at a time. Returns the database-specific SQL for creating that index. `default__get_create_index_sql` should return `None` (most databases don't have indexes), instead of raising an exception (what you have now). `postgres__get_create_index_sql` should return the appropriate PostgreSQL (exactly as you have it now). An Oracle, SQLServer, or Materialize user out in the world could reimplement just `get_create_index_sql` with the right SQL, and still benefit from all of the default plumbing in `create_indexes`. (They'd also need to define a custom `IndexConfig` in the python portion of the adapter, but that's a different story.)
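A minimal sketch of what those two dispatch macros could look like, going off the description above. The macro names, signatures, and index-name scheme here are assumptions from this thread, not the exact code merged in the PR:

```sql
{# Sketch only: names and signatures inferred from the discussion, not copied from the PR #}

{% macro create_indexes(relation) -%}
  {{ adapter.dispatch('create_indexes')(relation) }}
{%- endmacro %}

{% macro default__create_indexes(relation) -%}
  {# Loop over the model's `indexes` config, one dict per index #}
  {%- for index_dict in (config.get('indexes') or []) -%}
    {%- set create_sql = get_create_index_sql(relation, index_dict) -%}
    {%- if create_sql -%}
      {%- do run_query(create_sql) -%}
    {%- endif -%}
  {%- endfor -%}
{%- endmacro %}

{% macro get_create_index_sql(relation, index_dict) -%}
  {{ adapter.dispatch('get_create_index_sql')(relation, index_dict) }}
{%- endmacro %}

{% macro default__get_create_index_sql(relation, index_dict) -%}
  {# Most adapters have no index concept: return nothing instead of raising #}
  {{ return(None) }}
{%- endmacro %}

{% macro postgres__get_create_index_sql(relation, index_dict) -%}
  create {% if index_dict.get('unique') %}unique {% endif %}index if not exists
    "{{ relation.name }}_{{ index_dict['columns'] | join('_') }}"
    on {{ relation }}
    {% if index_dict.get('type') %}using {{ index_dict['type'] }}{% endif %}
    ({{ index_dict['columns'] | join(', ') }})
{%- endmacro %}
```

With this shape, an adapter that wants indexes only has to override `get_create_index_sql`; the looping and `run_query` plumbing in `default__create_indexes` is shared.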
- You mentioned updating the incremental materialization as well. Do we also need to take care of the seed and snapshot materializations?
Yes!! Really good point, thank you for catching that. The `seed` materialization logic will look a lot like the incremental (I think `{% if full_refresh_mode or not exists_as_table %}`). I think the `snapshot` materialization logic can be even simpler (`{% if not target_relation_exists %}`), since there's no such thing as full-refreshing.
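For illustration, those guards could sit in the materializations roughly like this. `full_refresh_mode`, `exists_as_table`, `target_relation_exists`, and `target_relation` are the names used in this thread; treat this as a sketch, not the merged code:

```sql
{# Sketch: in the incremental/seed materializations, only (re)create
   indexes when the table itself is (re)built #}
{% if full_refresh_mode or not exists_as_table %}
  {% do create_indexes(target_relation) %}
{% endif %}

{# Sketch: in the snapshot materialization, simpler, since snapshots
   have no full-refresh mode #}
{% if not target_relation_exists %}
  {% do create_indexes(target_relation) %}
{% endif %}
```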
- If I do the previous point, I could add validation to ensure that `type` is one of `[btree, hash, gist, gin]`. Do you recommend that, or is it just better to let the query run and possibly fail with an invalid type?
See comment below. I'll fully admit I'm not a Postgres pro; depending on how many index types there are, how customizable they are, and how frequently they change, it may make sense to validate in python (set in stone), in Jinja (standardized but can be overridden), or not at all.
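If Jinja-level validation were the route taken, it might look something like this hypothetical macro. The macro name and the hard-coded list are illustrative only; note that Postgres also ships other access methods (e.g. spgist and brin), which is exactly the maintenance concern raised above:

```sql
{# Hypothetical: validate the index `type` before generating SQL.
   The type list below is from this thread and is not exhaustive. #}
{% macro postgres__validate_index_type(index_type) -%}
  {%- set valid_types = ['btree', 'hash', 'gist', 'gin'] -%}
  {%- if index_type is not none and index_type | lower not in valid_types -%}
    {{ exceptions.raise_compiler_error(
        "Invalid index type '" ~ index_type ~ "'; expected one of: " ~ valid_types | join(', ')
    ) }}
  {%- endif -%}
{%- endmacro %}
```

Because it can be overridden like any other macro, this sits in the middle of the spectrum described above: standardized, but not set in stone.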
@jtcohen6 I've responded to all the comments and made sure that the existing tests still pass! I think it's time to add tests for the code added in this PR.
Brilliant, the code is looking really good! I think a new integration test is appropriate. Each of those can be something silly, and maybe also some failure cases.
When you get a chance, could you also sign the CLA, so I can make sure we'll be able to merge once ready?
Apologies for the delay on this @jtcohen6; I'm on vacation and will be back on the 29th. Will get to this then!
@jtcohen6 Please take a look at the tests I added. I also updated the changelog. I haven't added any documentation or comments. Please let me know if I should.
A couple more things:
@arzavj The tests are looking great! I haven't had a chance to do a deep dive, but I'm really happy to see everything you've done here. Quick things:
You were right! Fixed the author of that culprit commit and also rebased on the latest develop.
hey @jtcohen6! Bumping this up on your radar :)
@arzavj Sorry for the delay!
I love how this feature "just works," from the unique/type properties down to the guaranteed-unique index name. And the tests look amazing. Thank you so much for this contribution, and for all your thorough work here over the past few months!
This will be released in v0.20.0 :)
Thank you for the kind words @jtcohen6! Really appreciate all your help on this PR! Excited to update my code to use this new feature once v0.20.0 is released :)
resolves #804
Description
Rough first pass at addressing the issue. Would love to get some thoughts on whether I'm on the right track. Would also love to get some help on adding a unit and integration test so that I can make sure that this code works.
I'm imagining that usage will look like:
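A rough sketch of the intended model config, with the key names `columns`, `unique`, and `type` assumed from the discussion in this thread:

```sql
-- models/my_model.sql: hypothetical usage sketch; key names assumed from this thread
{{
  config(
    materialized = 'table',
    indexes = [
      {'columns': ['column_a']},
      {'columns': ['column_b'], 'type': 'hash'},
      {'columns': ['column_a', 'column_b'], 'unique': true}
    ]
  )
}}

select ...
```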
This PR gives developers the ability to:
- `create_table_as`

One known issue: with two indexes such as `['column_a', 'column_b']` and `['column_a', 'column_c']`, Postgres truncates the final index name to fit a certain max length, so in this example Postgres might drop the second column from the index name, causing both index names to be the same (and hence only one of them is created). Certain conditions need to be met in order to run into this problem (i.e. shared initial columns and possibly long column names).

Note that this does not support gin indexes or indexes on expressions of columns like `lower(column_a)`.
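The collision above comes from Postgres truncating identifiers longer than 63 bytes (`NAMEDATALEN - 1`). A standalone illustration, independent of dbt:

```sql
create table t (column_a int, column_b int, column_c int);

-- Both names below share their first 63 bytes; Postgres truncates each to
-- the same identifier (with a NOTICE), so the second statement fails with
-- "relation ... already exists".
create index idx_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_b
  on t (column_a, column_b);
create index idx_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa_c
  on t (column_a, column_c);
```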
Also note that I'm assuming snakecased column and table names.
Checklist
- I have updated `CHANGELOG.md` and added information about my change to the "dbt next" section.