Postgres: ability to create indexes #3106

arzavj · 2021-02-16T05:51:04Z

resolves #804

Description

Rough first pass at addressing the issue. Would love to get some thoughts on whether I'm on the right track. Would also love to get some help on adding a unit and integration test so that I can make sure that this code works.

I'm imaging that usage will look like:

{{ config(
    indexes=[
      { 'columns': ['column_a', 'column_b'], 'unique': false },
      { 'columns': ['column_a', 'column_c'], 'unique': true },
    ]
) }}

This PR gives developers the ability to:

create multiple indexes inside the same transaction as the create_table_as
create a mix of unique and non-unique indexes
create indexes on any number of columns
address the issue where indexes aren't recreated (as discussed in this discourse) by prepending the index name with a timestamp; I choose to prepend instead of append because if the column names are long or there are a number of columns in the index then postgres might truncate the index name and drop the timestamp
address the issue where sometimes only one index out of a few indexes is created because the indexes share the same first few columns -- for example, ['column_a', 'column_b'] and ['column_a', 'column_c']; what happens is that postgres truncates the final index name to fit a certain max length; so in this example postgres might drop the second column from the index name causing both index names to be the same (and hence only one of them is created); of course certain conditions need to be met in order to run into this problem (i.e. shared initial columns and possibly long column names)

Note that this does not support gin indexes and indexes on expressions of columns like lower(column_a)
Also note that I'm assuming snakecased column and table names.

Checklist

I have signed the CLA
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

cla-bot · 2021-02-16T05:51:06Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email [email protected]
Make sure that the git commit email is configured in your GitHub account settings, see https:/settings/emails

plugins/postgres/dbt/include/postgres/macros/adapters.sql

jtcohen6

Nice start here @arzavj, and thanks for taking so much into consideration for the first go!

Let's figure out the right handoff points between create_indexes and the materialization code. I don't think it makes sense to pass back a semicolon-delimited string and append it to post_hooks. I'm drawing inspiration from persist_docs, which feels like clear, compelling, and quite similar functionality.

Once we're settled on the syntax, we'll definitely need to add tests for the various permutations.

Also note that I'm assuming snakecased column and table names.

I think you're good! The model name, if quoted and weirdly cased, may "just work" because of how dbt prints {{ relation }}. If the columns are weirdly quoted/cased, the user will have to account for that when defining the indexes config.

core/dbt/include/global_project/macros/adapters/common.sql

plugins/postgres/dbt/include/postgres/macros/adapters.sql

core/dbt/include/global_project/macros/materializations/table/table.sql

cla-bot · 2021-03-01T00:25:10Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email [email protected]
Make sure that the git commit email is configured in your GitHub account settings, see https:/settings/emails

cla-bot · 2021-03-01T00:26:35Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email [email protected]
Make sure that the git commit email is configured in your GitHub account settings, see https:/settings/emails

arzavj · 2021-03-01T00:35:22Z

@jtcohen6 please take a look at the updated PR based on the comments above. Some open questions I have:

You mentioned "avoid any operation unless the model both has an index defined and is running on an adapter with an implementation of create_indexes". Is the approach I've take correct? How does one check to see whether an adapter has implemented get_create_index_sql?
You mentioned updating the incremental materialization as well. Do we also need to take care of the seed and snapshot materializations?
If I do the previous point, I could add validation to ensure that type is one of [btree, hash, gist, gin]. Do you recommend that or is it just better to let the query run and possibly fail with an invalid type?

plugins/postgres/dbt/include/postgres/macros/adapters.sql

core/dbt/include/global_project/macros/adapters/common.sql

core/dbt/include/global_project/macros/materializations/table/table.sql

jtcohen6

Nice work @arzavj, this is really coming along!

You mentioned "avoid any operation unless the model both has an index defined and is running on an adapter with an implementation of create_indexes". Is the approach I've take correct? How does one check to see whether an adapter has implemented get_create_index_sql?

If an adapter has implemented get_create_index_sql, dbt will use its version; otherwise, dbt will use default__get_create_index_sql.

As a feature, indexes are unique to Postgres among the "core four" databases, but they're pretty common out in the world of databases. That's why I think we should implement create_indexes in such a way that:

By default, it does nothing
At the same time, any database-agnostic plumbing is reusable (= included in the default implementation)
Any PostgreSQL-specific components are wrapped in the postgres__ implementation

The best way to accomplish this, to my mind, was two dispatch macros:

create_indexes: Called by materialization. Loops over indexes, calls get_create_index_sql for each. If SQL is returned, call run_query() for it; if nothing is returned, do nothing. All of this can be part of default__create_indexes; it's unlikely someone would need to change/override this, but you never know.
get_create_index_sql: Called by create_indexes, one index at a time. Returns the database-specific SQL for creating that index. default__get_create_index_sql should return None (most databases don't have indexes), instead of raising an exception (what you have now). postgres__get_create_index_sql should return the appropriate PostgreSQL (exactly as you have it now). An Oracle, SQLServer, or Materialize user out in the world could reimplement just get_create_index_sql with the right SQL, and still benefit from all of the default plumbing in create_indexes. (They'd also need to define a custom IndexConfig in the python portion of the adapter, but that's a different story.)

You mentioned updating the incremental materialization as well. Do we also need to take care of the seed and snapshot materializations?

Yes!! Really good point, thank you for catching that. The seed materialization logic will look a lot like the incremental (I think {% if full_refresh_mode or not exists_as_table %}). I think the snapshot materialization logic can be even simpler ({% if not target_relation_exists %}), since there's no such thing as full-refreshing.

If I do the previous point, I could add validation to ensure that type is one of [btree, hash, gist, gin]. Do you recommend that or is it just better to let the query run and possibly fail with an invalid type?

See comment below. I'll fully admit I'm not a Postgres pro; depending on how many index types there are, how customizable they are, and how frequently they change, it may make sense to validate in python (set in stone), in Jinja (standardized but can be overridden), or not at all.

plugins/postgres/dbt/adapters/postgres/impl.py

core/dbt/include/global_project/macros/materializations/table/table.sql

plugins/postgres/dbt/include/postgres/macros/adapters.sql

core/dbt/include/global_project/macros/adapters/common.sql

cla-bot · 2021-03-12T05:24:57Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email [email protected]
Make sure that the git commit email is configured in your GitHub account settings, see https:/settings/emails

cla-bot · 2021-03-12T05:54:29Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email [email protected]
Make sure that the git commit email is configured in your GitHub account settings, see https:/settings/emails

arzavj · 2021-03-12T06:16:05Z

@jtcohen6 I've responded to all the comments and made sure that the existing tests still pass! I think it's time to add tests for the code added in this PR.

jtcohen6 · 2021-03-16T11:58:10Z

Brilliant, the code is looking really good!

I think a new integration test is appropriate, 065_postgres_tests/test_indexes.py, that features:

a table model with indexes (run once)
an incremental model with indexes (run, run again, run --full-refresh)
a seed with indexes (seed, seed again, seed --full-refresh)
a snapshot with indexes (snapshot twice)

Each of those can be something silly, select 1 as fun or similar, as long as the defined indexes are valid.

And maybe also some failure cases:

Invalid index config (assert it raises the exceptions you've defined)
Index with a non-existing type (assert it fails / returns a database error)

When you get a chance, could you also sign the CLA, so I can make sure we'll be able to merge once ready?

arzavj · 2021-03-25T16:56:41Z

Apologies for the delay on this @jtcohen6; I'm on vacation and will be back on the 29th. Will get to this then!

cla-bot · 2021-04-06T02:40:12Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email [email protected]
Make sure that the git commit email is configured in your GitHub account settings, see https:/settings/emails

cla-bot · 2021-04-11T02:40:53Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email [email protected]
Make sure that the git commit email is configured in your GitHub account settings, see https:/settings/emails

cla-bot · 2021-04-11T23:52:38Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email [email protected]
Make sure that the git commit email is configured in your GitHub account settings, see https:/settings/emails

cla-bot · 2021-04-11T23:56:59Z

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Arzav Jain.
This is most likely caused by a git client misconfiguration; please make sure to:

check if your git client is configured with an email to sign commits git config --list | grep email
If not, set it up using git config --global user.email [email protected]
Make sure that the git commit email is configured in your GitHub account settings, see https:/settings/emails

arzavj · 2021-04-12T00:05:40Z

@jtcohen6 Please take a look at the tests I added. I also updated the changelog. I haven't added any documentation or comments. Please let me know if I should.

arzavj · 2021-04-12T00:22:30Z

A couple more things:

The unit tests are failing however when I run the same command python -m pytest -v test/unit locally they pass (with 1 skip). Weirdly though when I run make test I get ERROR: InterpreterNotFound: python3.8; not sure why it's trying to use python 3.8 when I'm on python 3.7.10.
I'm struggling to figure out why the cla-bot is unable to recognize me. My git config has the correct email address as output by git config --list | grep email. This matches my primary Github email address.

jtcohen6 · 2021-04-12T15:13:37Z

@arzavj The tests are looking great! I haven't had a chance to do a deep dive, but I'm really happy to see everything you've done here.

Quick things:

That Windows unit test has been failing on all PRs, and it's on us to fix it (Windows unit test consistently (or frequently?) failing #3233). In the meantime, we just merged a PR that disabled it for now, so if you pull from develop it should be skipped on the next one.
I'm not positive, but think the issue is that one of your commits (2e5ff40) is co-authored by your GitHub account, and by... also your GitHub account? I'm not sure:

arzavj · 2021-04-12T16:40:23Z

You were right! Fixed the author of that culprit commit and also rebased on the latest develop.

arzavj · 2021-04-21T16:17:05Z

hey @jtcohen6! Bumping this up on your radar :)

jtcohen6

@arzavj Sorry for the delay!

I love how this feature "just works," from the unique/type properties down to the guaranteed-unique index name. And the tests look amazing. Thank you so much for this contribution, and for all your thorough work here over the past few months!

This will be released in v0.20.0 :)

arzavj · 2021-04-27T15:23:29Z

Thank you for the kind words @jtcohen6! Really appreciate all your help on this PR! Excited to update my code to use this new feature once v0.20.0 is released :)

arzavj marked this pull request as draft February 16, 2021 05:54

brylie reviewed Feb 17, 2021

View reviewed changes

plugins/postgres/dbt/include/postgres/macros/adapters.sql Outdated Show resolved Hide resolved

jtcohen6 mentioned this pull request Feb 22, 2021

Support for postgres index creation? #804

Closed

jtcohen6 reviewed Feb 22, 2021

View reviewed changes

arzavj commented Mar 1, 2021

View reviewed changes

plugins/postgres/dbt/include/postgres/macros/adapters.sql Outdated Show resolved Hide resolved

arzavj commented Mar 1, 2021

View reviewed changes

core/dbt/include/global_project/macros/adapters/common.sql Outdated Show resolved Hide resolved

arzavj commented Mar 1, 2021

View reviewed changes

core/dbt/include/global_project/macros/adapters/common.sql Show resolved Hide resolved

arzavj commented Mar 1, 2021

View reviewed changes

core/dbt/include/global_project/macros/materializations/table/table.sql Show resolved Hide resolved

jtcohen6 reviewed Mar 4, 2021

View reviewed changes

arzavj force-pushed the postgres_create_indexes branch from 6db8945 to f56f521 Compare April 6, 2021 02:40

arzavj marked this pull request as ready for review April 12, 2021 00:06

arzavj added 3 commits April 12, 2021 09:37

first pass

be70b1a

Respond to pr comments; approach inspired by persist_docs()

40c02d2

remove no longer used truncate_string macro

df7cc05

arzavj and others added 5 commits April 12, 2021 09:37

respond to pr comments

118973c

fix style to get unit tests to pass

4f0968d

basic tests for table materialization

aa65b01

more tests

3f78bb7

Updated Changelog.md

cacbd1c

arzavj force-pushed the postgres_create_indexes branch from 624cf1f to cacbd1c Compare April 12, 2021 16:39

cla-bot bot added the cla:yes label Apr 12, 2021

Merge branch 'develop' into postgres_create_indexes

5097975

jtcohen6 approved these changes Apr 27, 2021

View reviewed changes

jtcohen6 merged commit a260d4e into dbt-labs:develop Apr 27, 2021

jtcohen6 mentioned this pull request Apr 27, 2021

reference schema name when creating index dbt-labs/postgres#3

Open

dataders mentioned this pull request Oct 5, 2021

adopt dbt-postgres's method of index creation dbt-msft/dbt-sqlserver#163

Open

cleare-cl mentioned this pull request Mar 1, 2022

[CT-306] [Bug] Indexes config does not create indexes #4808

Closed

1 task

morsapaes mentioned this pull request Apr 26, 2022

dbt-materialize: allow additional name param in index config MaterializeInc/materialize#12030

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Postgres: ability to create indexes #3106

Postgres: ability to create indexes #3106

arzavj commented Feb 16, 2021 •

edited

Loading

cla-bot bot commented Feb 16, 2021

jtcohen6 left a comment •

edited

Loading

cla-bot bot commented Mar 1, 2021

cla-bot bot commented Mar 1, 2021

arzavj commented Mar 1, 2021 •

edited

Loading

jtcohen6 left a comment

cla-bot bot commented Mar 12, 2021

cla-bot bot commented Mar 12, 2021

arzavj commented Mar 12, 2021 •

edited

Loading

jtcohen6 commented Mar 16, 2021 •

edited

Loading

arzavj commented Mar 25, 2021

cla-bot bot commented Apr 6, 2021

cla-bot bot commented Apr 11, 2021

cla-bot bot commented Apr 11, 2021

cla-bot bot commented Apr 11, 2021

arzavj commented Apr 12, 2021

arzavj commented Apr 12, 2021

jtcohen6 commented Apr 12, 2021

arzavj commented Apr 12, 2021

arzavj commented Apr 21, 2021

jtcohen6 left a comment

arzavj commented Apr 27, 2021

Postgres: ability to create indexes #3106

Postgres: ability to create indexes #3106

Conversation

arzavj commented Feb 16, 2021 • edited Loading

Description

Checklist

cla-bot bot commented Feb 16, 2021

jtcohen6 left a comment • edited Loading

Choose a reason for hiding this comment

cla-bot bot commented Mar 1, 2021

cla-bot bot commented Mar 1, 2021

arzavj commented Mar 1, 2021 • edited Loading

jtcohen6 left a comment

Choose a reason for hiding this comment

cla-bot bot commented Mar 12, 2021

cla-bot bot commented Mar 12, 2021

arzavj commented Mar 12, 2021 • edited Loading

jtcohen6 commented Mar 16, 2021 • edited Loading

arzavj commented Mar 25, 2021

cla-bot bot commented Apr 6, 2021

cla-bot bot commented Apr 11, 2021

cla-bot bot commented Apr 11, 2021

cla-bot bot commented Apr 11, 2021

arzavj commented Apr 12, 2021

arzavj commented Apr 12, 2021

jtcohen6 commented Apr 12, 2021

arzavj commented Apr 12, 2021

arzavj commented Apr 21, 2021

jtcohen6 left a comment

Choose a reason for hiding this comment

arzavj commented Apr 27, 2021

arzavj commented Feb 16, 2021 •

edited

Loading

jtcohen6 left a comment •

edited

Loading

arzavj commented Mar 1, 2021 •

edited

Loading

arzavj commented Mar 12, 2021 •

edited

Loading

jtcohen6 commented Mar 16, 2021 •

edited

Loading