Full refresh of incremental tables locks the table during build #2426

Closed · drkarthi opened this issue May 11, 2020 · 16 comments · Fixed by #2998

Labels: enhancement (New feature or request), good_first_issue (Straightforward + self-contained changes, good for new contributors!), redshift

drkarthi (Contributor) commented May 11, 2020

Describe the bug

When running a full refresh of an incremental table, the table is locked while the new table is being built. For large tables in production, this lock lasts several minutes.

Steps To Reproduce

  1. Run dbt with full refresh on an incremental table that takes a few minutes to build.
  2. Query the table while the full refresh is in progress.

Expected behavior

Expected the original table to remain available for querying while the temporary table is being built. The table should be locked only while moving the contents from the temporary table to the original table.

Below is a prototype where only the delete from target table and the insert into the target table are part of the transaction. This locks the table for a much smaller duration:

{% call statement(auto_begin=True) %}
    delete from {{ target_table }};
    insert into {{ target_table }} select * from {{ temporary_table }};
    commit;
{% endcall %}
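
For completeness, a sketch of how the temporary table itself could be built outside that transaction, so the target stays readable during the slow step. This is illustrative only: temporary_table is a hypothetical name, and sql stands for the model's compiled query.

{% set temporary_table = target_table ~ "__dbt_tmp" %}

{# the slow build runs outside the transaction; only the quick
   delete + insert above ever hold a lock on the target #}
{% do run_query("create table " ~ temporary_table ~ " as (" ~ sql ~ ")") %}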

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.16.0
   latest version: 0.16.1

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

The operating system you're using:
macOS Catalina 10.15.4

The output of python --version:
Python 3.6.3

Additional context

When I checked in Slack, I was told this may be a solved problem. However, I do notice a significant difference in query time between the two approaches. Let me know if I am missing something.

drkarthi added the bug (Something isn't working) and triage labels on May 11, 2020
jtcohen6 (Contributor) commented May 11, 2020

I believe the relevant bit of code is here:
https://github.com/fishtown-analytics/dbt/blob/8686ab9a9ddffb63b64d3ee2861f132bf77f54c1/core/dbt/include/global_project/macros/materializations/incremental/incremental.sql#L20-L27

I think the cause of the delay you're seeing is that dbt doesn't use the same approach here that it does in the table materialization.

Here's the table order of operations (sketched in SQL after this list):

  1. Create the new object as an intermediate_relation
  2. Rename target_relation --> backup_relation
  3. Rename intermediate_relation --> target_relation
  4. Close the transaction
  5. Drop the backup_relation
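
A minimal SQL sketch of that flow, assuming a Postgres/Redshift-style adapter; the relation names and model SQL are illustrative:

begin;
create table "analytics"."my_model__dbt_tmp" as (select 1 as id);    -- 1. build the intermediate (the model's SQL goes here)
alter table "analytics"."my_model" rename to "my_model__dbt_backup"; -- 2. target -> backup
alter table "analytics"."my_model__dbt_tmp" rename to "my_model";    -- 3. intermediate -> target
commit;                                                              -- 4. close the transaction
drop table "analytics"."my_model__dbt_backup";                       -- 5. drop the backup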

The full-refresh mode of the incremental materialization instead does (again sketched below):

  1. Rename target_relation --> backup_relation
  2. Create the new object as target_relation
  3. Close the transaction
  4. Drop the backup_relation
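
The same flow in SQL. The slow step now runs after the rename, and on Postgres/Redshift the rename takes an exclusive lock that is held until commit, which would explain the long lock reported above:

begin;
alter table "analytics"."my_model" rename to "my_model__dbt_backup"; -- 1. takes an exclusive lock
create table "analytics"."my_model" as (select 1 as id);             -- 2. slow build; the lock is still held
commit;                                                              -- 3. lock released only here
drop table "analytics"."my_model__dbt_backup";                       -- 4. drop the backup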

It's worth saying that, in much older versions of dbt, the way the incremental materialization executed full-refresh runs was by... dropping the preexisting table before even opening the transaction to recreate it. So this is definitely better than that, if still not as good as it could be.

@drewbanin Do you know a good reason for the difference between the materializations? Maybe this is something we could think about for broader materialization work in 0.18.0?

jtcohen6 added the redshift and enhancement labels and removed the triage and bug labels on May 11, 2020
drewbanin (Contributor) commented:

hey @jtcohen6 - I can't think of a good reason why incrementals would work differently than tables on redshift. I'd be very supportive of changing the atomic swap flow for 0.18.0!

drewbanin added this to the Marian Anderson milestone on May 12, 2020
drkarthi (Contributor, Author) commented:

That's great! I am currently working on a custom materialization for our project using the pattern described in @jtcohen6's response. I would be happy to contribute it if that works!

drkarthi (Contributor, Author) commented May 18, 2020

As discussed above, switching to the table order of operations removed the lock on the table during the build and resolved our issue. My change is a custom materialization that replaces the following lines:

https://github.com/fishtown-analytics/dbt/blob/1d543b373734ab78e0ebf441888fb7d7c7d6e8fd/core/dbt/include/global_project/macros/materializations/incremental/incremental.sql#L25-L26

to something like:

{% set tmp_identifier = this.identifier ~ "__dbt_tmp" %}
{% set tmp_relation = api.Relation.create(identifier=tmp_identifier,
                                          schema=this.schema,
                                          database=this.database,
                                          type='table') %}

{% do run_query(create_table_as(False, tmp_relation, sql)) %}
{% do adapter.rename_relation(target_relation, backup_relation) %}
{% do adapter.rename_relation(tmp_relation, target_relation) %}
{# define build_sql so that the materialization does not fail #}
{% set build_sql = "select 1;" %}

Let me know if it would be helpful to contribute this in a PR, or if it would be better for the team to make the change since it touches important functionality. There are also definitely improvements to be made, like removing the dummy SQL statement.

drewbanin (Contributor) commented:

@drkarthi sure thing - please do feel free to PR a change for this against dev/marian-anderson. If you have any questions about how to structure the code to remove the dummy sql statement, we'd be happy to take a look at the diff and send through some ideas :)

jtcohen6 added the good_first_issue label on Jul 6, 2020
jtcohen6 removed this from the Marian Anderson milestone on Jul 6, 2020
TAJD commented Jan 6, 2021

Has this made it into a release yet? I'm experiencing this issue with Postgres.

kwigley (Contributor) commented Jan 6, 2021

@TAJD This issue is still open and is not slated for a release (or 0.19 specifically) at the moment. Drew's comment above still applies though! We are happy to take a look at a PR contribution.

TAJD commented Jan 6, 2021

> @TAJD This issue is still open and is not slated for a release (or 0.19 specifically) at the moment. Drew's comment above still applies though! We are happy to take a look at a PR contribution.

Thanks for clarifying! I'll take a look.

drkarthi (Contributor, Author) commented Jan 8, 2021

Just saw that a commit has already been pushed; I'm closing my PR and will follow the changes there :)

drkarthi (Contributor, Author) commented Jan 8, 2021

One additional thing I tried was removing the dummy SQL statement {% set build_sql = "select 1;" %} and moving the call statement into the else block. It leads to the following error:

Completed with 1 error and 0 warnings:

'NoneType' object has no attribute 'status'

Here is the related code change:

{% elif existing_relation.is_view or should_full_refresh() %}
    {#-- Make sure the backup doesn't exist so we don't encounter issues with the rename below #}
    {% set backup_identifier = existing_relation.identifier ~ "__dbt_backup" %}
    {% set backup_relation = existing_relation.incorporate(path={"identifier": backup_identifier}) %}
    {% do adapter.drop_relation(backup_relation) %}

    {% set tmp_identifier = this.identifier ~ "__dbt_tmp" %}
    {% set tmp_relation = api.Relation.create(identifier=tmp_identifier,
                                              schema=this.schema,
                                              database=this.database,
                                              type='table') %}

    {% do run_query(create_table_as(False, tmp_relation, sql)) %}
    {% do adapter.rename_relation(target_relation, backup_relation) %}
    {% do adapter.rename_relation(tmp_relation, target_relation) %}
    {% do to_drop.append(backup_relation) %}
{% else %}
    {% set tmp_relation = make_temp_relation(target_relation) %}
    {% do run_query(create_table_as(True, tmp_relation, sql)) %}
    {% do adapter.expand_target_column_types(
           from_relation=tmp_relation,
           to_relation=target_relation) %}
    {% set build_sql = incremental_upsert(tmp_relation, target_relation, unique_key=unique_key) %}
    {% call statement("main") %}
      {{ build_sql }}
    {% endcall %}
{% endif %}
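
My guess at the cause, not confirmed against the dbt source: the runner appears to look up a statement result named "main" after the materialization finishes, so when the full-refresh branch never runs statement("main"), the lookup returns None and reading .status fails. If that's right, the dummy select could be dropped by routing the create through the named statement in the full-refresh branch:

{% call statement("main") %}
    {{ create_table_as(False, tmp_relation, sql) }}
{% endcall %}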

TAJD commented Jan 11, 2021

Thanks @drkarthi. The existing test suite passes with the modification, but I'm not sure how to write a test that confirms this behaviour.

My thoughts are to either confirm that various statements exist in the compiled SQL or to run a check mid-build. I'm going to look through the existing dbt test suite for ideas - happy to receive input in the meantime!

A workaround that solved the use case in my situation was to write a macro that concurrently refreshes a materialized view, allowing zero downtime (our production, testing, and analytics databases are the same thing ;)).
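
For anyone wanting the same workaround, a minimal sketch of such a macro, assuming Postgres; the macro name is made up, and refresh ... concurrently requires a unique index on the materialized view:

{% macro refresh_matview_concurrently(matview) %}
    {# readers keep seeing the old contents while the refresh runs #}
    {% do run_query("refresh materialized view concurrently " ~ matview) %}
{% endmacro %}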

drkarthi (Contributor, Author) commented Jan 11, 2021

@TAJD I tested the changes by checking mid-build and they were working. I don't know of a way to formally test them.

drkarthi (Contributor, Author) commented:

@TAJD have you had a chance to look at this? Would it be useful if I reopened my PR so we can wait for feedback from the maintainers?

TAJD commented Jan 27, 2021

@drkarthi please do - I'm sorry I lost focus on this after I got a fix in production. I ended up writing a macro to persist data in a separate table. Thanks for your effort!

Limess commented Apr 9, 2021

Hey @drkarthi, it looks like your PR had a small amount of feedback but was close to being ready to be merged. Any chance you could take a look? This issue is affecting us fairly regularly and it'd be great to see it resolved.

drkarthi (Contributor, Author) commented Apr 9, 2021

Sorry @Limess, I had lost track of this. I will address the PR comments this weekend.

jtcohen6 added this to the Margaret Mead milestone on Apr 13, 2021