Full refresh of incremental tables locks the table during build #2426

Closed · drkarthi opened this issue May 11, 2020 · 16 comments · Fixed by #2998

Labels: enhancement (New feature or request), good_first_issue (Straightforward + self-contained changes, good for new contributors!), redshift

drkarthi (Contributor) commented May 11, 2020

Describe the bug

When running a full refresh of an incremental table, the table is locked while the new table is being built. For large tables in production, this lock lasts several minutes.

Steps To Reproduce

  1. Run dbt with full refresh on an incremental table that takes a few minutes to build.
  2. Query the table while the full refresh is in progress.

Expected behavior

Expected the original table to remain available for querying while the temporary table is being built. The table should be locked only while moving the contents from the temporary table to the original table.

Below is a prototype where only the delete from target table and the insert into the target table are part of the transaction. This locks the table for a much smaller duration:

{% call statement(auto_begin=True) %}
    delete from {{ target_table }};
    insert into {{ target_table }} select * from {{ temporary_table }};
    commit;
{% endcall %}
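
For completeness, a sketch of how the temporary table itself could be built outside that transaction, so the target stays readable during the slow step. This is illustrative only: temporary_table is a hypothetical name, and sql stands for the model's compiled query.

{% set temporary_table = target_table ~ "__dbt_tmp" %}

{# the slow build runs outside the transaction; only the quick
   delete + insert above ever hold a lock on the target #}
{% do run_query("create table " ~ temporary_table ~ " as (" ~ sql ~ ")") %}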

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.16.0
   latest version: 0.16.1

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

The operating system you're using:
macOS Catalina 10.15.4

The output of python --version:
Python 3.6.3

Additional context

When I checked in Slack, I was told this may be a solved problem. However, I do notice a significant difference in query time between the two approaches. Let me know if I am missing something.

drkarthi added the bug (Something isn't working) and triage labels on May 11, 2020
jtcohen6 (Contributor) commented May 11, 2020

I believe the relevant bit of code is here:
https://github.com/fishtown-analytics/dbt/blob/8686ab9a9ddffb63b64d3ee2861f132bf77f54c1/core/dbt/include/global_project/macros/materializations/incremental/incremental.sql#L20-L27

I think the cause of the delay you're seeing is that dbt doesn't use the same approach here that it does in the table materialization.

Here's the table order of operations (sketched in SQL after this list):

  1. Create the new object as an intermediate_relation
  2. Rename target_relation --> backup_relation
  3. Rename intermediate_relation --> target_relation
  4. Close the transaction
  5. Drop the backup_relation
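
A minimal SQL sketch of that flow, assuming a Postgres/Redshift-style adapter; the relation names and model SQL are illustrative:

begin;
create table "analytics"."my_model__dbt_tmp" as (select 1 as id);    -- 1. build the intermediate (the model's SQL goes here)
alter table "analytics"."my_model" rename to "my_model__dbt_backup"; -- 2. target -> backup
alter table "analytics"."my_model__dbt_tmp" rename to "my_model";    -- 3. intermediate -> target
commit;                                                              -- 4. close the transaction
drop table "analytics"."my_model__dbt_backup";                       -- 5. drop the backup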

The full-refresh mode of the incremental materialization instead does (again sketched below):

  1. Rename target_relation --> backup_relation
  2. Create the new object as target_relation
  3. Close the transaction
  4. Drop the backup_relation
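
The same flow in SQL. The slow step now runs after the rename, and on Postgres/Redshift the rename takes an exclusive lock that is held until commit, which would explain the long lock reported above:

begin;
alter table "analytics"."my_model" rename to "my_model__dbt_backup"; -- 1. takes an exclusive lock
create table "analytics"."my_model" as (select 1 as id);             -- 2. slow build; the lock is still held
commit;                                                              -- 3. lock released only here
drop table "analytics"."my_model__dbt_backup";                       -- 4. drop the backup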

It's worth saying that, in much older versions of dbt, the way the incremental materialization executed full-refresh runs was by... dropping the preexisting table before even opening the transaction to recreate it. So this is definitely better than that, if still not as good as it could be.

@drewbanin Do you know a good reason for the difference between the materializations? Maybe this is something we could think about for broader materialization work in 0.18.0?

jtcohen6 added the redshift and enhancement labels and removed the triage and bug labels on May 11, 2020
drewbanin (Contributor) commented:

hey @jtcohen6 - I can't think of a good reason why incrementals would work differently than tables on redshift. I'd be very supportive of changing the atomic swap flow for 0.18.0!

drewbanin added this to the Marian Anderson milestone on May 12, 2020
drkarthi (Contributor, Author) commented:

That's great! I am currently working on a custom materialization for our project using the pattern described in @jtcohen6's response. I would be happy to contribute it if that works!

drkarthi (Contributor, Author) commented May 18, 2020

As discussed above, switching to the table order of operations removed the lock on the table during the build and resolved our issue. My change is a custom materialization that replaces the following lines:

https://github.com/fishtown-analytics/dbt/blob/1d543b373734ab78e0ebf441888fb7d7c7d6e8fd/core/dbt/include/global_project/macros/materializations/incremental/incremental.sql#L25-L26

to something like:

{% set tmp_identifier = this.identifier ~ "__dbt_tmp" %}
{% set tmp_relation = api.Relation.create(identifier=tmp_identifier,
                                          schema=this.schema,
                                          database=this.database,
                                          type='table') %}

{% do run_query(create_table_as(False, tmp_relation, sql)) %}
{% do adapter.rename_relation(target_relation, backup_relation) %}
{% do adapter.rename_relation(tmp_relation, target_relation) %}
{# define build_sql so that the materialization does not fail #}
{% set build_sql = "select 1;" %}

Let me know if it would be helpful to contribute this in a PR, or if it would be better for the team to make the change since it touches important functionality. There are also definitely improvements to be made, like removing the dummy SQL statement.

drewbanin (Contributor) commented:

@drkarthi sure thing - please do feel free to PR a change for this against dev/marian-anderson. If you have any questions about how to structure the code to remove the dummy sql statement, we'd be happy to take a look at the diff and send through some ideas :)

jtcohen6 added the good_first_issue label on Jul 6, 2020
jtcohen6 removed this from the Marian Anderson milestone on Jul 6, 2020
TAJD commented Jan 6, 2021

Has this made it into a release yet? I'm experiencing this issue with Postgres.

kwigley (Contributor) commented Jan 6, 2021

@TAJD This issue is still open and is not slated for a release (or 0.19 specifically) at the moment. Drew's comment above still applies though! We are happy to take a look at a PR contribution.

TAJD commented Jan 6, 2021

> @TAJD This issue is still open and is not slated for a release (or 0.19 specifically) at the moment. Drew's comment above still applies though! We are happy to take a look at a PR contribution.

Thanks for clarifying! I'll take a look.

drkarthi (Contributor, Author) commented Jan 8, 2021

Just saw that a commit has already been pushed; I'm closing my PR and will follow the changes there :)

drkarthi (Contributor, Author) commented Jan 8, 2021

One additional thing I tried was removing the dummy SQL statement {% set build_sql = "select 1;" %} and moving the call statement into the else block. It leads to the following error:

Completed with 1 error and 0 warnings:

'NoneType' object has no attribute 'status'

Here is the related code change:

{% elif existing_relation.is_view or should_full_refresh() %}
    {#-- Make sure the backup doesn't exist so we don't encounter issues with the rename below #}
    {% set backup_identifier = existing_relation.identifier ~ "__dbt_backup" %}
    {% set backup_relation = existing_relation.incorporate(path={"identifier": backup_identifier}) %}
    {% do adapter.drop_relation(backup_relation) %}

    {% set tmp_identifier = this.identifier ~ "__dbt_tmp" %}
    {% set tmp_relation = api.Relation.create(identifier=tmp_identifier,
                                              schema=this.schema,
                                              database=this.database,
                                              type='table') %}

    {% do run_query(create_table_as(False, tmp_relation, sql)) %}
    {% do adapter.rename_relation(target_relation, backup_relation) %}
    {% do adapter.rename_relation(tmp_relation, target_relation) %}
    {% do to_drop.append(backup_relation) %}
{% else %}
    {% set tmp_relation = make_temp_relation(target_relation) %}
    {% do run_query(create_table_as(True, tmp_relation, sql)) %}
    {% do adapter.expand_target_column_types(
           from_relation=tmp_relation,
           to_relation=target_relation) %}
    {% set build_sql = incremental_upsert(tmp_relation, target_relation, unique_key=unique_key) %}
    {% call statement("main") %}
      {{ build_sql }}
    {% endcall %}
{% endif %}
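
My guess at the cause, not confirmed against the dbt source: the runner appears to look up a statement result named "main" after the materialization finishes, so when the full-refresh branch never runs statement("main"), the lookup returns None and reading .status fails. If that's right, the dummy select could be dropped by routing the create through the named statement in the full-refresh branch:

{% call statement("main") %}
    {{ create_table_as(False, tmp_relation, sql) }}
{% endcall %}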

TAJD commented Jan 11, 2021

Thanks @drkarthi. The existing test suite passes with the modification, but I'm not sure how to write a test that confirms this behaviour.

My thoughts are to either confirm that various statements exist in the compiled SQL or to run a check mid-build. I'm going to look through the existing dbt test suite for ideas - happy to receive input in the meantime!

A workaround that solved the use case in my situation was to write a macro that concurrently refreshes a materialized view, allowing zero downtime (our production, testing, and analytics databases are the same thing ;)).
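
For anyone wanting the same workaround, a minimal sketch of such a macro, assuming Postgres; the macro name is made up, and refresh ... concurrently requires a unique index on the materialized view:

{% macro refresh_matview_concurrently(matview) %}
    {# readers keep seeing the old contents while the refresh runs #}
    {% do run_query("refresh materialized view concurrently " ~ matview) %}
{% endmacro %}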

drkarthi (Contributor, Author) commented Jan 11, 2021

@TAJD I tested the changes by checking mid-build and they were working. I don't know of a way to formally test them.

drkarthi (Contributor, Author) commented:

@TAJD have you had a chance to look at this? Would it be useful if I reopened my PR so we can wait for feedback from the maintainers?

TAJD commented Jan 27, 2021

@drkarthi please do - I'm sorry I lost focus on this after I got a fix in production. I ended up writing a macro to persist data in a separate table. Thanks for your effort!

Limess commented Apr 9, 2021

Hey @drkarthi, it looks like your PR had a small amount of feedback but was close to being ready to be merged. Any chance you could take a look? This issue is affecting us fairly regularly and it'd be great to see it resolved.

drkarthi (Contributor, Author) commented Apr 9, 2021

Sorry @Limess, I had lost track of this. I will address the PR comments this weekend.

jtcohen6 added this to the Margaret Mead milestone on Apr 13, 2021