Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot Strategy invalidate_hard_deletes, column dbt_updated_at does not update after first execution #2935

Closed
sebastianvillarroel opened this issue Dec 4, 2020 · 1 comment
Labels
bug Something isn't working

Comments

@sebastianvillarroel
Copy link

Describe the bug

While testing the feature flag invalidate_hard_deletes, I noticed that the column dbt_updated_at has the same value as the column dbt_valid_from, which does not truly represent when the row was last updated.

Steps To Reproduce

  1. execute dbt snapshot
  2. Pick one row, modify the value of a column which you know included in the snapshot configuration
  3. execute dbt snapshot
  4. Delete the same row you changed in step 2.
  5. execute dbt snapshot

Expected behavior

I expect that the column dbt_update_at represents when dbt modified last time the record.

Screenshots and log output

This is the result of our testing
image

System information

Which database are you using dbt with?

  • [x ] redshift

The output of dbt --version:

installed version: 0.19.0-b1
   latest version: 0.18.1

Your version of dbt is ahead of the latest release!

Plugins:
  - redshift: 0.19.0b1
  - postgres: 0.19.0b1
  - bigquery: 0.19.0b1
  - snowflake: 0.19.0b1

The operating system you're using:

Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.1 LTS
Release:	20.04
Codename:	focal

The output of python --version:

Python 3.8.5
@sebastianvillarroel sebastianvillarroel added bug Something isn't working triage labels Dec 4, 2020
@jtcohen6 jtcohen6 removed the triage label Dec 4, 2020
@jtcohen6
Copy link
Contributor

jtcohen6 commented Dec 4, 2020

@sebastianvillarroel Thanks for the write-up!

Traditionally, snapshots record dbt_valid_from and dbt_updated_at when created. When it's time to update rows to mark them as "expired", the snapshot only change the dbt_valid_to field. We took the same approach for hard deletes: their dbt_valid_to would change to reflect that they're no longer valid, but the value of dbt_updated_at itself would not change.

In large part, this is because dbt_updated_at is for dbt's use, as a fallback mechanism in case a snapshot has switched strategies. (See this comment and #2350 for details.)

Here's what we have in the docs today:

Field Meaning Usage
dbt_valid_from The timestamp when this snapshot row was first inserted This column can be used to order the different "versions" of a record.
dbt_valid_to The timestamp when this row row became invalidated. The most recent snapshot record will have dbt_valid_to set to null.
dbt_scd_id A unique key generated for each snapshotted record. This is used internally by dbt
dbt_updated_at The updated_at timestamp of the source record when this snapshot row was inserted. This is used internally by dbt

As such, I'm hesitant to change the behavior here; I'd rather document and make clear that dbt_valid_to is the field to use for understanding when a hard-deleted record was marked missing, not dbt_updated_at.

I'm going to close this issue for now, but I'm open to hearing your disagreement! The code change involved here would be quite straightforward.

@jtcohen6 jtcohen6 closed this as completed Dec 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants