Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2701] [Bug] Snapshot dbt_updated_at not updated #7869

Open
2 tasks done
Tracked by #10151
patkearns10 opened this issue Jun 14, 2023 · 8 comments
Open
2 tasks done
Tracked by #10151

[CT-2701] [Bug] Snapshot dbt_updated_at not updated #7869

patkearns10 opened this issue Jun 14, 2023 · 8 comments
Labels
snapshots Issues related to dbt's snapshot functionality user docs [docs.getdbt.com] Needs better documentation

Comments

@patkearns10
Copy link

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

This was brought up by a customer and I am kinda surprised I have never heard this before.

Our docs show

two fields in the existing row should be updated when the updated_at timestamp is changed signifying a change of some value, the dbt_valid_to and the dbt_updated_at

image

Looks like only one update is happening and the dbt_updated_at does not get updated.

Screen Shot 2023-06-14 at 2 06 07 PM

Expected Behavior

According to the docs, the older row should have two fields updated:
dbt_valid_to
dbt_updated_at

Steps To Reproduce

  • create snapshot with timestamp check
  • add record 1 as id, '2023-01-01' as updated_at
  • run snapshot
  • modify record 1 as id, '2023-01-02' as updated_at
  • run snapshot
  • check the records in the output and see if dbt_updated_at has been modified in the original row

Relevant log output

No response

Environment

- OS:
- Python:
- dbt: tested 1.1 and 1.4

Which database adapter are you using with dbt?

No response

Additional Context

No response

@patkearns10 patkearns10 added bug Something isn't working triage labels Jun 14, 2023
@github-actions github-actions bot changed the title [Bug] Snapshot dbt_updated_at not updated [CT-2701] [Bug] Snapshot dbt_updated_at not updated Jun 14, 2023
@dbeatty10 dbeatty10 added the snapshots Issues related to dbt's snapshot functionality label Jun 14, 2023
@dbeatty10
Copy link
Contributor

dbeatty10 commented Jun 15, 2023

Thanks for opening this @patkearns10!

See below for the reproducible example I'm using (with dbt-postgres) that shows the same thing that you are reporting.

But the thing I'm wondering: is this actually a problem that needs to be solved in the implementation of snapshots that use the timestamp strategy? Or would updating the documentation suffice?

Reprex

snapshots/my_snapshot.sql

{% snapshot my_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at',
    )
}}

{% if var("my_data_version", 1) == 1 %}

    select 1 as id, '2023-01-01'::timestamp as updated_at

{% else %}

    select 1 as id, '2023-01-02'::timestamp as updated_at

{% endif %}

{% endsnapshot %}

Take the 1st snapshot:

dbt snapshot --vars '{"my_data_version": 1}'
dbt show --inline "select * from {{ ref('my_snapshot') }}"
id updated_at dbt_scd_id dbt_updated_at dbt_valid_from dbt_valid_to
1 2023-01-01 00:00:00 5aec37f35c393a7ef... 2023-01-01 00:00:00 2023-01-01 00:00:00

Take a 2nd snapshot:

dbt snapshot --vars '{"my_data_version": 2}'
dbt show --inline "select * from {{ ref('my_snapshot') }}"
id updated_at dbt_scd_id dbt_updated_at dbt_valid_from dbt_valid_to
1 2023-01-01 00:00:00 5aec37f35c393a7ef... 2023-01-01 00:00:00 2023-01-01 00:00:00 2023-01-02 00:00:00
1 2023-01-02 00:00:00 8ff9567d22a84a2c9... 2023-01-02 00:00:00 2023-01-02 00:00:00

Here's the default implementation, which we can see only updates the dbt_valid_to (but not dbt_updated_at) for rows that are no longer valid:

when matched
and DBT_INTERNAL_DEST.dbt_valid_to is null
and DBT_INTERNAL_SOURCE.dbt_change_type in ('update', 'delete')
then update
set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to

Here's the associated implementation for dbt-postgres:

update {{ target }}
set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to
from {{ source }} as DBT_INTERNAL_SOURCE
where DBT_INTERNAL_SOURCE.dbt_scd_id::text = {{ target }}.dbt_scd_id::text
and DBT_INTERNAL_SOURCE.dbt_change_type::text in ('update'::text, 'delete'::text)
and {{ target }}.dbt_valid_to is null;

@patkearns10
Copy link
Author

If we do nothing, we should update the documentation.
If we want it to be, in my opinion, correct, then we should update the code and not update the documentation.

In my opinion it does not make sense that the dbt_valid_to would be greater timestamp than dbt_updated_at. The record was updated, so it should have a dbt_updated_at at that time.

I do understand this is changing something that has been around for 4 years, so there is probably some concern there.

@dbeatty10
Copy link
Contributor

Agreed that it's been implemented this way for 4+ years and it might do more harm than good to change it. At the very least, we'd need to understand all the implications extremely thoroughly before considering making such a change.

What row-level data point would a user want that they can't get right now?

Right now, it looks like dbt_updated_at basically means "created at" (when using the check strategy) and "earliest valid time" otherwise. If a user wants the latest time at which the row was "modified at", then this might do the trick:

coalesce(`dbt_valid_to`, `dbt_valid_from`)

If you are always looking for a "system time" (a la SQL:2011), then that is not possible with the timestamp strategy, because that information is not always* captured and stored -- it is only captured and stored for the check strategy.

* When using invalidate_hard_deletes and a key is hard-deleted, dbt_valid_to will be a "system time" rather than an "application time".

I'm leaning heavily towards leaving the implementation as-is and updating the documentation.

We have a listing of gotchas related to snapshots listed here, and the situation you are bringing up looks like it would another good entry for that list.

@dbeatty10 dbeatty10 added awaiting_response user docs [docs.getdbt.com] Needs better documentation and removed triage labels Jun 15, 2023
@dbeatty10
Copy link
Contributor

@patkearns10 Thanks again for noticing this behavior and posting an elucidating write-up 🤩 Since there would be more risk than reward to change the behavior, I'm going to close this as wont_fix. I've opened a PR to update the docs in dbt-labs/docs.getdbt.com#3540

An option in user space

If a user needs/wants the data to be populated differently, they can override the key macro in their dbt project like this (if they are using Postgres):

{% macro postgres__snapshot_merge_sql(target, source, insert_cols) -%}
    {%- set insert_cols_csv = insert_cols | join(', ') -%}

    update {{ target }}
    set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to
    from {{ source }} as DBT_INTERNAL_SOURCE
    where DBT_INTERNAL_SOURCE.dbt_scd_id::text = {{ target }}.dbt_scd_id::text
      and DBT_INTERNAL_SOURCE.dbt_change_type::text in ('update'::text, 'delete'::text)
      and {{ target }}.dbt_valid_to is null;

    insert into {{ target }} ({{ insert_cols_csv }})
    select {% for column in insert_cols -%}
        DBT_INTERNAL_SOURCE.{{ column }} {%- if not loop.last %}, {%- endif %}
    {%- endfor %}
    from {{ source }} as DBT_INTERNAL_SOURCE
    where DBT_INTERNAL_SOURCE.dbt_change_type::text = 'insert'::text;
{% endmacro %}

The logic to use would vary by adapter. If an adapter doesn't provide their own override of snapshot_merge_sql, then it use the default logic here, and that would be the code to copy-paste-modify into their own dbt project.

@dbeatty10 dbeatty10 closed this as not planned Won't fix, can't repro, duplicate, stale Jun 15, 2023
@dbeatty10 dbeatty10 added the wontfix Not a bug or out of scope for dbt-core label Jun 15, 2023
dbeatty10 added a commit to dbt-labs/docs.getdbt.com that referenced this issue Jun 15, 2023
[Preview](https://deploy-preview-3540--docs-getdbt-com.netlify.app/docs/build/snapshots#snapshot-meta-fields)

## What are you changing in this pull request and why?
The values in the examples for `dbt_updated_at` weren't an accurate
portrayal.

See dbt-labs/dbt-core#7869 for context.

### 🎩 

<img width="400" alt="image"
src="https:/dbt-labs/docs.getdbt.com/assets/44704949/04170f33-3317-4d66-86f4-ce166bfa1b50">

Formatting of some timestamps was fixed as well:

<img width="200" alt="image"
src="https:/dbt-labs/docs.getdbt.com/assets/44704949/7bd71f6d-c9b3-4c5a-974a-044d8f0a2773">

## Checklist
- [x] Review the [Content style
guide](https:/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md)
and [About
versioning](https:/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#adding-a-new-version)
so my content adheres to these guidelines.
@dbeatty10 dbeatty10 removed the triage label Jun 15, 2023
@patkearns10
Copy link
Author

patkearns10 commented Jun 16, 2023

Right on, I understand the sentiment about not wanting to update the code.

Your workaround is precisely how I tested as well:
Screen Shot 2023-06-16 at 10 37 31 AM

@ntsteph
Copy link

ntsteph commented Sep 20, 2023

hi @patkearns10 and @dbeatty10,

thanks for this post, i had the same issue and fix this that way.

however by overriding the macro snapshot_merge_sql in he dbt project, it looks like this affect all existing strategies (timestamp, check, custom strategy created).

Is it possible to change the default behaviour of the macro only for one strategy (for example the custom strategy created) and not the others in he dbt project ?

Thanks in advance

@dbeatty10
Copy link
Contributor

@ntsteph it sounds like you've created a custom snapshot strategy, and you're wondering if you can change the default behavior of snapshot_merge_sql for just that custom snapshot strategy?

If so, you'd need to override the definition of the snapshot materialization for your adapter (here for dbt-postgres). So although it's possible, there's no "easy" way to override the logic for snapshot_merge_sql for a single snapshot strategy (but not the others).

@graciegoheen graciegoheen reopened this May 28, 2024
@graciegoheen graciegoheen removed wontfix Not a bug or out of scope for dbt-core bug Something isn't working labels May 28, 2024
@graciegoheen
Copy link
Contributor

graciegoheen commented May 29, 2024

From snapshot discussion:

Problem 7: dbt_updated_at is a "system time" for the check strategy (a la SQL:2011), but it is an "application time" for the timestamp strategy.

  • 💡 Feature idea 9: What if dbt either added a new column named dbt_modified_at or actually updated dbt_updated_at at the same time as dbt_valid_to?
  • 💡 Feature idea 10: What if dbt treated dbt_modified_at/dbt_updated_at as a "system time" for both snapshot strategies (since dbt_valid_from & dbt_valid_to represent the "application time" for both strategies)?
  • 💡 Feature idea 11: To complement dbt_modified_at, what if dbt added an immutable dbt_created_at column (representing the "system time" that the row was inserted)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
snapshots Issues related to dbt's snapshot functionality user docs [docs.getdbt.com] Needs better documentation
Projects
None yet
Development

No branches or pull requests

4 participants