Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Timestamps are inconsistent between inserted/updated and deleted rows in snapshots #4347

Closed
1 task done
pcasteran opened this issue Nov 26, 2021 · 5 comments · Fixed by #4513
Closed
1 task done
Labels
bug Something isn't working good_first_issue Straightforward + self-contained changes, good for new contributors! snapshots Issues related to dbt's snapshot functionality

Comments

@pcasteran
Copy link
Contributor

pcasteran commented Nov 26, 2021

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

During the execution of a snapshot, the timestamp used for the dbt_updated_at, dbt_valid_from and dbt_valid_to columns is not consistent between the inserted, updated and deleted rows.

For the inserted and updated rows the code is {{ strategy.updated_at }} which, in BigQuery at least, is calculated beforehand in another query and is equal to CURRENT_TIMESTAMP() in that previous query. The value is then expanded in the actual snapshot query:

insertions_source_data as (
        select
            *,
            xyz as dbt_unique_key,
            TIMESTAMP("2021-11-25 11:25:01.151693+00:00") as dbt_updated_at,
            TIMESTAMP("2021-11-25 11:25:01.151693+00:00") as dbt_valid_from,
            nullif(TIMESTAMP("2021-11-25 11:25:01.151693+00:00"), TIMESTAMP("2021-11-25 11:25:01.151693+00:00")) as dbt_valid_to,
            to_hex(md5(concat(coalesce(cast(xyz as string), ''), '|',coalesce(cast(TIMESTAMP("2021-11-25 11:25:01.151693+00:00") as string), '')))) as dbt_scd_id

        from snapshot_query
    ),

On the other hand for the deleted rows, the code is : {{ snapshot_get_time() }} which is expanded as CURRENT_TIMESTAMP() in the main query:

deletes as (
        select
            'delete' as dbt_change_type,
            source_data.*,
            CURRENT_TIMESTAMP() as dbt_valid_from,
            CURRENT_TIMESTAMP() as dbt_updated_at,
            CURRENT_TIMESTAMP() as dbt_valid_to,
            snapshotted_data.dbt_scd_id
    
        from snapshotted_data
        left join deletes_source_data as source_data on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
        where source_data.dbt_unique_key is null
    )

As there are multiple queries involved in the execution of a snapshot, computed in different point in time, the values returned by the two calls of CURRENT_TIMESTAMP() are different and leads to different timestamps being used in the output.

Expected Behavior

A single execution of a snapshot should output the same "current timestamp" value for the inserted, updated and deleted rows.

I guess the fix should be to use {{ strategy.updated_at }} instead of {{ snapshot_get_time() }} when processing the deleted rows.

Steps To Reproduce

No response

Relevant log output

No response

Environment

No response

What database are you using dbt with?

bigquery

Additional Context

No response

@pcasteran pcasteran added bug Something isn't working triage labels Nov 26, 2021
@jtcohen6
Copy link
Contributor

@pcasteran Thanks for opening!

At first, I thought this issue might be the same as #3710 (data type mismatches between datetime and timestamp), but this is indeed a different thing.

Okay, let's dive in.

Snapshots on BigQuery are basically the same as they are elsewhere. The dbt-bigquery plugin just reimplements a few macros to handle syntactical idiosyncrasies.

The snapshot strategy determines the value of {{ strategy.updated_at }}. In the event that you're using the timestamp strategy, it's the actual value from the updated_at column. If you're using the check strategy, it's the value of {{ snapshot_get_time() }}, as run and saved in an earlier query:

{% set select_current_time -%}
select {{ snapshot_get_time() }} as snapshot_start
{%- endset %}
{#-- don't access the column by name, to avoid dealing with casing issues on snowflake #}
{%- set now = run_query(select_current_time)[0][0] -%}
{% if now is none or now is undefined -%}
{%- do exceptions.raise_compiler_error('Could not get a snapshot start time from the database') -%}
{%- endif %}
{% set updated_at = config.get('updated_at', snapshot_string_as_time(now)) %}

Why did we choose to do it that werid way a few years ago? It looks like it was to ensure that there was no overlap between a new record's valid_from and a previous record's valid_to (#1736, #1994, slack thread), especially on databases that require different queries for update + insert (versus a single atomic merge).

In fact, I think we rendered this workaround obsolete, by creating a single atomic staging table in #2390 for use with either a single atomic merge or a transactional update + insert. As such, this behavior is no longer necessary.

So, I think we could take either of these approaches:

  • Lower lift: The strategy adds a new field to its returned dictionary, strategy.hard_delete_timestamp, that's the same as strategy.updated_at (if check strategy) or just snapshot_get_time() (if timestamp). The staging table query uses strategy.hard_delete_timestamp here.
  • Better resolution: We should undo the weird run-and-store behavior of the check strategy, since the advent of a single snapshot_staging_table makes it unnecessary. Instead, the check strategy should just return snapshot_get_time() as its updated_at. Then, it will end up being the same as the timestamp used for hard deletes.

I think this could be a good first issue. Is it something you'd be interested in working on?

@jtcohen6 jtcohen6 added good_first_issue Straightforward + self-contained changes, good for new contributors! snapshots Issues related to dbt's snapshot functionality and removed triage labels Nov 29, 2021
@kadero
Copy link
Contributor

kadero commented Dec 3, 2021

Hello @jtcohen6 @pcasteran 👋,

Happy to take this one if it's ok for you ?

Thanks 🙏

@pcasteran
Copy link
Contributor Author

@kadero that's fine for me if you want to work on this one, thx :)

@dbeatty10
Copy link
Contributor

⚠️ The merged pull request (#4513) includes a breaking change. See below for details.

❓ Should this ticket be re-opened or should I create a new issue?

Proposed solution

Converting this to a non-breaking change might be as simple as:

-    {% set updated = snapshot_get_time() %}
+    {% set updated_at = config.get('updated_at', snapshot_get_time()) %}

Definitions

  • point-in-time snapshot - a stand-alone copy of a database table as it was at a particular moment
  • dbt snapshot - a process that can merge a data source into a type 2 SCD table

Problem overview

Some users have source tables that represent point-in-time snapshots. e.g., a table named users_20190701 might contain a read-only copy of the users table made at midnight on July 1st, 2019. The time of the copy was made is not recorded as an attribute within the table -- the only metadata representing the time of the copy is within the table name itself.

⚠️ If any of those users are using dbt snapshots to build a type 2 SCD they could be subject to a breaking change by #4513.

Example

This is a simple example showing changes to a table over 3 days, including insertion and deletions.

Configuring the table copies as a source

An example source configuration in dbt is:

version: 2

sources:
  - name: users
    database: raw
    schema: users
    tables:
      - name: user_{{ var("iso8601") }}

Configuring a dbt snapshot

As of this writing, specifying the updated_at parameter isn't documented here or here for the check_cols strategy. However, it can be configured like this (whether intended or not):

{% snapshot users_snapshot %}

{{
    config(
      target_database='analytics',
      target_schema='dbt_dbeatty_snapshots',
      unique_key='id',
      strategy='check',
      check_cols='all',
      updated_at="try_to_timestamp_ntz('" ~ var("iso8601") ~  "', 'YYYYMMDD')",
      invalidate_hard_deletes=True,
    )
}}

select * from {{ source("users", "user_" ~ var("iso8601")) }}

{% endsnapshot %}

Taking dbt snapshots

Processing successive days:

dbt snapshot --vars '{"iso8601": "20190701"}'
dbt snapshot --vars '{"iso8601": "20190702"}'
dbt snapshot --vars '{"iso8601": "20190703"}'

Output

Suppose consecutive dbt snapshots are taken for those 3 dates at the following timestamps:

  • 2022-04-15 12:39:05.656
  • 2022-04-15 12:39:20.298
  • 2022-04-15 12:39:37.009

Before #4513

ID COUNTER TIMESTAMP_COL DBT_SCD_ID DBT_UPDATED_AT DBT_VALID_FROM DBT_VALID_TO
a 10 2020-01-01 00:00:00.000 dd497b1ca8b29e7418cc4fa6c085fa1a 2019-07-01 00:00:00.000 2019-07-01 00:00:00.000 2019-07-02 00:00:00.000
b 20 2020-01-01 00:00:00.000 730a2204c478dca213c517f9c9f5366a 2019-07-01 00:00:00.000 2019-07-01 00:00:00.000 2022-04-15 12:39:37.009
a 30 2020-01-02 00:00:00.000 6fe5b3670032e43be3066a38fc8b9012 2019-07-02 00:00:00.000 2019-07-02 00:00:00.000
c 40 2020-01-02 00:00:00.000 af264ae0ed7ec7cdc5f5b7c63ca48e56 2019-07-02 00:00:00.000 2019-07-02 00:00:00.000

After #4513

ID COUNTER TIMESTAMP_COL DBT_SCD_ID DBT_UPDATED_AT DBT_VALID_FROM DBT_VALID_TO
a 10 2020-01-01 00:00:00.000 9a616725e3cd40065fcc9f0ba08ddeed 2022-04-15 12:39:05.656 2022-04-15 12:39:05.656 2022-04-15 12:39:20.298
b 20 2020-01-01 00:00:00.000 ae9a80e646d82a83e5585056b026942e 2022-04-15 12:39:05.656 2022-04-15 12:39:05.656 2022-04-15 12:39:37.009
a 30 2020-01-02 00:00:00.000 b481297925092458d5f121c067c6eb84 2022-04-15 12:39:20.298 2022-04-15 12:39:20.298
c 40 2020-01-02 00:00:00.000 2797d4f42a6a39d4d658c986844ad1a0 2022-04-15 12:39:20.298 2022-04-15 12:39:20.298

The output of the following columns differs before and after:

  • DBT_SCD_ID
  • DBT_UPDATED_AT
  • DBT_VALID_FROM
  • DBT_VALID_TO

(I know, timestamps in tables aren't the easiest to read 😅 )

The proposed solution is intended to preserve the update and also reestablish the previous behavior.

@jtcohen6
Copy link
Contributor

@dbeatty10 Nice catch, and thanks for the clear write-up!

I agree our goal should be preserving the previous behavior for folks who have configured it as such, while also supporting the improved default for folks who haven't.

Let's open a new ticket, and accompanying PR, since it sounds like the solution to accomplish that could be a quite-simple one-liner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good_first_issue Straightforward + self-contained changes, good for new contributors! snapshots Issues related to dbt's snapshot functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants