DBT Snapshot filling in dbt_valid_to for most recent record #52

Closed
1 task done
dhodge250 opened this issue Nov 15, 2022 · 26 comments

Assignees
Labels
bug Something isn't working

Comments

@dhodge250 commented Nov 15, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

After running the dbt snapshot command and building a snapshot table with the invalidate_hard_deletes parameter set to True, various records are having their dbt_valid_to dates populated with the latest dbt_updated_at value, even though these records are still considered valid and have not been removed from their source (staging) tables. We are using the following code to build each snapshot:

{% snapshot tmp_device_ad %}

{{
    config(
           target_schema = 'dbt',
           strategy = 'timestamp',
           unique_key = 'snap_id',
           updated_at = 'update_dt',
           invalidate_hard_deletes = True
          )
}}

select ad.*

  from {{ ref('stg_device_ad') }} ad

{% endsnapshot %}

The number of affected records is inconsistent between snapshots (sometimes a few dozen, sometimes a few hundred out of tables of up to 30k rows), but it appears to affect the same records on each run.

Expected Behavior

When building a dbt snapshot table with the invalidate_hard_deletes parameter set to True, I expect dbt to fill in the dbt_valid_to value ONLY if a record no longer exists in the source table, OR if a record has changed in the source table, in which case a new record is created in the snapshot table and the previous record is marked as invalid. An active record in the snapshot table SHOULD NOT be marked with a dbt_valid_to date; it should instead have a value of null.
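
To illustrate the expectation, a check like the following should return no rows (a hedged sketch using the table names from the config above; the join on update_dt assumes an unchanged record keeps the same timestamp):

-- snapshot rows that are still present and unchanged in the source
-- but were nevertheless given a dbt_valid_to value
SELECT t.snap_id, t.dbt_valid_to
  FROM dbt.tmp_device_ad t
  JOIN dbt.stg_device_ad s
    ON s.snap_id = t.snap_id
   AND s.update_dt = t.update_dt
 WHERE t.dbt_valid_to IS NOT NULL;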

Steps To Reproduce

  1. Environment: Windows 10 Enterprise 21H2 19044.2251
  2. Config:
{% snapshot tmp_device_ad %}

{{
    config(
           target_schema = 'dbt',
           strategy = 'timestamp',
           unique_key = 'snap_id',
           updated_at = 'update_dt',
           invalidate_hard_deletes = True
          )
}}

select ad.*

  from {{ ref('stg_device_ad') }} ad

{% endsnapshot %}
  3. Run: dbt snapshot
  4. No errors are generated

Relevant log output using --debug flag enabled

No response

Environment

- OS: Windows 10 Enterprise 21H2 19044.2251
- Python: 3.9.11
- dbt: 1.3.0

What Oracle database version are you using dbt with?

19c

Additional Context

This issue looks to be similar, if not identical, to #2390 from dbt-core a few years ago that was resolved in v0.17.0. I've created a fork to play around with this and see if the two issues are related.

@dhodge250 dhodge250 added the bug Something isn't working label Nov 15, 2022
@aosingh aosingh self-assigned this Nov 15, 2022
@aosingh (Member) commented Nov 15, 2022

Thanks @dhodge250 for the detailed report. I am trying to reproduce this.

@aosingh (Member) commented Nov 15, 2022

I think the issue could be related to duplicate values of dbt_scd_id. dbt snapshot uses a MERGE INTO statement to update the column dbt_valid_to:

MERGE INTO target t
USING staging s
ON (s.dbt_scd_id = t.dbt_scd_id)

dbt-oracle uses ORA_HASH to compute dbt_scd_id. It could be that the column dbt_scd_id has the same value for 2 different change types (updates and inserts).

I checked dbt Labs' implementation for PostgreSQL and it seems that they use an MD5 hash to compute dbt_scd_id, although MD5 still has some chance of a collision.

Let me test with the STANDARD_HASH function Oracle DB provides. It supports MD5 and also multiple stronger hashing alternatives like SHA-1, SHA-256, etc.

I will share a macro here which you can include in your dbt project and that should override the default implementation of hashing method included in dbt-oracle.

@aosingh (Member) commented Nov 15, 2022

@dhodge250

Could you include the following macro in your dbt project and then test dbt snapshot? You can create a file macros/hash.sql in your dbt project and put the macro definition there.

{% macro oracle__snapshot_hash_arguments(args) -%}
    STANDARD_HASH({%- for arg in args -%}
        coalesce(cast({{ arg }} as varchar(4000) ), '')
        {% if not loop.last %} || '|' || {% endif %}
    {%- endfor -%}, 'SHA256')
{%- endmacro %}

As we are changing the hashing method for dbt_scd_id, it is better to start fresh; otherwise there will be a datatype mismatch error when building the staging table.

Let me know if you have any questions

@dhodge250 (Author)

@aosingh thank you for the update and quick response!

I will add the macro to our project and start testing and let you know what I see for results.

@dhodge250 (Author)

@aosingh

After adding the hash.sql macro you provided to the macros folder in our dbt project, I'm getting the ORA-01790: expression must have same datatype as corresponding expression error during the Insertions selection. I've attached an excerpt of the dbt log file below.

For additional context, we are utilizing a separate MD5 hashing macro to generate the unique_id values in the source tables used for each snapshot.

{%- macro surrogate_key(field_list) -%}

{#-- Combine strings from list into single string, create hash column--#}
{%- set key_as_string = field_list | join("|| '-' ||") -%}
{{ "rawtohex(standard_hash(" ~ key_as_string ~ ", 'MD5'))" }}

{%- endmacro -%}

To my knowledge, these IDs should not be responsible for the ORA-01790 error, since they are only used as inputs when generating the dbt_scd_id. We use this macro to create a consistently formatted unique_id prior to creating each snapshot table (a hypothetical usage is sketched below).
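
For context, a hypothetical usage of that macro in a staging model (the column and model names here are invented for illustration) looks like:

select {{ surrogate_key(['device_id', 'ad_source']) }} as snap_id,
       ad.*
  from {{ ref('raw_device_ad') }} ad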

dbt.log

@aosingh (Member) commented Nov 16, 2022

@dhodge250

ORA-01790 is caused by a datatype mismatch on dbt_scd_id.

By default (old method) dbt_scd_id's datatype is NUMBER, whereas with the new macro it is RAW(32 BYTE).

This is due to the difference between the ORA_HASH and STANDARD_HASH functions. However, I highly recommend using STANDARD_HASH because of the strong hash algorithms it supports, to avoid hash collisions.
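
You can see the datatype difference directly with a quick illustrative query (not taken from the dbt logs):

-- ORA_HASH returns a NUMBER; STANDARD_HASH returns a RAW value
SELECT ORA_HASH('a|b')                AS ora_hash_value,
       STANDARD_HASH('a|b', 'SHA256') AS standard_hash_value
  FROM dual;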

There are 2 options going forward

Option 1:

Drop the table TMP_DEVICE_AD and start fresh. dbt snapshot will create it again.

Option 2:

The second option is to run an UPDATE on the column dbt_scd_id in TMP_DEVICE_AD using the STANDARD_HASH function, as shown below:

STANDARD_HASH(coalesce(cast(snap_id as varchar(4000) ), '') || 
       '|' || coalesce(cast(update_dt as varchar(4000) ), ''), 'SHA256')
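
Since the column's datatype also changes from NUMBER to RAW(32 BYTE), the full migration would look roughly like this (a hedged sketch, not a tested script; it assumes the snapshot table lives in the dbt schema):

-- add a RAW column, backfill it with the STANDARD_HASH value,
-- then swap it in for the old NUMBER-typed dbt_scd_id
ALTER TABLE dbt.tmp_device_ad ADD (dbt_scd_id_new RAW(32));

UPDATE dbt.tmp_device_ad
   SET dbt_scd_id_new = STANDARD_HASH(coalesce(cast(snap_id as varchar(4000) ), '') ||
       '|' || coalesce(cast(update_dt as varchar(4000) ), ''), 'SHA256');

ALTER TABLE dbt.tmp_device_ad DROP COLUMN dbt_scd_id;
ALTER TABLE dbt.tmp_device_ad RENAME COLUMN dbt_scd_id_new TO dbt_scd_id;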

@dhodge250 (Author)

@aosingh

I've opted for Option 2 to preserve the data in TMP_DEVICE_AD and have updated the data type for dbt_scd_id to RAW(32 BYTE). I was able to successfully run the TMP_DEVICE_AD snapshot using the new hash.sql macro, and have confirmed that the issue we are seeing with the dbt_valid_to values is still present (see image below for an example). I've compared the source STG_DEVICE_AD and snapshot tables and found 188 records from the source table (out of 9,831) that are still getting a dbt_valid_to value.

snapshot

@aosingh (Member) commented Nov 16, 2022

@dhodge250

Did these records exist prior to running the dbt snapshot command for which you originally reported the issue? If yes, dbt_valid_to will not be automatically unset by the snapshot command.

@dhodge250 (Author) commented Nov 17, 2022

@aosingh

No, these incorrect records did not exist prior to running the dbt snapshot command. After each snapshot table is built, we run a script to clean up the incorrect dbt_valid_to values and reset the valid records back to null (a sketch of its general shape is below). I can confirm that after this script is run, if I run the dbt snapshot command again, the same records that were just cleaned up have their dbt_valid_to values incorrectly populated again.
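
(The cleanup is roughly of this shape — a simplified sketch rather than the actual script:)

-- reopen snapshot rows whose source record still exists unchanged
UPDATE dbt.tmp_device_ad t
   SET t.dbt_valid_to = NULL
 WHERE t.dbt_valid_to IS NOT NULL
   AND EXISTS (SELECT 1
                 FROM dbt.stg_device_ad s
                WHERE s.snap_id = t.snap_id
                  AND s.update_dt = t.update_dt);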

@aosingh (Member) commented Nov 17, 2022

@dhodge250

Thank you for confirming. I think at this point it would be good to check what is in the staging table, since we update the snapshot table using the staging table. I have attached the SQL to create the staging table below. It would be good to check how many entries we have for the above record in the staging table.

create_staging_table.txt

@aosingh (Member) commented Nov 18, 2022

@dhodge250
Just wanted to confirm that by staging table, I meant the global temporary table which dbt snapshot creates first.

The attached SQL in the above comment looks like this:

CREATE GLOBAL TEMPORARY TABLE o$pt_tmp_device_ad152327
 ON COMMIT PRESERVE ROWS
  AS
    WITH snapshot_query AS (

SELECT ad.*

  FROM dbt.stg_device_ad ad
.....

Let me know if you have any questions

@dhodge250 (Author) commented Nov 18, 2022

@aosingh

I'll try creating the staging table for you in a bit using the SQL you've provided. I'll need to remove some of the PII from the table before I submit it, but it will still include the required snap_id and update_dt columns used to generate the dbt_scd_id value.

EDIT:
As requested, I've attached the global temporary table created by the dbt snapshot:
O_PT_TMP_DEVICE_AD074341_DATA_TABLE.csv

@aosingh (Member) commented Nov 18, 2022

Thank you @dhodge250 for sharing the staging table records. This was really helpful. As I had suspected, the same record is getting picked up for both insert and update.

DBT_CHANGE_TYPE | UPDATE_DT   | SNAP_ID                          | DBT_VALID_FROM                  | DBT_VALID_TO
insert          | 14-Nov-2022 | 0124CA8A277B605072F81615D49B47AC | 14-NOV-22 11.26.19.000000000 PM |
update          | 14-Nov-2022 | 0124CA8A277B605072F81615D49B47AC | 14-NOV-22 11.26.19.000000000 PM | 14-NOV-22 11.26.19.000000000 PM

These 2 rows should not show up in the staging table because there were no updates and the record is still valid, i.e. DBT_VALID_FROM >= UPDATE_DT. If you look at the above data, this condition is true, so these rows should not have been picked up.

Rows are picked up for staging only if DBT_VALID_FROM < UPDATE_DT (a simplified sketch of this selection logic follows the questions below).

I am trying to understand how these got picked up. Also, UPDATE_DT stores a truncated date (without time) whereas DBT_VALID_FROM stores the time part as well. After dbt snapshot, they should be in the same format.

  • What is the datatype for UPDATE_DT?
  • Does UPDATE_DT not store the time part?
  • You mentioned there is a script which updates DBT_VALID_TO. Does the script update any other columns?
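
A simplified sketch of how the timestamp strategy selects "update" rows for the staging table (illustrative only, not dbt-oracle's exact generated SQL):

-- source rows whose open snapshot record is older than updated_at
SELECT s.*
  FROM dbt.stg_device_ad s
  JOIN dbt.tmp_device_ad t
    ON t.snap_id = s.snap_id
 WHERE t.dbt_valid_to IS NULL
   AND t.dbt_valid_from < s.update_dt;  -- the row_changed condition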

@dhodge250 (Author) commented Nov 21, 2022

@aosingh

Your recent comment has me thinking more about the datetime data types being used in UPDATE_DT, and whether that could be the source of our issues. I'll provide you with the answers to your questions above in the next few days.

EDIT 11/22/2022
Here are the answers to your questions above:

  • What is the datatype for UPDATE_DT?

UPDATE_DT is just using the standard DATE data type. It is built through a case statement from data coming from both VARCHAR2 and DATE data types (a hypothetical sketch of that expression follows below).

  • Does UPDATE_DT not store the time part?

No, UPDATE_DT in this case does not store the time part, or at least we are not including it. In one of the date castings we are using to_date('19000101','yyyymmdd'). We have the option of including the time, but currently we are not.

  • You mentioned there is a script which updates DBT_VALID_TO. Does the script update any other columns?

No, the script only resets active records in the snapshot tables whose DBT_VALID_TO values are not null back to null. It does nothing else.
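
For illustration, a CASE expression of the kind described above might look like this (the source column names are invented; this is a sketch, not the actual model code):

CASE
  WHEN src_update_str IS NOT NULL
    THEN TO_DATE(src_update_str, 'YYYYMMDD')   -- VARCHAR2-typed source
  WHEN src_update_date IS NOT NULL
    THEN src_update_date                       -- DATE-typed source
  ELSE TO_DATE('19000101', 'YYYYMMDD')         -- default epoch date
END AS update_dt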

@aosingh (Member) commented Nov 28, 2022

Thanks for the answers @dhodge250

Having UPDATE_DT as a DATE data type should be fine. I don't think that is the issue.

If you don't save the time part in UPDATE_DT, then DBT_VALID_FROM and DBT_VALID_TO should also not have the time part. How DBT_VALID_FROM and DBT_VALID_TO store the time part is something I still don't understand.
Where does the time 11.26.19.000000000 PM come from?

Going back to the 2 rows, the column UPDATE_DT has the value 14-Nov-2022, which translates to 14-Nov-2022:00:00:00. Could you also confirm this?

Rows are picked up for staging if DBT_VALID_FROM < UPDATE_DT, and 14-NOV-22 11.26.19.000000000 PM is clearly greater than 14-Nov-2022:00:00:00, so these 2 rows should not have been picked up. I am still confused.

@dhodge250 (Author)

@aosingh

Apologies for the delay. I've had a chance to review the UPDATE_DT column again, and it looks as if it does include the time part. I was mistaken before, thinking that casting the column using to_date('19000101','yyyymmdd') would eliminate the time part from the originating column. Knowing that now, the 2 rows from the UPDATE_DT column both have the value 14-NOV-2022:23:26:19, which in 12-hour format is 14-NOV-2022 11:26:19 PM. It looks like the DBT_VALID_FROM and DBT_VALID_TO values are correct.

Sorry for the confusion that led to.

@ramapemmaraju

Closing this issue as per previous comment.

@dhodge250 (Author)

@ramapemmaraju

What is the resolution to this issue? In my previous comment I was just stating that the DBT_VALID_FROM and DBT_VALID_TO values that @aosingh had referenced in their comment

Where does the time 11.26.19.000000000 PM come from ?

were correct in their formatting based on the time being included in the date value. The issue I opened with snapshots is still present for us, unless you're saying that the time formatting is the issue here.

@ramapemmaraju

@dhodge250 sorry for wrongly closing this. Reopening.

@ramapemmaraju ramapemmaraju reopened this Dec 8, 2022
@aosingh (Member) commented Dec 12, 2022

@dhodge250

Let's compare UPDATE_DT and DBT_VALID_FROM values of only one record.

I have picked SNAP_ID=0124CA8A277B605072F81615D49B47AC

SELECT update_dt 
FROM dbt.stg_device_ad ad
WHERE ad.snap_id = '0124CA8A277B605072F81615D49B47AC'

UNION

SELECT dbt_valid_from AS update_dt
FROM dbt.tmp_device_ad
WHERE dbt_unique_key = '0124CA8A277B605072F81615D49B47AC'
AND dbt_valid_to IS NULL

This is to check whether DBT_VALID_FROM < UPDATE_DT for the records which are wrongly invalidated.

@dhodge250 (Author)

@aosingh

Here are the DBT_VALID_FROM and UPDATE_DT values for the record you selected:
image

Because the values are the same, I had to modify your code to select them individually:

SELECT TO_CHAR(update_dt, 'MM/DD/YY HH:MI:SS A.M.') AS update_dt
  FROM dbt.stg_device_ad ad
 WHERE ad.snap_id = '0124CA8A277B605072F81615D49B47AC'

 UNION ALL

SELECT TO_CHAR(dbt_valid_from, 'MM/DD/YY HH:MI:SS A.M.') AS update_dt
  FROM dbt.tmp_device_ad
 WHERE snap_id = '0124CA8A277B605072F81615D49B47AC'
   AND dbt_valid_to IS NULL

Is this what you were looking to see?

@aosingh (Member) commented Dec 13, 2022

@dhodge250

This looks correct.

Would you be open to discussing this issue on a Zoom call?
I recommend joining dbt Labs' Slack channel for Oracle DB and we can take it from there. I am a part of that group.

@aosingh (Member) commented Dec 14, 2022

@dhodge250

I think the issue has been identified, along with its fix.

It is related to the computation of dbt_scd_id. The problem is that when casting a date as varchar, the time part gets truncated (the implicit conversion uses the session's default date format, which typically has no time component), so two different versions of a record can produce the same dbt_scd_id hash.
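
A quick illustration of the truncation (the CAST output depends on the session's NLS_DATE_FORMAT; shown here assuming the common default DD-MON-RR, which has no time component):

SELECT CAST(ts AS VARCHAR(4000))             AS cast_default,  -- '14-NOV-22'
       TO_CHAR(ts, 'MM/DD/YY HH:MI:SS A.M.') AS explicit_fmt   -- '11/14/22 11:26:19 P.M.'
  FROM (SELECT TO_DATE('2022-11-14 23:26:19', 'YYYY-MM-DD HH24:MI:SS') AS ts FROM dual);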

There are 2 steps to follow to test the fix:

  1. Run an UPDATE on the column dbt_scd_id in TMP_DEVICE_AD as shown below. Notice the TO_CHAR
STANDARD_HASH(coalesce(cast(snap_id as varchar(4000) ), '') || 
       '|' || coalesce(cast(TO_CHAR(update_dt, 'MM/DD/YY HH:MI:SS A.M.') as varchar(4000) ), ''), 'SHA256')
  2. Include the following macro in your project and then run dbt snapshot
{% macro snapshot_timestamp_strategy(node, snapshotted_rel, current_rel, config, target_exists) %}
    {% set primary_key = config['unique_key'] %}
    {% set updated_at = config['updated_at'] %}
    {% set invalidate_hard_deletes = config.get('invalidate_hard_deletes', false) %}

    {% set row_changed_expr -%}
        ({{ snapshotted_rel }}.dbt_valid_from < {{ current_rel }}.{{ updated_at }})
    {%- endset %}
    {% set scd_id_expr = snapshot_hash_arguments([primary_key, 'TO_CHAR(' ~ updated_at ~ ', \'MM/DD/YY HH:MI:SS A.M.\')']) %}

    {% do return({
        "unique_key": primary_key,
        "updated_at": updated_at,
        "row_changed": row_changed_expr,
        "scd_id": scd_id_expr,
        "invalidate_hard_deletes": invalidate_hard_deletes
    }) %}
{% endmacro %}

@dhodge250 (Author)

@aosingh

My schedule is a little sporadic right now, so I'm just seeing your message from yesterday. I would be available for a Zoom call on 12/15 if you would still like to discuss this issue.

I was able to follow your recommendations and run an UPDATE on the dbt_scd_id column in TMP_DEVICE_AD, and include the new snapshot_timestamp_strategy macro in my project. After running dbt snapshot --select tmp_device_ad I am no longer seeing active records in TMP_DEVICE_AD with a value in the dbt_valid_to column. I then followed the same steps on my other snapshot tables, and am not seeing active records with a value in dbt_valid_to in those tables either.

It appears that this issue was caused by the time part being truncated while casting the date as varchar, and your snapshot_timestamp_strategy has corrected the issue.
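
(For anyone verifying the same fix, a hedged sanity check, using the table and column names from this thread: every key still present in the source should have exactly one open record in the snapshot table.)

SELECT t.snap_id
  FROM dbt.tmp_device_ad t
 WHERE EXISTS (SELECT 1 FROM dbt.stg_device_ad s WHERE s.snap_id = t.snap_id)
 GROUP BY t.snap_id
HAVING SUM(CASE WHEN t.dbt_valid_to IS NULL THEN 1 ELSE 0 END) <> 1;
-- expected: no rows returned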

aosingh added a commit that referenced this issue Dec 14, 2022
- Added module and client_identifier as dbt connection string
- Bugfix for dbt snapshots #52
- oracledb dependency upgraded to 1.2.1
- Global temp relation name (time format) includes milliseconds
@aosingh (Member) commented Dec 22, 2022

This is merged in dbt-oracle==1.3.1

@dhodge250 (Author)

@aosingh Thank you for the support on this!
