Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] hash_collissions in dbt snapshot #154

Open
1 task done
espenhoh opened this issue Sep 3, 2024 · 1 comment
Open
1 task done

[Bug] hash_collissions in dbt snapshot #154

espenhoh opened this issue Sep 3, 2024 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@espenhoh
Copy link

espenhoh commented Sep 3, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

snapshot is implemented with the ora_hash() function.
This gives a lot of collissions in a large data set and duplicate ids, causing errors in the merge statement for data volumes of millions of rows.
dbt snapshot will fail once a hash collission appears, and the snapshot tables becomes impossible to update.

Expected Behavior

No collissions,

Steps To Reproduce

run dbt snapshot command on data with a few million updates.
First dbt snapshot command succeeds as there are only inserts, but duplicates in dbt_scd_id exists causing trouble on subsequest runs.

Relevant log output using --debug flag enabled

06:08:52    Database Error in snapshot table_name (snapshots\table_name.sql)
  ORA-30926: unable to get a stable set of rows in the source tables
  Help: https://docs.oracle.com/error-help/db/ora-30926/
  compiled Code at target\run\dbt_project\snapshots\table_name.sql

Environment

- OS: Windows
- Python: 3.11
- dbt: 1.8.1

What Oracle database version are you using dbt with?

19c

Additional Context

No response

@espenhoh espenhoh added the bug Something isn't working label Sep 3, 2024
@aosingh aosingh self-assigned this Sep 3, 2024
@aosingh
Copy link
Member

aosingh commented Sep 4, 2024

Hi @espenhoh

I understand the issue because we have seen these in the past.

#52
#102

And yes, we decided not to update ORA_HASH function because it would break existing snapshots, Users who are starting fresh snapshots should include the following macro in their dbt project. This uses a much stronger hash function - SHA256

{% macro oracle__snapshot_hash_arguments(args) -%}
    STANDARD_HASH({%- for arg in args -%}
        coalesce(cast({{ arg }} as varchar(4000) ), '')
        {% if not loop.last %} || '|' || {% endif %}
    {%- endfor -%}, 'SHA256')
{%- endmacro %}

For existing snapshots, to migrate, there is an option.

dbt snapshot uses MERGE INTO statement.

MERGE INTO target t
USING staging s
ON (s.dbt_scd_id = d.dbt_scd_id)

So, before using the new hash function you need to UPDATE the column dbt_scd_id in the snapshot table using the STANDARD_HASH function.

UPDATE <SNAPSHOT_TABLE> 
SET dbt_scd_id = STANDARD_HASH(<args>, 'SHA256')

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants