Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an archival strategy that operates by checking for column value diffs by PK #706

Closed
drewbanin opened this issue Mar 26, 2018 · 4 comments · Fixed by #1361
Closed

Add an archival strategy that operates by checking for column value diffs by PK #706

drewbanin opened this issue Mar 26, 2018 · 4 comments · Fixed by #1361
Labels
enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors! snapshots Issues related to dbt's snapshot functionality

Comments

@drewbanin
Copy link
Contributor

drewbanin commented Mar 26, 2018

Edit: See the 'check' strategy defined in #1175

dbt's implementation of archive could be scoped to specific columns. If these columns haven't changed between invocations of dbt archive, then new rows would not need to be inserted.

Most of this logic can be implemented in the materialization, though we'll also need to add a field to the archive: block in the dbt_project.yml file.

@drewbanin drewbanin added enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors! labels Mar 26, 2018
@drewbanin
Copy link
Contributor Author

  • use archive block?
  • specify key columns to check?

@mplovepop
Copy link

mplovepop commented May 9, 2018

Possibly separate issues, but I was thinking along two lines: 1. A column excludes list. Our ETL brings in dozens of denormalized columns per order that would bloat the archive to no purpose. 2. Column data transform. One of our ETL brings in datetimes as varchars and it would be nice to massage into a timestamp first. I already do this with the updated_at option: coalesce(nullif(updated_at, ''), '2015-01-01T00:00:00+00:00')::timestamptz

@drewbanin
Copy link
Contributor Author

@mplovepop our current thinking involves 1) a new "archive" block and 2) archiving a query, instead of a table. This might look like:


{% archive(target_schema='dbt_archived', target_table='orders_archived') %}

select
  id,
  status,
  coalesce(nullif(updated_at, ''), '2015-01-01T00:00:00+00:00')::timestamptz as updated_at

from source_data.status

{% endarchive %}

Still some more thinking required here about the exact interface, but do you buy the general approach? I think it might work for the use cases you mentioned here

@mplovepop
Copy link

Where would that go? I like the general approach and maybe require certain archive column names, _dbt_unique_key and _dbt_updated_at that would be used when computing _dbt_sdc_id, _dbt_valid_from, _dbt_valid_to (as a suggestion to go along with another archive issue I saw here).

@drewbanin drewbanin added the snapshots Issues related to dbt's snapshot functionality label Jul 11, 2018
@drewbanin drewbanin added this to the Wilt Chamberlain milestone Nov 28, 2018
@drewbanin drewbanin changed the title Make archive column-aware Add an archival strategy that operates by checking for column value diffs by PK Mar 23, 2019
beckjake added a commit that referenced this issue Apr 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors! snapshots Issues related to dbt's snapshot functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants