
Record failing rows for tests into an auditable table #903

Closed
drewbanin opened this issue Aug 6, 2018 · 17 comments · Fixed by #3316

@drewbanin
Contributor

Feature

Feature description

dbt should record rows that fail schema or data tests into tables, to be used for auditing. There are two general ways this could work:

  1. Insert into an errors table with a schema like:
     • run started at
     • invocation id
     • test name
     • failing row primary key / json
  2. Create a table for each test failure in a scratch schema, matching the schema of the test query that failed
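
As a rough illustration of option 1 (the table and column names below are hypothetical, not part of the proposal), the shared errors table might look something like:

create table if not exists analytics.dbt_test_failures (
    run_started_at   timestamp, -- when the dbt invocation began
    invocation_id    text,      -- ties the row back to a specific dbt run
    test_name        text,      -- which schema or data test failed
    failing_row      text       -- primary key of the failing row, or the row serialized as JSON
);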

The baked-in schema tests will need to be rewritten to provide debuggable information in these tables.

This should probably be an opt-in feature for tests. It could be configured per-test in schema.yml, though it would also be good to be able to configure whole swaths of tests at once.
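
For reference, something in this shape eventually shipped via #3316 as the store_failures config. A sketch of what opt-in could look like at both the project-wide and per-test level (treat the exact keys and the project name as illustrative):

# dbt_project.yml -- opt in whole swaths of tests at once
tests:
  my_project:
    +store_failures: true

# schema.yml -- opt in for a single test
version: 2
models:
  - name: my_model
    columns:
      - name: id
        tests:
          - unique:
              config:
                store_failures: true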

Who will this benefit?

dbt users could use this table to quickly determine why their tests failed. At present, it's surprisingly difficult to debug a failed test.

@alanmcruickshank
Contributor

@drewbanin - I totally get that we wouldn't want to bring back all the rows by default. What would be awesome would be some kind of executable SQL on fail. So the workflow would look like:

  1. Run test
  2. Evaluate Pass/Fail
  3. [optionally] On fail run SQL and spit out the result to logs.

The reason I think this would work nicely is that our analysts are used to looking at the logs of a build to see if something has failed, and having the info right there in the logs would be a massive productivity gain, even if there's a config overhead in writing out explicitly what should happen on failure for each test.
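
A minimal sketch of what such an "on fail, run SQL and log it" helper could look like as a project macro. The macro name and interface are hypothetical; it assumes dbt's run_query and log context functions:

{% macro log_failing_rows(test_query, sample_size=10) %}
  {# Hypothetical helper: given the SQL of a failed test, pull a small
     sample of the failing rows and write them to the dbt logs. #}
  {% if execute %}
    {% set sample = run_query(test_query ~ ' limit ' ~ sample_size) %}
    {% for row in sample.rows %}
      {{ log('FAILING ROW: ' ~ row.values() | join(', '), info=True) }}
    {% endfor %}
  {% endif %}
{% endmacro %}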

@drewbanin
Contributor Author

Totally with you @alanmcruickshank. I think #517 is also related.

I'm into the idea, but need to give some more thought to how dbt would support this. We don't have a good way of configuring groups of tests at the moment, and I imagine setting report_on_failure: true for every schema test would not be ideal.

Very open to ideas if you have them!

@ianterrell

ianterrell commented Apr 1, 2019

I support this use case and would use reporting on failed tests. I am more interested in using them for auditing than for debugging — I would like a nice easy table to query via Looker for a conditional "Go Fix ETL" alert.

When I went looking to do this feature I naturally reached for hooks, as I'm using them already for audit purposes much as described in the documentation.

I'm logging the individual model run:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_runs (model text, state text, time timestamp)"

models:
  pre-hook:
    - "insert into {{ target.schema }}.audit_dbt_runs (model, state, time) values ('{{this.name}}', 'start', getdate())"
  post-hook:
    - "insert into {{ target.schema }}.audit_dbt_runs (model, state, time) values ('{{this.name}}', 'end', getdate())"

As well as the overall results:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"

(I have a simple macro results_values to turn the results into insertable values.)

In this case I would generally like the exact same thing for tests. Something exactly analogous to the on-run-start/on-run-end with the end run context results would suit my needs:

on-test-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_test_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-test-end:
    - "insert into {{ target.schema }}.audit_dbt_test_results (node, status, execution_time, time) values {{ results_values(results) }}"

@ironmonkey88

that looks like a great start to a 'dbt_results_mart', where you could have models as a dim, dbt_runs as a dim and fact and errors as an optional fact

@Aylr

Aylr commented May 15, 2019

@ianterrell This is a great idea. Are you able & willing to share your results_values macro?

@ironmonkey88

I'd be interested in getting a look at that macro as well.

@Aylr

Aylr commented May 15, 2019

Here's an easy macro for this.

{% macro unpack_results(results) %}
  {% for res in results -%}
    -- This is useful for debugging
    -- {{ log(res, info=True) }}
    ('{{ invocation_id }}', '{{ run_started_at }}', NOW(), '{{ res.node.unique_id }}', '{{ res.status }}', '{{ res.error if res.error != None else "" }}', '{{ res.skip if res.skip != None else "" }}', '{{ res.fail if res.fail != None else "" }}', {{ res.execution_time }}){{ "," if not loop.last }}
  {% endfor %}
{% endmacro %}
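
For completeness, this macro emits nine values per result, so it would pair with hooks along these lines (the table name and column types are assumptions, mirroring the on-run-start / on-run-end pattern shared earlier in the thread):

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_run_results (invocation_id text, run_started_at timestamp, inserted_at timestamp, node text, status text, error text, skip text, fail text, execution_time decimal(10,2))"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_run_results values {{ unpack_results(results) }}"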

@ianterrell

Hi @Aylr and @gordonhwong!

In case it's still useful for anyone, the following macro is what I use:

{% macro results_values(results) %}
  {% for res in results -%}
    {% if loop.index > 1 %},{% endif %}
    ('{{ res.node.alias }}', '{{ res.status }}', {{ res.execution_time }}, getdate())
  {% endfor %}
{% endmacro %}

This pairs with:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"

@bashyroger

that looks like a great start to a 'dbt_results_mart', where you could have models as a dim, dbt_runs as a dim and fact and errors as an optional fact

This is exactly what I am looking for!
IMO a run / test / error mart is a prerequisite for any mature data pipeline, especially for tracking run duration and data quality over time. Such a model can then be used as a source for firing alerts when data quality starts to deviate from an average, a run's duration deviates from an average, etc.

Of course one could do this by bringing in other tools that help with this (like https://docs.greatexpectations.io, a tool I have seen being suggested a few times).
But imo this should be part of the CORE of dbt; for me it falls under the alerting and logging features displayed here: https://www.getdbt.com/product/

@drewbanin drewbanin added the enhancement New feature or request label Jun 12, 2020
@jtcohen6 jtcohen6 self-assigned this Jun 12, 2020
@jtcohen6
Contributor

Hey @bashyroger, thanks for bringing this topic back to the fore. I just opened a new issue to sketch out our latest thinking about how dbt should store tests in the database: #2593

@rajib76

rajib76 commented Mar 14, 2021

Hi @Aylr and @gordonhwong!

In case it's still useful for anyone, the following macro is what I use:

{% macro results_values(results) %}
  {% for res in results -%}
    {% if loop.index > 1 %},{% endif %}
    ('{{ res.node.alias }}', '{{ res.status }}', {{ res.execution_time }}, getdate())
  {% endfor %}
{% endmacro %}

This pairs with:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"

I noticed that I need to add a commit statement in the 'on-run-end' hook as well to persist the record. Am I doing something wrong?

@ponsafg

ponsafg commented Aug 5, 2021

Hi @Aylr and @gordonhwong!

In case it's still useful for anyone, the following macro is what I use:

{% macro results_values(results) %}
  {% for res in results -%}
    {% if loop.index > 1 %},{% endif %}
    ('{{ res.node.alias }}', '{{ res.status }}', {{ res.execution_time }}, getdate())
  {% endfor %}
{% endmacro %}

This pairs with:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"

Hi @ianterrell,
Thanks for sharing the code. I need to enhance this macro. Is there a way to execute it by itself? I can call it from on-run-end, but I am unable to execute this macro using dbt run-operation.

Thanks,
Pons

@ponsafg

ponsafg commented Aug 5, 2021

Looks like I can only execute it using an on-run-end hook. Below is what I got from the dbt documentation:
"The results variable contains a list of Result objects with one element per resource that executed in the dbt job. The Result object provides access within the Jinja on-run-end context to the information that will populate the run results JSON artifact."

@trevor-petach

@jtcohen6 How would you feel about re-opening this issue? I greatly appreciate the work you did in #3316, but I don't think it satisfies the request in this ticket - namely, an audit log of test results.

In particular, the limitations of #3316 are:

  1. Not immediately apparent whether a test was recently run (have to look at table creation timestamps), and tables for tests that have been removed from the project continue to persist.
  2. Not immediately apparent whether a test passed on its most recent run (have to count rows in the table).
  3. Not immediately apparent which tests correspond to which tables since tests with long names are truncated with a hash that is not easy to map back to the original test name.
  4. Tables are overwritten every time a test is run, so there is no historical record of test results.
  5. The compiled test queries themselves are not saved.
  6. No record of which user ran the test.

Options that I see to generate an audit log:

  1. Enable on-run-start and on-run-end hooks for dbt test.
  2. Create on-test-start and on-test-end hooks analogous to on-run-start and on-run-end, and have results object available in on-test-end context.
  3. Enable pre-hooks and post-hooks for tests.
  4. Enable inclusion of local files in macros (via {% include %}) so that run_results.json could be parsed and written to the database in a run-operation command.
  5. Snapshot all tables in the dbt_test__audit schema immediately after dbt test. Does not solve limitations 1 and 3 above.

Option 3 is @ianterrell's idea of using pre and post hooks to generate the test audit log. We do something similar to create a run audit log for some of our clients who use dbt, and it has worked very well.

Aside from the work to do the implementation, are there blockers/reasons not to add pre and post hooks to tests? I would be happy to attempt to do the implementation with some (likely substantial) guidance.

@jtcohen6
Contributor

@trevor-petach Thanks for the thoughtful comment!

In my current view of the world, I see the solution to problems 1-6 being metadata. Just about all of the information you've enumerated is available in artifacts (run_results.json + manifest.json) produced by the dbt test command:

  1. The timestamp when the test was run, and how long it took to run (run results)
  2. The test status and number of failures (run results)
  3. The relation_name of the table where the test stored its failures (manifest)
  4. History of test statuses and number of failures over time, by saving artifacts from past runs. (If you want the exact failing records themselves, snapshotting is not a bad idea! That feels more like data than metadata, so storing it in the database feels right.)
  5. The compiled_sql of the executed test (manifest)
  6. The artifacts won't record which user ran the test (dbt doesn't store any credential information in artifacts), but it's quite possible to store custom environment metadata, e.g. by defining environment variables with the DBT_ENV_CUSTOM_ENV_ prefix

Given that this information is available in JSON artifacts, and in such a rich format, I'd rather encourage you to:

  • Access this information via the dbt Cloud Metadata API (cc @barryaron)
  • Upload those artifacts directly to your data warehouse, just as you suggest! This is possible to do in a dbt macro, at least on Snowflake, via a put statement (a sketch follows below). Alternatively, your orchestration tool may be able to upload those artifacts to external storage (e.g. S3), where they're accessible for querying later on.
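
A minimal sketch of that Snowflake route as a run-operation macro; the macro name and the @artifact_stage internal stage are assumptions, and the stage would need to exist already:

{% macro upload_artifacts() %}
  {# Hypothetical: copy the local artifacts from target/ into a Snowflake internal stage. #}
  {% do run_query("put file://target/run_results.json @artifact_stage auto_compress=true overwrite=true") %}
  {% do run_query("put file://target/manifest.json @artifact_stage auto_compress=true overwrite=true") %}
{% endmacro %}

Invoked with dbt run-operation upload_artifacts after dbt test, the staged JSON can then be loaded into a table and queried like any other source.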

That said, I think you're onto something here:

Enable on-run-start and on-run-end hooks for dbt test.

I agree! I said something to this effect in response to a more recent issue: #3463. I think the basic change, turning on these hooks for the test task, would be quite straightforward, and I'd welcome a community contribution for it.

Since the Results object is available in the on-run-end context, you could push information from that object into an audit table in your database, if you so chose.
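
To make that concrete (purely a sketch; the macro name, table name, and timestamp function are assumptions, and it presumes on-run-end hooks fire for the invocation that ran the tests), a variant of the results_values macro above could keep just the test nodes:

{% macro test_results_values(results) %}
  {# Hypothetical variant of results_values, restricted to test nodes.
     Caveat: if the run contains no tests this renders an empty values list,
     so the insert below would need guarding in real use. #}
  {% for res in results if res.node.resource_type == 'test' -%}
    {% if not loop.first %},{% endif %}
    ('{{ res.node.unique_id }}', '{{ res.status }}', {{ res.execution_time }}, current_timestamp)
  {%- endfor %}
{% endmacro %}

Paired with something like:

on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_test_results (node, status, execution_time, time) values {{ test_results_values(results) }}"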

@joellabes
Contributor

Wanted to share the approach we're taking for visibility of failing tests. I don't think it'd scale to all 1200+ tests, but right now we don't need it to - we're focused on exposing data quality issues to other teams, in the hope that showing them a chart with number of errors going down will encourage them to keep the CRM tidy 🤞

We're using dbt-utils' get_relations_by_pattern and union_relations macros, and then will snapshot the results of that.

{% set relations = dbt_utils.get_relations_by_pattern('%', 'hubspot_revenue_test__%') %}

with tests as (
    {{ dbt_utils.union_relations(relations, exclude=['_fivetran_synced']) }}
)

select 
    initcap(translate(regexp_replace(_dbt_source_relation, '["A-z\.]+hubspot_revenue_test__', ''), '_"', ' '))  as failing_test, --icky, but works to make the error friendlier
    pipeline_label as team, 
    count(*) as num_records
from tests
group by 1, 2 

As much as we'd love to get amongst the artifact goodness, I don't think Redshift has a PUT equivalent, so we're probably dependent on a dbt Cloud native solution for the moment.

@trevor-petach

@jtcohen6 Agreed that artifacts have the needed information - the issue is getting that information somewhere useful (i.e. the database). For advanced users, this is not a problem (a quick script that kicks off after dbt test completes to write run_results.json to the database), but for more basic users, it is a substantial stumbling block. Most people we work with use BigQuery, so they face the same problem as @joellabes does in Redshift - there is no equivalent of PUT. As far as I can tell, the dbt Cloud API suffers from the same problem - I have to run the query and parse a relatively complex result on my infrastructure.

My goal is to be able to build a dashboard showing recent test failures, or to set an alert/notification for test failures (that includes some information about what failed), for a user who just uses dbt Cloud to execute dbt run and dbt test.

One minor issue with the metadata objects is that they are not available until all of the tests execute. If we had pre and post-test hooks, then we could record the results immediately into an audit table.
