
Record failing rows for tests into an auditable table #903

Closed
drewbanin opened this issue Aug 6, 2018 · 17 comments · Fixed by #3316

@drewbanin
Contributor

Feature

Feature description

dbt should record rows that fail schema or data tests into tables, to be used for auditing. There are two general ways this could work:

  1. Insert into an errors table with a schema like:
     • run started at
     • invocation id
     • test name
     • failing row primary key / json
  2. Create a table for each test failure in a scratch schema, matching the schema of the test query that failed
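
As a rough illustration of option 1 (the table and column names below are hypothetical, not part of the proposal), the shared errors table might look something like:

create table if not exists analytics.dbt_test_failures (
    run_started_at   timestamp, -- when the dbt invocation began
    invocation_id    text,      -- ties the row back to a specific dbt run
    test_name        text,      -- which schema or data test failed
    failing_row      text       -- primary key of the failing row, or the row serialized as JSON
);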

The baked-in schema tests will need to be rewritten to provide debuggable information in these tables.

This should probably be an opt-in feature for tests. It could be configured per-test in schema.yml, though it would also be good to be able to configure whole swaths of tests at once.
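
For reference, something in this shape eventually shipped via #3316 as the store_failures config. A sketch of what opt-in could look like at both the project-wide and per-test level (treat the exact keys and the project name as illustrative):

# dbt_project.yml -- opt in whole swaths of tests at once
tests:
  my_project:
    +store_failures: true

# schema.yml -- opt in for a single test
version: 2
models:
  - name: my_model
    columns:
      - name: id
        tests:
          - unique:
              config:
                store_failures: true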

Who will this benefit?

dbt users could use this table to quickly determine why their tests failed. At present, it's surprisingly difficult to debug a failed test.

@alanmcruickshank
Contributor

@drewbanin - I totally get that we wouldn't want to bring back all the rows by default. What would be awesome would be some kind of executable SQL on fail. So the workflow would look like:

  1. Run test
  2. Evaluate Pass/Fail
  3. [optionally] On fail run SQL and spit out the result to logs.

The reason I think this would work nicely is that our analysts are used to looking at the logs of a build to see if something has failed, and having the info right there in the logs would be a massive productivity gain, even if there's a config overhead in writing out explicitly what should happen on failure for each test.
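
A minimal sketch of what such an "on fail, run SQL and log it" helper could look like as a project macro. The macro name and interface are hypothetical; it assumes dbt's run_query and log context functions:

{% macro log_failing_rows(test_query, sample_size=10) %}
  {# Hypothetical helper: given the SQL of a failed test, pull a small
     sample of the failing rows and write them to the dbt logs. #}
  {% if execute %}
    {% set sample = run_query(test_query ~ ' limit ' ~ sample_size) %}
    {% for row in sample.rows %}
      {{ log('FAILING ROW: ' ~ row.values() | join(', '), info=True) }}
    {% endfor %}
  {% endif %}
{% endmacro %}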

@drewbanin
Contributor Author

Totally with you @alanmcruickshank. I think #517 is also related.

I'm into the idea, but need to give some more thought to how dbt would support this. We don't have a good way of configuring groups of tests at the moment, and I imagine setting report_on_failure: true for every schema test would not be ideal.

Very open to ideas if you have them!

@ianterrell

ianterrell commented Apr 1, 2019

I support this use case and would use reporting on failed tests. I am more interested in using them for auditing than for debugging — I would like a nice easy table to query via Looker for a conditional "Go Fix ETL" alert.

When I went looking to do this feature I naturally reached for hooks, as I'm using them already for audit purposes much as described in the documentation.

I'm logging the individual model run:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_runs (model text, state text, time timestamp)"

models:
  pre-hook:
    - "insert into {{ target.schema }}.audit_dbt_runs (model, state, time) values ('{{this.name}}', 'start', getdate())"
  post-hook:
    - "insert into {{ target.schema }}.audit_dbt_runs (model, state, time) values ('{{this.name}}', 'end', getdate())"

As well as the overall results:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"

(I have a simple macro results_values to turn the results into insertable values.)

In this case I would generally like the exact same thing for tests. Something exactly analogous to the on-run-start/on-run-end with the end run context results would suit my needs:

on-test-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_test_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-test-end:
    - "insert into {{ target.schema }}.audit_dbt_test_results (node, status, execution_time, time) values {{ results_values(results) }}"

@ironmonkey88

that looks like a great start to a 'dbt_results_mart', where you could have models as a dim, dbt_runs as a dim and fact and errors as an optional fact

@Aylr

Aylr commented May 15, 2019

@ianterrell This is a great idea. Are you able & willing to share your results_values macro?

@ironmonkey88

I'd be interested in getting a look at that macro as well.

@Aylr

Aylr commented May 15, 2019

Here's an easy macro for this.

{% macro unpack_results(results) %}
  {% for res in results -%}
    -- This is useful for debugging
    -- {{ log(res, info=True) }}
    ('{{ invocation_id }}', '{{ run_started_at }}', NOW(), '{{ res.node.unique_id }}', '{{ res.status }}', '{{ res.error if res.error != None else "" }}', '{{ res.skip if res.skip != None else "" }}', '{{ res.fail if res.fail != None else "" }}', {{ res.execution_time }}){{ "," if not loop.last }}
  {% endfor %}
{% endmacro %}
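
For completeness, this macro emits nine values per result, so it would pair with hooks along these lines (the table name and column types are assumptions, mirroring the on-run-start / on-run-end pattern shared earlier in the thread):

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_run_results (invocation_id text, run_started_at timestamp, inserted_at timestamp, node text, status text, error text, skip text, fail text, execution_time decimal(10,2))"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_run_results values {{ unpack_results(results) }}"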

@ianterrell

Hi @Aylr and @gordonhwong!

In case it's still useful for anyone, the following macro is what I use:

{% macro results_values(results) %}
  {% for res in results -%}
    {% if loop.index > 1 %},{% endif %}
    ('{{ res.node.alias }}', '{{ res.status }}', {{ res.execution_time }}, getdate())
  {% endfor %}
{% endmacro %}

This pairs with:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"

@bashyroger

that looks like a great start to a 'dbt_results_mart', where you could have models as a dim, dbt_runs as a dim and fact and errors as an optional fact

This is exactly what I am looking for!
IMO a run / test / error mart is a prerequisite for any mature data pipeline, especially for tracking run duration and data quality over time. Such a model can then be used as a source for firing alerts when data quality starts to deviate from an average, a run's duration deviates from an average, etc.

Of course one could do this by bringing in other tools that help with this (like https://docs.greatexpectations.io, a tool I have seen being suggested a few times).
But imo this should be part of the CORE of dbt; for me it falls under the alerting and logging features displayed here: https://www.getdbt.com/product/

@drewbanin drewbanin added the enhancement New feature or request label Jun 12, 2020
@jtcohen6 jtcohen6 self-assigned this Jun 12, 2020
@jtcohen6
Contributor

Hey @bashyroger, thanks for bringing this topic back to the fore. I just opened a new issue to sketch out our latest thinking about how dbt should store tests in the database: #2593

@rajib76

rajib76 commented Mar 14, 2021

Hi @Aylr and @gordonhwong!

In case it's still useful for anyone, the following macro is what I use:

{% macro results_values(results) %}
  {% for res in results -%}
    {% if loop.index > 1 %},{% endif %}
    ('{{ res.node.alias }}', '{{ res.status }}', {{ res.execution_time }}, getdate())
  {% endfor %}
{% endmacro %}

This pairs with:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"

I noticed that I need to add a commit statement in the 'on-run-end' hook as well to persist the record. Am I doing something wrong?

@ponsafg

ponsafg commented Aug 5, 2021

Hi @Aylr and @gordonhwong!

In case it's still useful for anyone, the following macro is what I use:

{% macro results_values(results) %}
  {% for res in results -%}
    {% if loop.index > 1 %},{% endif %}
    ('{{ res.node.alias }}', '{{ res.status }}', {{ res.execution_time }}, getdate())
  {% endfor %}
{% endmacro %}

This pairs with:

on-run-start:
    - "create table if not exists {{ target.schema }}.audit_dbt_results (node text, status text, execution_time decimal(10,2), time timestamp)"
on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_results (node, status, execution_time, time) values {{ results_values(results) }}"

Hi @ianterrell,
Thanks for sharing the code. I need to enhance this macro. Is there a way to execute it by itself? I can call it from on-run-end, but I am unable to execute this macro using dbt run-operation.

Thanks,
Pons

@ponsafg

ponsafg commented Aug 5, 2021

Looks like I can only execute it using an on-run-end hook. Below is what I got from the dbt documentation:
"The results variable contains a list of Result objects with one element per resource that executed in the dbt job. The Result object provides access within the Jinja on-run-end context to the information that will populate the run results JSON artifact."

@trevor-petach

@jtcohen6 How would you feel about re-opening this issue? I greatly appreciate the work you did in #3316, but I don't think it satisfies the request in this ticket - namely, an audit log of test results.

In particular, the limitations of #3316 are:

  1. Not immediately apparent whether a test was recently run (have to look at table creation timestamps), and tables for tests that have been removed from the project continue to persist.
  2. Not immediately apparent whether a test passed on its most recent run (have to count rows in the table).
  3. Not immediately apparent which tests correspond to which tables since tests with long names are truncated with a hash that is not easy to map back to the original test name.
  4. Tables are overwritten every time a test is run, so there is no historical record of test results.
  5. The compiled test queries themselves are not saved.
  6. No record of which user ran the test.

Options that I see to generate an audit log:

  1. Enable on-run-start and on-run-end hooks for dbt test.
  2. Create on-test-start and on-test-end hooks analogous to on-run-start and on-run-end, and have results object available in on-test-end context.
  3. Enable pre-hooks and post-hooks for tests.
  4. Enable inclusion of local files in macros (via {% include %}) so that run_results.json could be parsed and written to the database in a run-operation command.
  5. Snapshot all tables in the dbt_test__audit schema immediately after dbt test. Does not solve limitations 1 and 3 above.

Option 3 is @ianterrell's idea of using pre and post hooks to generate the test audit log. We do something similar to create a run audit log for some of our clients who use dbt, and it has worked very well.

Aside from the work to do the implementation, are there blockers/reasons not to add pre and post hooks to tests? I would be happy to attempt to do the implementation with some (likely substantial) guidance.

@jtcohen6
Contributor

@trevor-petach Thanks for the thoughtful comment!

In my current view of the world, I see the solution to problems 1-6 being metadata. Just about all of the information you've enumerated is available in artifacts (run_results.json + manifest.json) produced by the dbt test command:

  1. The timestamp when the test was run, and how long it took to run (run results)
  2. The test status and number of failures (run results)
  3. The relation_name of the table where the test stored its failures (manifest)
  4. History of test statuses and number of failures over time, by saving artifacts from past runs. (If you want the exact failing records themselves, snapshotting is not a bad idea! That feels more like data than metadata, so storing it in the database feels right.)
  5. The compiled_sql of the executed test (manifest)
  6. The artifacts won't record which user ran the test (dbt doesn't store any credential information in artifacts), but it's quite possible to store custom environment metadata, e.g. by defining environment variables with the DBT_ENV_CUSTOM_ENV_ prefix

Given that this information is available in JSON artifacts, and in such a rich format, I'd rather encourage you to:

  • Access this information via the dbt Cloud Metadata API (cc @barryaron)
  • Upload those artifacts directly to your data warehouse, just as you suggest! This is possible to do in a dbt macro, at least on Snowflake, via a put statement (a sketch follows below). Alternatively, your orchestration tool may be able to upload those artifacts to external storage (e.g. S3), where they're accessible for querying later on.
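
A minimal sketch of that Snowflake route as a run-operation macro; the macro name and the @artifact_stage internal stage are assumptions, and the stage would need to exist already:

{% macro upload_artifacts() %}
  {# Hypothetical: copy the local artifacts from target/ into a Snowflake internal stage. #}
  {% do run_query("put file://target/run_results.json @artifact_stage auto_compress=true overwrite=true") %}
  {% do run_query("put file://target/manifest.json @artifact_stage auto_compress=true overwrite=true") %}
{% endmacro %}

Invoked with dbt run-operation upload_artifacts after dbt test, the staged JSON can then be loaded into a table and queried like any other source.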

That said, I think you're onto something here:

Enable on-run-start and on-run-end hooks for dbt test.

I agree! I said something to this effect in response to a more recent issue: #3463. I think the basic change, turning on these hooks for the test task, would be quite straightforward, and I'd welcome a community contribution for it.

Since the Results object is available in the on-run-end context, you could push information from that object into an audit table in your database, if you so chose.
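
To make that concrete (purely a sketch; the macro name, table name, and timestamp function are assumptions, and it presumes on-run-end hooks fire for the invocation that ran the tests), a variant of the results_values macro above could keep just the test nodes:

{% macro test_results_values(results) %}
  {# Hypothetical variant of results_values, restricted to test nodes.
     Caveat: if the run contains no tests this renders an empty values list,
     so the insert below would need guarding in real use. #}
  {% for res in results if res.node.resource_type == 'test' -%}
    {% if not loop.first %},{% endif %}
    ('{{ res.node.unique_id }}', '{{ res.status }}', {{ res.execution_time }}, current_timestamp)
  {%- endfor %}
{% endmacro %}

Paired with something like:

on-run-end:
    - "insert into {{ target.schema }}.audit_dbt_test_results (node, status, execution_time, time) values {{ test_results_values(results) }}"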

@joellabes
Contributor

Wanted to share the approach we're taking for visibility of failing tests. I don't think it'd scale to all 1200+ tests, but right now we don't need it to - we're focused on exposing data quality issues to other teams, in the hope that showing them a chart with number of errors going down will encourage them to keep the CRM tidy 🤞

We're using dbt-utils' get_relations_by_pattern and union_relations macros, and then will snapshot the results of that.

{% set relations = dbt_utils.get_relations_by_pattern('%', 'hubspot_revenue_test__%') %}

with tests as (
    {{ dbt_utils.union_relations(relations, exclude=['_fivetran_synced']) }}
)

select 
    initcap(translate(regexp_replace(_dbt_source_relation, '["A-z\.]+hubspot_revenue_test__', ''), '_"', ' '))  as failing_test, --icky, but works to make the error friendlier
    pipeline_label as team, 
    count(*) as num_records
from tests
group by 1, 2 

As much as we'd love to get amongst the artifact goodness, I don't think Redshift has a PUT equivalent, so we're probably dependent on a dbt Cloud native solution for the moment.

@trevor-petach

@jtcohen6 Agreed that artifacts have the needed information - the issue is getting that information somewhere useful (i.e. the database). For advanced users, this is not a problem (a quick script that kicks off after dbt test completes to write run_results.json to the database), but for more basic users, it is a substantial stumbling block. Most people we work with use BigQuery, so they face the same problem as @joellabes does in Redshift - there is no equivalent of PUT. As far as I can tell, the dbt Cloud API suffers from the same problem - I have to run the query and parse a relatively complex result on my infrastructure.

My goal is to be able to build a dashboard showing recent test failures, or to set an alert/notification for test failures (that includes some information about what failed), for a user who just uses dbt Cloud to execute dbt run and dbt test.

One minor issue with the metadata objects is that they are not available until all of the tests execute. If we had pre and post-test hooks, then we could record the results immediately into an audit table.
