
Simplify design of failing test notification system #595

Open
jeancochrane opened this issue Sep 11, 2024 · 1 comment
Labels
dbt Related to dbt (tests, docs, schema, etc)

Comments

@jeancochrane
Contributor

jeancochrane commented Sep 11, 2024

Background

#579 introduces email notifications for failing tests, with a system to detect the state of test failures and only notify stakeholders for newly failing rows. This system works, but it also introduces some complexity that might make the system difficult to scale, namely:

  • Configuring a new test for notifications requires adding a new test-specific seed representing the universe of known failures
  • Any generics used by tests that are configured as part of the system must be updated to accept an anti_join argument with a complicated schema
  • If other teams fix rows that are part of a test's known universe of failures, a Data team member will have to manually intervene to update its seed, or else the system will fail to alert if those rows ever start to fail again
  • For tests that are part of the notification system, the failure data saved in qc.test_run_failing_row will no longer record rows that are part of the known universe of failing rows, so our analytics dashboards will not capture any work that we do to reduce the known universe of failing rows
  • Adding a new SNS topic requires updating the big chunk of YAML that we pass to the --vars flag when running run_iasworld_data_tests in the test-dbt-models workflow

If the system is successful and the Data team ends up being responsible for configuring lots of tests to use it, these issues will make it much harder to maintain the system over time.

Proposed refactor

Here is a sketch of a new iteration of the notification system that will be easier to scale:

  • Instead of a seed for every test, the known universe of test failures should be stored in one Parquet dataset that lives remotely on S3. This dataset should have a name like qc.test_known_failure and should be defined as a source in the dbt DAG.
    • The dataset should be partitioned by tablename, test name, and snapshot date (or version), and each individual Parquet file should record the set of primary keys representing the universe of known failures for each test at a moment in time.
    • A script like dbt/scripts/update_test_known_failures.py should expose an interface for managing this state file, including:
      • Managing different state files depending on the dbt environment (prod or CI)
      • Copying prod state files to CI environments
      • Adding a new snapshot of data for a specific test on the current date
      • Excluding specific rows from a snapshot by primary key
  • The dbt/scripts/run_iasworld_data_tests.py script should check whether each failing test has a universe of known failures in qc.test_known_failure, and use the most recent snapshot to exclude known failures from notifications.
  • The dbt/scripts/run_iasworld_data_tests.py script should check whether the state of each failing test has changed from the last snapshot, and should determine whether to update the snapshot using the same logic as the update_test_known_failures script. In particular, the script needs to handle the following cases:
|                       | No new failures     | New failures                             |
| --------------------- | ------------------- | ---------------------------------------- |
| No old failures fixed | Do nothing          | Do nothing                               |
| Old failures fixed    | Create new snapshot | Create new snapshot without new failures |
  • The test-dbt-models workflow should automatically search for SNS topic overrides based on variable names when running the run_iasworld_data_tests script (see here for original suggestion)
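The snapshot-update logic in the matrix above can be sketched in Python. This is a minimal sketch under assumed names (`plan_snapshot_update` is hypothetical, and row identity is reduced to a set of primary-key strings); the real script would read and write Parquet snapshots in `qc.test_known_failure` rather than operate on in-memory sets:

```python
def plan_snapshot_update(known_failures: set, current_failures: set):
    """Decide whether to write a new snapshot of known failures.

    Implements the decision matrix above:
      - No old failures fixed, no new failures: do nothing.
      - No old failures fixed, new failures:    do nothing (new failures are
        surfaced via notifications, not added to the known universe).
      - Old failures fixed, no new failures:    snapshot the remaining failures.
      - Old failures fixed, new failures:       snapshot the remaining failures,
        excluding the new ones.

    Returns the set of primary keys to write as the new snapshot, or None
    if no new snapshot is needed.
    """
    fixed = known_failures - current_failures
    if not fixed:
        # "No old failures fixed" row of the matrix: do nothing either way
        return None
    # Keep only known failures that are still failing; new failures
    # (current - known) are deliberately left out of the snapshot
    return known_failures & current_failures


# Example: two old failures fixed, one new failure appears
known = {"pin1", "pin2", "pin3"}
current = {"pin3", "pin4"}
assert plan_snapshot_update(known, current) == {"pin3"}
```

Representing each case as pure set arithmetic keeps the same logic reusable by both `run_iasworld_data_tests.py` and the proposed `update_test_known_failures.py`, per the requirement that they share update behavior.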

Open questions

  • Should rows in qc.test_known_failure be unique by snapshot date, or snapshot version?
    • Snapshot date would be more intuitive for human readers, but precludes the possibility of multiple snapshots on a single date
    • If we go with version, we should probably preserve the date as well, to make the versions more understandable
  • Should the run_iasworld_data_tests script compare failing tests to the most recent snapshot in qc.test_known_failure, or should tests have to be configured with the snapshot date/version that corresponds to their universe of known failures?
    • Using the most recent snapshot will simplify the design, but reduce configurability
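If the "most recent snapshot" option wins, resolving it could be as simple as a lexicographic max over partition keys. A minimal sketch, assuming a hypothetical Hive-style S3 layout (`tablename=` / `test_name=` / `snapshot_date=` path components, which is not yet decided above):

```python
def latest_snapshot_date(keys, tablename, test_name):
    """Given a list of S3 keys under an assumed Hive-style layout like

        qc/test_known_failure/tablename=<t>/test_name=<n>/snapshot_date=<d>/part-0.parquet

    return the most recent snapshot date for one (tablename, test_name)
    pair, or None if the test has no known universe of failures."""
    prefix = f"qc/test_known_failure/tablename={tablename}/test_name={test_name}/"
    dates = set()
    for key in keys:
        if key.startswith(prefix):
            # Pull the snapshot_date=... path component out of the key
            for part in key.split("/"):
                if part.startswith("snapshot_date="):
                    dates.add(part.removeprefix("snapshot_date="))
    # ISO dates (YYYY-MM-DD) sort lexicographically, so max() is the latest
    return max(dates) if dates else None
```

Note that if rows end up unique by version rather than date, the same approach works as long as versions are zero-padded so they also sort lexicographically.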

Next steps

Since this refactor will require substantial engineering time, we should only undertake it if and when we determine that the notification system is capable of gaining traction with stakeholders. Until then, we should focus on getting traction and validating the usefulness of the system.

@jeancochrane jeancochrane added the dbt Related to dbt (tests, docs, schema, etc) label Sep 11, 2024
@jeancochrane
Contributor Author

I did a little bit of research today to determine if we could leverage dbt snapshots for the snapshotting system; my determination is that we currently can't, because they can't handle the "Old failures fixed / new failures" condition in the snapshot matrix listed above.

More specifically, we could theoretically write a check snapshot for each test using the test's generic macro to select failures. This would share one problem with the seed-based solution, namely the requirement to create a config file (in this case a snapshot rather than a seed) for each test, but it would fix the problem that the seed has with requiring manual management every time the universe of known failures changes. Then, to update the snapshots we could just run dbt snapshot with a --select clause to select any tests that need an updated snapshot. However, this solution can't handle the behavior "Create new snapshot without new failures," meaning we can't handle the "Old failures fixed / new failures" condition in the snapshot matrix.

It's possible that the updates to snapshot configs in dbt Core 1.9 and 1.10 will make this easier, but I doubt it. Still, I'll be keeping an eye out in case that changes.
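To make the limitation concrete, here is a toy Python simulation (an assumption about the relevant behavior, not dbt's actual implementation) of why a check-strategy snapshot over a failing-rows query can't exclude newly failing rows:

```python
def check_strategy_snapshot(old_rows, current_rows):
    """Toy simulation (assumption) of a dbt `check` snapshot over the set of
    failing row keys: the snapshot tracks whatever the query currently
    returns, so newly failing rows are always captured."""
    return set(current_rows)


# The refactor needs "Create new snapshot without new failures":
known = {"pin1", "pin2"}
current = {"pin2", "pin3"}  # pin1 fixed, pin3 newly failing
wanted = known & current    # {"pin2"}: keep still-failing rows, drop pin3
got = check_strategy_snapshot(known, current)  # {"pin2", "pin3"}
assert got != wanted  # the check strategy can't express the wanted snapshot
```

In other words, the check strategy has no hook for filtering which changed rows enter the new snapshot, which is exactly the "Old failures fixed / new failures" cell of the matrix.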
