Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run tests IFF all first-order parents are selected #2891

Closed
jtcohen6 opened this issue Nov 16, 2020 · 3 comments · Fixed by #3235
Closed

Run tests IFF all first-order parents are selected #2891

jtcohen6 opened this issue Nov 16, 2020 · 3 comments · Fixed by #3235
Labels
1.0.0 Issues related to the 1.0.0 release of dbt dbt tests Issues related to built-in dbt testing functionality enhancement New feature or request node selection Functionality and syntax for selecting DAG nodes
Milestone

Comments

@jtcohen6
Copy link
Contributor

jtcohen6 commented Nov 16, 2020

Describe the feature

See prior discussion: #1827, #2132, misc state:modified conversations

Run a test only if all first-order parents are included in the selection criteria. This would be most immediately relevant for:

  • relationships tests
  • data tests with multiple parents

This would also handle cases where a schema test has a property (tag:, state:) that its one parent model does not have. Since its single parent would not be included in the selection criteria, given the proposed condition, the test should not run.

Describe alternatives you've considered

  • How useful vs. confusing is this idea?
  • Do we change default behavior? Or add a flag (--all-parents) to enable this?

Who will this benefit?

Users who want finer-grained control over which tests they're running, especially:

  • using complex selection syntax
  • running in CI environments
@jtcohen6 jtcohen6 added the enhancement New feature or request label Nov 16, 2020
@jtcohen6 jtcohen6 added this to the Oh-Twenty [TBD] milestone Nov 16, 2020
@jtcohen6 jtcohen6 changed the title Run tests IFF all parents are selected Run tests IFF all first-order parents are selected Nov 17, 2020
@jtcohen6 jtcohen6 added the dbt tests Issues related to built-in dbt testing functionality label Dec 31, 2020
@balmasi
Copy link

balmasi commented Mar 5, 2021

I just ran into this in the following scenario for a slim CI build:

I have models A <--- B where B has a foreign key to A.

I also have a relationship test defined on model B that points to A.

I touch model A, so my slim CI build becomes dbt run -m A, followed by a dbt test -m A.

This fails since dbt selects the relationship test defined on model B because the relationship test depends on both A and B.

dbt is selecting the test if ANY parent is selected, while I was looking for it run the test if ALL parents were selected.

Shouldn't running the test when ALL dependent nodes are selected be the default behaviour?

I would much prefer to have this be the default and have a flag for the current behaviour rather than the other way around.

Are there some use-cases I'm missing that this would completely break?

Workaround
I had to dbt test --exclude test_name:relationships which is not ideal.

Jeremy pointed out that I can use the --defer feature to fallback on prod, but that comes with its own challenges of state management.

@jtcohen6
Copy link
Contributor Author

jtcohen6 commented Mar 10, 2021

@balmasi Thanks for the comment! A couple of thoughts here:

Slim CI: The reason we added support for dbt test --defer in dbt v0.19.0 is exactly for the reason you're describing: It often happens that, in a Slim CI run, you will have built one parent model of a multi-pronged test but not the other(s). The defer functionality gives dbt the ability to "fail over" to the prod version of an upstream model if it's unbuilt/missing in the CI schema.

In my view, relationships tests don't square with CI builds. You're frequently working with a limited subset of data, whether that's a date cutoff or a random sample. The other three builtin dbt tests—unique, not_null, and accepted_values—are no likelier to fail on a subset vs. the complete dataset, but tests on referential integrity are more likely to fail on arbitrary subsets. For that reason, I think dbt test --exclude test_name:relationships actually makes decent sense in a CI run.

Shouldn't running the test when ALL dependent nodes are selected be the default behaviour?

I take your point here, and I'm sympathetic to this argument. This is something we've considered changing in the default behavior (#1827, #2132); I'm still figuring out how we should implement it in practice.

This isn't how selection works for other resource types: If you dbt run -m model_a+, that selects model_c even if model_a is not its only parent. Say model_c depends also on model_b: then dbt run -m model_a+ in a fresh schema would fail on model_c, because model_b is missing. dbt introduced the @ selection operation, so that you could dbt run -m @model_a and know that it would build whatever is needed.

Granted, test selection is different from other resource selection because of the "magic jump" baked into it: when I test model_a, that really means executing the tests that directly depend on model_a. In that sense, dbt test -m model_a is more like selecting -m model_a+1.

What's the appeal of that "magic jump"? It enables consistency of syntax between test and other commands, to enable things like dbt run -m model_a && dbt test -m model_a, or (someday) the same thing written as dbt build -m model_a (see #2743). We could modulate the magic, and make it only select tests if all their parents are all the way present. That still gets us the consistent syntax we want, without any surprises.

In that world, if we had a relationships test between model_a and model_b:

  1. dbt test -m model_a or -m model_b: Not selected. A parent is missing, so the "magic jump" doesn't take effect.
  2. dbt test -m model_a model_b: Yes, selected. All parents are present, the "magic jump" says go for it!
    3 dbt test -m model_a+, -m model_a+1, -m @model_a: Yes, selected. The test node is a child of model_a and directly included in the selection criteria; no "magic jump" necessary.

Today, all of the above select and execute that relationships test. In the first case, the intuitive syntax of the "magic jump" actually backfires, results in the unintuitive case you've encountered.

@drewbanin I'd be curious to get your thoughts here! I started this comment ready to say that we shouldn't do this... in the course of writing it out, I've managed to convince myself that it might just manage to thread the needle the way we'd want.

@smomen
Copy link

smomen commented Mar 26, 2021

+1-ing this. In our use case, we need to do partial dbt runs in production in a separate environment due to sensitive (source) patient data in that environment. But we’d like to share the same models, so we have a monolithic project. So, some models just don’t run in this sensitive environment.

We’d like to follow a blue/green deployment and test before we deploy our data, but the selection of irrelevant relationship tests prevents this. (more on whether this is truly irrelevant below)

So, to recap:

we run dbt run -m source:sensitive+, and then dbt test -m source:sensitive+; the latter fails because it selects relationship tests that involve source:sensitive+, and the ones where the parent is not in source:sensitive+ fail.

I’m torn on whether this is fundamentally problematic or just requires configurability.

If there’s a relationship test, testing model A -(to)-> B, and run dbt test -m B, would I expect the relationship test to run?

  • B certainly partakes in the relationship, so rerunning it might have broken a relationship b/w A (if it exists) and B... so yes?
  • But if A does not exist, I’m not sure I would fail the test - there’s no relationship to test, and that’s not a problem with B/that can be solved by changing B/rerunning dbt run -m B... and, it’s “true” that all foreign key values in A (of which there are none - feels like this is a 0 vs null dichotomy?) exist in B. But certainly would expect dbt test -m A to fail if A is the parent of the relationship.

I don’t think dbt test --exclude test_name:relationship is a viable workaround, as it throws the baby out with the bathwater - there are other legitimate relationship tests I do want to run (e.g. B -(to)-> C in the scenario above).

defering - I’m not very familiar with this, but “falling back” to another version of the model is not possible for us - there is no other version that exists.

The only two workarounds we’ve found:

  • hardcode the tests that should be excluded during these runs (unfortunate, because we now have to maintain dependencies elsewhere/it leaks knowledge about the dependency graph into our test command, creating divergence/likely missed test coverage in the future)
  • or create dummy versions of our missing models during these runs (requiring some hacking up of those models/reduced maintainability).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.0.0 Issues related to the 1.0.0 release of dbt dbt tests Issues related to built-in dbt testing functionality enhancement New feature or request node selection Functionality and syntax for selecting DAG nodes
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants