Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect compilation error thrown for snapshots using check strategy and missing check_cols parameter #2787

Closed
1 of 5 tasks
kingfink opened this issue Sep 24, 2020 · 7 comments · Fixed by #2791
Closed
1 of 5 tasks
Labels
bug Something isn't working

Comments

@kingfink
Copy link
Contributor

kingfink commented Sep 24, 2020

Describe the bug

If the check_cols parameter is missing when using the check strategy for a snapshot, a misleading compilation error is thrown:

Compilation Error in snapshot snapshot_table_name (snapshots/snapshot_table_name.sql) at path []: 'updated_at' is a required property

Steps To Reproduce

Create a snapshot with the following contents:

{% snapshot snapshot_test %}

{{
  config(
    target_database='analytics',
    target_schema='analytics_snapshots',
    unique_key='id',
    strategy='check'
  )
}}

WITH
  temp AS (
    SELECT 1 AS col
  )

SELECT
  *
FROM temp

{% endsnapshot %}

Issue the following command:

$ dbt snapshot
Running with dbt=0.18.0
Encountered an error:
Compilation Error in snapshot snapshot_test (snapshots/snapshot_test.sql)
  at path []: 'updated_at' is a required property

Expected behavior

I'd expect the compilation error to be specific to the strategy used, i.e. when using check:

$ dbt snapshot
Running with dbt=0.18.0
Encountered an error:
Compilation Error in snapshot snapshot_test (snapshots/snapshot_test.sql)
  at path []: 'check_cols' is a required property when 'check' strategy is used

Screenshots and log output

See above

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.18.0
   latest version: 0.18.0

Up to date!

Plugins:
  - bigquery: 0.18.0
  - snowflake: 0.18.0
  - redshift: 0.18.0
  - postgres: 0.18.0

The operating system you're using:

The output of python --version:

Python 3.7.8

Additional context

I'd love to take a stab at fixing this issue if someone can point me in the right direction 🙂

@kingfink kingfink added bug Something isn't working triage labels Sep 24, 2020
@jtcohen6 jtcohen6 removed the triage label Sep 24, 2020
@jtcohen6
Copy link
Contributor

Good call, @kingfink!

I think this gets into some of the trickier code in dbt, which is where we parse files and construct type-secure dbt objects. I believe the relevant lines are here, defining TimestampSnapshotConfig and TimestampSnapshotConfig:
https:/fishtown-analytics/dbt/blob/873d76d72ce17446900e2e58593bfeaff096eb2a/core/dbt/contracts/graph/model_config.py#L569-L615

I'm thinking the issue may be that, since a snapshot config missing both updated_at and check_cols fails to meet both contracts, dbt returns the first error rather than the most relevant one. If there's a way to to read the strategy and check against only the corresponding snapshot config, I agree that'd be an improvement for user clarity.

@kingfink
Copy link
Contributor Author

Thanks for the context @jtcohen6 !

Are there any other examples you know of (either in the dbt codebase or somewhere else) where I can see this kind of branching logic in action?

read the strategy and check against only the corresponding snapshot config

I'm guessing trying to determine the relevance similar to this function is not what we want to do here?

https:/fishtown-analytics/dbt/blob/873d76d72ce17446900e2e58593bfeaff096eb2a/core/dbt/contracts/graph/model_config.py#L463-L472

@jtcohen6
Copy link
Contributor

_relevance_without_strategy sure feels like a promising start...

@drewbanin Any chance you could give a more specific pointer here?

@drewbanin
Copy link
Contributor

So - Snapshot config validation is kind of tricky! A snapshot can either be:

  • configured with the check strategy and should be validated as such
  • configured with the timestamp strategy and should be validated as such

A snapshot config can also be configured with a user-defined strategy, which doesn't seem to happen too often in practice. I think that is where some of the misdirection in this part of the codebase comes from. When we see a non-compliant snapshot config, it's not obvious if it's errant or if it's just a configuration for a user-configured strategy.

Rather than hacking this _relevance_without_strategy method, I'd probably just add some logic around here which explicitly checks the data.config.strategy value and uses a specific validator for the CheckSnapshotConfig or the TimestampSnapshotConfig class as needed.

https:/fishtown-analytics/dbt/blob/873d76d72ce17446900e2e58593bfeaff096eb2a/core/dbt/contracts/graph/model_config.py#L476-L491

I got some sample code working locally if it's helpful @kingfink otherwise this should be a good starting point!

@kingfink
Copy link
Contributor Author

OK I took a stab at it below, is this along the lines of what you're thinking?

@dataclass
class SnapshotWrapper(JsonSchemaMixin):
    """This is a little wrapper to let us serialize/deserialize the
    SnapshotVariants union.
    """
    config: SnapshotVariants  # mypy: ignore

    @classmethod
    def validate(cls, data: Any):
        if data['config'].get('strategy') == 'check':
            schema = _validate_schema(CheckSnapshotConfig)
        elif data['config'].get('strategy') == 'timestamp':
            schema = _validate_schema(TimestampSnapshotConfig)
        else:
            schema = _validate_schema(cls)

        validator = jsonschema.Draft7Validator(schema)

        error = jsonschema.exceptions.best_match(
            validator.iter_errors(data['config'] if data['config'].get('strategy') in ['check', 'timestamp'] else data),
            key=_relevance_without_strategy,
        )

        if error is not None:
            raise ValidationError.create_from(error) from error

Thank you so much for helping me out with this @drewbanin !

@drewbanin
Copy link
Contributor

@kingfink no joke that is actually almost exactly the diff that I made locally while testing this one :)

I have some minor thoughts about this, but they're totally the kind of thing we could iterate on in the PR. At a high-level,

  • we should use data.get('config', {}).get('strategy') to make sure that if data is malformed, we don't raise an unhelpful KeyError (this is validation code, after all!)
  • it probably makes more sense to assign a different value for data in each of these branches, eg:
    def validate(cls, data: Any):
        config = data.get('config', {})
        if config.get('strategy') == 'check':
            schema = _validate_schema(CheckSnapshotConfig)
            to_validate = config
        ....
        error = jsonschema.exceptions.best_match(
            validator.iter_errors(to_validate),
            key=_relevance_without_strategy,
        )

Just my 2 cents though - happy to litigate them in a PR if you're interested in contributing a fix for this issue :D

@kingfink
Copy link
Contributor Author

@drewbanin agree with you :)

Just opened a PR (and submitted the CLA...not sure if that resolves itself)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants