[CT-2268] [Bug] dbt-core >= 1.4.2 manifests not passing v8 schema validation #7119

Closed
2 tasks done
dlawin opened this issue Mar 3, 2023 · 12 comments
Labels
artifacts bug Something isn't working

Comments

@dlawin

dlawin commented Mar 3, 2023

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

I'm noticing that manifests generated by dbt-core versions 1.4.2, 1.4.3, and 1.4.4 are not passing json schema validation based on the schema here:

Expected Behavior

I would expect this validation to pass or for a v9 manifest to be available.

Steps To Reproduce

  1. Generate manifest.json for one of the mentioned versions (I used jaffle_shop)
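  2. Validate it against the hosted v8 schema:
import json

import requests
from jsonschema import validate

manifest_path = "/Users/dan/Desktop/manifest.json"

# Fetch the published v8 manifest schema.
r = requests.get(url="https://schemas.getdbt.com/dbt/manifest/v8.json")
r.raise_for_status()
schema = json.loads(r.content)

# manifest.json is plain JSON, so json.load is enough to read it.
with open(manifest_path, "r", encoding="utf-8") as fp:
    manifest_dict = json.load(fp)

# Raises jsonschema.exceptions.ValidationError if the manifest does not conform.
validate(instance=manifest_dict, schema=schema)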

Relevant log output

'seed' is not one of ['analysis']

Failed validating 'enum' in schema[0]['properties']['resource_type']:
    {'enum': ['analysis'], 'type': 'string'}

On instance['resource_type']:
    'seed'

Environment

- OS: OSX 13.2
- Python 3.10.9
- dbt-core: 1.4.2, 1.4.3, 1.4.4

Which database adapter are you using with dbt?

other (mention it in "Additional Context")

Additional Context

Noticed this with any adapter I tested: snowflake, redshift, postgres, databricks

@dlawin dlawin added bug Something isn't working triage labels Mar 3, 2023
@github-actions github-actions bot changed the title [Bug] dbt-core >= 1.4.2 manifests not passing v8 schema validation [CT-2268] [Bug] dbt-core >= 1.4.2 manifests not passing v8 schema validation Mar 3, 2023
@jtcohen6
Contributor

jtcohen6 commented Mar 3, 2023

@dlawin Thanks for opening!

There are some known problems with the library (hologram) we're currently using to auto-generate JSONSchemas from our Python dataclasses. This was a previous bug report that sounds similar:

More discussion:

I don't think there's a quick fix here ... even though it's very quick to encounter this bug as soon as you try to use the auto-generated JSONSchemas for actual validation :(

@dlawin
Author

dlawin commented Mar 3, 2023

For context, I contribute to a couple open source tools that utilize these artifacts and their json schemas:
https://github.com/yu-iskw/dbt-artifacts-parser

Subsequently using that in:
https://github.com/datafold/data-diff

For now I think I will need to limit to a version < 1.4.2

@gshank
Contributor

gshank commented Mar 3, 2023

I suspect that the problem is not with the manifest.json or even with the jsonschema, but with the fact that the jsonschema validate function cannot distinguish between types of nodes and is validating using the wrong part of the schema. In our Python code we have to explicitly load serialized nodes by resource_type, or we end up with incorrectly instantiated nodes. Jsonschema validate with only dictionary input is probably not capable of using the right part of the schema.

I looked at the local copies of the generated schemas, and they are correct for resource_type. From the lines included above, it looks like jsonschema is validating a seed node using the analysis node schema.

What is the goal of doing the jsonschema validate? Perhaps there's something else that can serve the same purpose.
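To illustrate what "explicitly load serialized nodes by resource_type" can look like, here is a minimal sketch. It is not dbt-core's actual loading code: the classes, the `NODE_CLASSES` registry, and `load_node` are hypothetical names made up for this example, and real nodes carry many more fields.

```python
from dataclasses import dataclass
from typing import Any, Dict, Type


@dataclass
class BaseNode:
    unique_id: str
    resource_type: str


@dataclass
class SeedNode(BaseNode):
    pass


@dataclass
class AnalysisNode(BaseNode):
    pass


# Hypothetical registry: one class per resource_type value.
NODE_CLASSES: Dict[str, Type[BaseNode]] = {
    "seed": SeedNode,
    "analysis": AnalysisNode,
}


def load_node(raw: Dict[str, Any]) -> BaseNode:
    """Instantiate the class registered for raw["resource_type"]."""
    resource_type = raw["resource_type"]
    try:
        cls = NODE_CLASSES[resource_type]
    except KeyError:
        raise ValueError(f"unsupported resource_type: {resource_type!r}") from None
    return cls(unique_id=raw["unique_id"], resource_type=resource_type)


node = load_node({"unique_id": "seed.jaffle_shop.raw_customers", "resource_type": "seed"})
print(type(node).__name__)  # SeedNode
```

Without the lookup on `resource_type`, a loader that simply tried the first class would build an `AnalysisNode` from seed data, which mirrors the mismatch in the log output above.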

@gshank
Contributor

gshank commented Mar 3, 2023

I think that the jsonschemas serve more as documentation of what to expect in the manifest.json. I don't think that it can be usefully used to validate the whole manifest. If you want to go through and validate the individual nodes and call out the correct nodes by resource_type, that might possibly work.
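A rough sketch of the per-node idea, assuming the schema keeps one entry per node type under `definitions` and that each entry restricts `resource_type` with an `enum` (the shape the log output above suggests). `TOY_SCHEMA` is a made-up stand-in for the real v8 schema, and the helper names are hypothetical.

```python
from typing import Any, Dict, List

# Toy stand-in for a manifest schema; only the shape matters here.
TOY_SCHEMA: Dict[str, Any] = {
    "definitions": {
        "AnalysisNode": {
            "type": "object",
            "properties": {"resource_type": {"type": "string", "enum": ["analysis"]}},
        },
        "SeedNode": {
            "type": "object",
            "properties": {"resource_type": {"type": "string", "enum": ["seed"]}},
        },
    }
}


def candidate_definitions(schema: Dict[str, Any], resource_type: str) -> List[str]:
    """Names of the definitions whose resource_type enum allows this value."""
    matches = []
    for name, definition in schema.get("definitions", {}).items():
        enum = definition.get("properties", {}).get("resource_type", {}).get("enum", [])
        if resource_type in enum:
            matches.append(name)
    return matches


def schema_for_definition(schema: Dict[str, Any], name: str) -> Dict[str, Any]:
    """Wrap one definition as a standalone schema; keep definitions so $ref resolves."""
    return {"$ref": f"#/definitions/{name}", "definitions": schema["definitions"]}


names = candidate_definitions(TOY_SCHEMA, "seed")
print(names)  # ['SeedNode']
node_schema = schema_for_definition(TOY_SCHEMA, names[0])
# With jsonschema installed, each node could then be checked with:
#     validate(instance=node, schema=node_schema)
```

The function returns a list rather than a single name because more than one definition may allow the same `resource_type`; in that case each candidate would have to be tried in turn.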

@jtcohen6 jtcohen6 removed the triage label Mar 3, 2023
@dlawin
Author

dlawin commented Mar 6, 2023

> What is the goal of doing the jsonschema validate? Perhaps there's something else that can serve the same purpose.

@gshank

Validation is not actually the goal. The jsonschema is used to parse the manifest, run results, and sources files into objects here (it uses https://koxudaxi.github.io/datamodel-code-generator/ to generate the object code).

If the jsonschema is inaccurate, in that it doesn't validate the actual manifest files, then the objects created from it are also inaccurate representations.

https://github.com/yu-iskw/dbt-artifacts-parser

@dlawin
Author

dlawin commented Mar 6, 2023

> I think that the jsonschemas serve more as documentation of what to expect in the manifest.json. I don't think that it can be usefully used to validate the whole manifest. If you want to go through and validate the individual nodes and call out the correct nodes by resource_type, that might possibly work.

To that end, I could probably update the generated classes here so that they handle the seeds: https://github.com/yu-iskw/dbt-artifacts-parser/blob/main/dbt_artifacts_parser/parsers/manifest/manifest_v8.py

@harshach

harshach commented Mar 8, 2023

@gshank, how can a client depend on any schema if the resulting file is not expected to validate against it? Are there other schemas we can rely on to parse manifest and catalog files?
What options do we have for understanding which elements are required vs. optional?
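On the required vs. optional question: in JSON Schema, an object schema lists its mandatory properties under the `required` keyword, and anything declared only under `properties` is optional. A small sketch of reading that split out of one definition; `TOY_DEFINITION` is invented for illustration and is not copied from the dbt schema, and how faithfully the generated schemas mark `required` is a separate question.

```python
from typing import Any, Dict, List, Tuple


def split_required(definition: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Return (required, optional) property names of one object schema."""
    required = set(definition.get("required", []))
    properties = definition.get("properties", {})
    req = sorted(name for name in properties if name in required)
    opt = sorted(name for name in properties if name not in required)
    return req, opt


# Invented example definition; field names are for illustration only.
TOY_DEFINITION: Dict[str, Any] = {
    "type": "object",
    "required": ["unique_id", "resource_type"],
    "properties": {
        "unique_id": {"type": "string"},
        "resource_type": {"type": "string"},
        "description": {"type": "string", "default": ""},
    },
}

required, optional = split_required(TOY_DEFINITION)
print(required)  # ['resource_type', 'unique_id']
print(optional)  # ['description']
```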

@yu-iskw
Contributor

yu-iskw commented Mar 9, 2023

If we use the latest v8.json in the dbt-core repository, the subsequent code works locally. The manifest.json was created from the jaffle shop project with dbt 1.4.3. I didn't see the error which @dlawin reported. Although I still need to dive into it, using the latest JSON schema can be a quick solution.

import json

import requests
from jsonschema import validate

manifest_path = "/Users/yu/local/src/github/jaffle_shop/target/manifest.json"

# Fetch the v8 manifest schema from the dbt-core repository instead of the hosted copy.
r = requests.get(url="https://raw.githubusercontent.com/dbt-labs/dbt-core/main/schemas/dbt/manifest/v8.json")
r.raise_for_status()
schema = json.loads(r.content)

# manifest.json is plain JSON, so json.load is enough to read it.
with open(manifest_path, "r", encoding="utf-8") as fp:
    manifest_dict = json.load(fp)

# Raises jsonschema.exceptions.ValidationError if the manifest does not conform.
validate(instance=manifest_dict, schema=schema)

@yu-iskw
Contributor

yu-iskw commented Mar 9, 2023

@gshank BTW, can you tell me how you usually generate the JSON schemas, such as those in https://github.com/dbt-labs/dbt-core/tree/main/schemas/dbt/manifest, locally?

@dlawin
Author

dlawin commented Mar 10, 2023

Oh I see, the updated version is not hosted here

e.g. doesn't match https://raw.githubusercontent.com/dbt-labs/dbt-core/main/schemas/dbt/manifest/v8.json (accurate one)
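One way to spot this kind of drift is to compare the two schema documents definition by definition. The sketch below assumes both copies keep their node types under a top-level `definitions` key; `fetch_schema` and `changed_definitions` are hypothetical helper names, and the toy dictionaries stand in for the hosted and repository copies.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen


def fetch_schema(url: str) -> Dict[str, Any]:
    """Download and parse a JSON schema document."""
    with urlopen(url) as response:
        return json.load(response)


def changed_definitions(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    """Names of definitions that were added, removed, or differ between two schemas."""
    old_defs = old.get("definitions", {})
    new_defs = new.get("definitions", {})
    return sorted(
        name
        for name in old_defs.keys() | new_defs.keys()
        if old_defs.get(name) != new_defs.get(name)
    )


# Toy stand-ins; real use would pass the two URLs from this thread to fetch_schema:
#   hosted = fetch_schema("https://schemas.getdbt.com/dbt/manifest/v8.json")
#   repo = fetch_schema("https://raw.githubusercontent.com/dbt-labs/dbt-core/main/schemas/dbt/manifest/v8.json")
hosted = {"definitions": {"Foo": {"type": "object"}, "Bar": {"type": "string"}}}
repo = {"definitions": {"Foo": {"type": "object"}, "Bar": {"type": "integer"}, "Baz": {}}}

print(changed_definitions(hosted, repo))  # ['Bar', 'Baz']
```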

@jtcohen6
Contributor

> Oh I see, the updated version is not hosted here
>
> e.g. doesn't match https://raw.githubusercontent.com/dbt-labs/dbt-core/main/schemas/dbt/manifest/v8.json (accurate one)

I think this one is on me!! dbt-labs/schemas.getdbt.com#19

Just merged :)

@yu-iskw
Contributor

yu-iskw commented Mar 17, 2023

@jtcohen6 thank you. I will check later whether the hosted schema has been updated.

BTW, could you please answer my earlier question, if you know? I contributed to improving a schema before, so I would also like to know how to update the schemas.
#7119 (comment)

5 participants