Investigate performance issues with schema parsing #2862

Closed
gshank opened this issue Oct 30, 2020 · 3 comments · Fixed by #3034

gshank commented Oct 30, 2020

In a large dbt project, schema parsing took 4 minutes out of 5.5 minutes total. (Schema parsing is parsing the yaml files in the model directories.) Currently we convert yaml to json and use json schemas generated by hologram (a Python package that generates json schemas from Python classes) to validate the json.
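
To illustrate that flow at a high level (not dbt's actual code -- the file path and schema below are made up, and dbt generates its schemas via hologram rather than writing them by hand):

```python
import yaml
import jsonschema

# Stand-in for a hologram-generated JSON schema describing one model entry.
MODEL_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["name"],
}

# Parse the yaml file into plain dicts/lists ("json-like" data) ...
with open("models/schema.yml") as f:
    data = yaml.safe_load(f)

# ... then validate each entry against the generated schema.
for model in (data or {}).get("models", []):
    jsonschema.validate(instance=model, schema=MODEL_SCHEMA)
```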

We need to do finer-grained performance analysis and investigation to determine which parts are slowest and come up with a plan to address the issues. Some possibilities include, but are not limited to, storing json schemas in a cache or on disk, using fast-jsonschema, and using some other validation method such as rx or pykwalify. Another possibility is a quick check for actionable parts of the yaml files -- the project we are testing included lots of yaml files with no actual effect on the parsed manifest.
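
For the "cache the schemas / swap the validator" direction, the shape could be roughly this -- a sketch using fastjsonschema, which compiles a schema into a reusable validator function; the names and caching strategy here are illustrative only:

```python
import fastjsonschema

_validator_cache = {}

def get_validator(schema_id, schema):
    """Compile the JSON schema once and reuse the compiled validator."""
    if schema_id not in _validator_cache:
        _validator_cache[schema_id] = fastjsonschema.compile(schema)
    return _validator_cache[schema_id]

# usage (MODEL_SCHEMA as in the earlier sketch):
# validate = get_validator("ModelNode", MODEL_SCHEMA)
# validate({"name": "my_model"})  # raises fastjsonschema.JsonSchemaException on failure
```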

jtcohen6 added this to the Kiyoshi Kuromiya milestone Nov 6, 2020

jtcohen6 commented Nov 13, 2020

Per our conversations over the past week, I think a good first step would be trying either:

  • to adapt hologram to store json schemas in a cache or on disk, rather than inspecting and validating objects on every call
  • to swap out hologram for a faster, less flexible library that works in a similar vein: mashumaro (#2885) or fast-jsonschema, just to prove that it's possible, even if we need to comment out / hack together a bunch of code in the process

In either case, there's some immediate benefit in doing (the least possible) legwork to generalize JsonSchemaMixin behind an intermediary, so that we can avoid repeatedly updating from hologram import JsonSchemaMixin across dozens of files.
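
A sketch of that intermediary (module path and alias are hypothetical): a single dbt-owned module re-exports the mixin, so swapping the backing library later touches one file instead of dozens.

```python
# dbt/dataclass_schema.py (hypothetical module path)
# The rest of the codebase imports the mixin from here, so replacing
# hologram later means editing only this module.
from hologram import JsonSchemaMixin as dbtClassMixin  # noqa: F401

# Elsewhere in the codebase:
#   from dbt.dataclass_schema import dbtClassMixin
#
#   @dataclass
#   class ParsedModelNode(dbtClassMixin):
#       ...
```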

gshank self-assigned this Nov 13, 2020

gshank commented Dec 3, 2020

Analysis of parsing performance determined that creation of JSON schemas was a very small percentage of the parsing time, and the schemas are already stored on the classes. The hologram 'get_fields' code introspected the dataclass fields on every call to 'from_dict', and that consumed a lot of time. Instituting class-level caching of 'mapped_fields' decreased schema parsing time by 44 seconds on our large test project. Removing the 'deepcopy' in fqn_search decreased time by about 20 seconds. Adding code to support use of libyaml in pyyaml (and using an appropriate Python and pyyaml package) decreased time by about 20 seconds.
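
Roughly, the first and third of those fixes look like this (a sketch with illustrative names, not hologram's or dbt's actual internals):

```python
import dataclasses
import yaml

class FieldCachingMixin:
    """Cache dataclass field introspection per class instead of per from_dict call."""

    @classmethod
    def _get_mapped_fields(cls):
        # Look only in this class's own __dict__ so subclasses don't
        # accidentally reuse a parent's cache.
        cached = cls.__dict__.get("_mapped_fields_cache")
        if cached is None:
            cached = {f.name: f for f in dataclasses.fields(cls)}
            cls._mapped_fields_cache = cached
        return cached

# Prefer the libyaml-backed loader when pyyaml was built against libyaml.
YamlLoader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)

def load_yaml_text(text):
    return yaml.load(text, Loader=YamlLoader)
```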

Removing the validation performed in from_dict (for the test project, which has no validation errors) reduced schema parsing time by 20 seconds and overall parsing time by 27 seconds. If we find a more efficient validator, there are opportunities for improvement here.
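
As a rough illustration of what was measured (not hologram's actual signature), the change amounts to making validation separable from deserialization:

```python
from dataclasses import dataclass, fields

@dataclass
class Column:
    name: str
    description: str = ""

    @classmethod
    def from_dict(cls, data, validate=True):
        if validate:
            # Placeholder for running the (possibly compiled) jsonschema validator.
            if "name" not in data:
                raise ValueError("missing required field: name")
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})

# Column.from_dict({"name": "id"})                  # with validation
# Column.from_dict({"name": "id"}, validate=False)  # skip validation when safe
```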

The other main opportunity for performance improvement is in the 'render_update' done for tests in schema files. Of the remaining 140 seconds of schema parsing in our test project, 76 seconds are spent updating test nodes via the macro rendering process. A lot of this time goes to creating macro-objects-that-act-like-jinja-functions to include in the "context" provided to jinja. Our large test project has about 42k tests and 1,450 macros, and the information in the macros is copied into 'MacroGenerator' objects for every macro, once per test -- 42k times. We are considering a number of ways to speed this up, such as limiting the macro-generator functions produced on a per-test basis, marking some macros as 'static', etc.
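
One shape a fix could take, sketched with hypothetical names (and assuming the macro callables don't depend on per-test state, which is part of what needs investigating): build each macro's jinja-callable wrapper once and share it across the ~42k test contexts instead of recreating ~1,450 wrappers per test.

```python
_macro_callable_cache = {}

def get_macro_callable(macro_unique_id, build):
    """Build the jinja-callable wrapper for a macro once, then reuse it."""
    if macro_unique_id not in _macro_callable_cache:
        _macro_callable_cache[macro_unique_id] = build()
    return _macro_callable_cache[macro_unique_id]

def build_test_context(macros):
    # The per-test context still maps macro names to callables, but the
    # callables are shared across tests rather than rebuilt for each test.
    return {
        m["name"]: get_macro_callable(m["unique_id"], lambda m=m: make_jinja_fn(m))
        for m in macros
    }

def make_jinja_fn(macro):
    # Placeholder for the real work of wrapping a parsed macro as a callable.
    return lambda *args, **kwargs: "<rendered {}>".format(macro["name"])
```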


jtcohen6 commented Feb 9, 2021

I'm going to close this issue now that we've merged #3034. Even though (by its own admission) the PR is only a partial resolution of the issues we began broadly identifying here, we'll be opening more follow-on issues (such as #3048) to track the next steps in our performance work. We know a lot more now, and we can identify targeted opportunities for improvement.
