
Pydantic validator #1121

Merged: 17 commits into DAGWorks-Inc:main, Sep 17, 2024

Conversation

@cswartzvi (Contributor)

This PR adds pydantic integration via a new data validator and plugin. Resolves #473.

Changes

I added a file called data_quality/pydantic_validators.py with a new default validator, PydanticModelValidator. This validator is dynamically added to the list of available default validators if pydantic is available - very similar, some would say identical 😄, to how the pandera validators are added. The pydantic validator is passed a model parameter that is then used to validate the output of the decorated function:

class MyModel(BaseModel):
    name: str

@check_output(model=MyModel)
def foo() -> dict:
    return {"name": "hamilton"}

I also added a plugin file plugins/h_pydantic.py with a variant of the check_output decorator that uses the return type annotation to establish the pydantic model to validate:

class MyModel(BaseModel):
    name: str

@h_pydantic.check_output()
def foo() -> MyModel:
    return MyModel(name="hamilton")

My implementation uses the pydantic TypeAdapter - mainly because pydantic.validate_call does not have an option to check only the return value. TypeAdapter also allows you to specify a strict mode where type coercion is turned off; since modifying the outputs is not currently allowed (that is correct, right?), I have enabled strict mode for both the validator and the plugin.
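
To make the strict-mode distinction concrete, here is a minimal standalone pydantic sketch (the Row model and values are made up for illustration):

from pydantic import BaseModel, TypeAdapter

class Row(BaseModel):
    value: int

adapter = TypeAdapter(Row)
adapter.validate_python({"value": "1"})               # lax mode: "1" is coerced to 1
adapter.validate_python({"value": "1"}, strict=True)  # strict mode: raises ValidationError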

Although I think this behavior is idiomatic pydantic, I should point out that strict mode does not stop you from doing this (i.e. the following passes validation):

class MyModel(BaseModel):
    name: str

@h_pydantic.check_output()
def foo() -> MyModel:
    return {"name": "hamilton"}
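
If I understand pydantic correctly, that is because strict mode constrains the field values rather than the outer input: a plain dict is still accepted as input for a top-level model. A quick illustration with the MyModel above (again, just a sketch):

from pydantic import TypeAdapter

adapter = TypeAdapter(MyModel)
adapter.validate_python({"name": "hamilton"}, strict=True)  # OK: returns MyModel(name='hamilton')
adapter.validate_python({"name": 123}, strict=True)         # raises ValidationError: 123 is not a str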

One last thing to mention: h_pydantic.check_output currently checks that the return type annotation is a subclass of pydantic.BaseModel. In theory, you could use pydantic to check all kinds of things (builtins, Annotated types, ...); however, I was having trouble getting that to play nicely with validator resolution, so I scrapped it.
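
For illustration, that annotation check could look roughly like the following (a sketch only; _resolve_output_model is a hypothetical helper name, not the PR's actual code):

import inspect

from pydantic import BaseModel

def _resolve_output_model(fn) -> type[BaseModel]:
    """Return fn's return annotation, verifying it is a pydantic model class."""
    annotation = inspect.signature(fn).return_annotation
    if not (inspect.isclass(annotation) and issubclass(annotation, BaseModel)):
        raise ValueError(f"{fn.__name__} must be annotated with a pydantic BaseModel subclass")
    return annotation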

How I tested this

I added a file to the testing suite, test_pydantic_data_quality.py, that tests the validator and the check_output plugin decorator for both basic and complex cases.

Notes

There are a few points I wanted to bring up:

  • TypeAdapter is a pydantic 2.0 feature and conflicts with the [vaex] extra dependency 😞. I didn't notice this until I was updating pyproject.toml; let me know if this is a deal breaker and I will come up with an alternative implementation.
  • I deviated from the spec in Pydantic datatype validation for hamilton nodes #473 and used model (instead of schema) and pydantic.check_output (instead of pydantic.check_output_schema) - I can change them back if desired; I just thought they fit better with the terminology of pydantic and the ergonomics of the pandera plugin, respectively.
  • I will update the documentation and the plugin docstring if you are good with the above notes.

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Thanks for the opportunity to dig into this!

@skrawcz (Collaborator) commented Sep 4, 2024

@cswartzvi would you mind creating an example under data_quality showing it in action?

Otherwise, we just need to make sure the decorator turns up in our documentation.

@elijahbenizzy (Collaborator) left a comment

Yeah, this is really cool - nice work! Really appreciate it!

None of those strike me as blockers or even problems. Vaex is just for testing; if it works, then I'm not worried. Happy to support pydantic>=2.0 only. So yeah! Docs will be great - let us know if you need help getting them set up!

@elijahbenizzy (Collaborator)

Also, I don't think the tests are getting run. I can dig in tomorrow (this is a bit confusing), but it looks like:

  1. We check for changes (this is not getting hit)
  2. We launch these 5 jobs: https://github.com/DAGWorks-Inc/hamilton/blob/main/.circleci/config.yml#L137
  3. Calls this...
    if [[ ${TASK} == "integrations" ]]; then
  4. Note we only have 8 tests run on main: https://app.circleci.com/pipelines/github/DAGWorks-Inc/hamilton/4033/workflows/dfad0448-70b9-4a89-bd80-e6e6c8265640/jobs/68530. And this one doesn't actually detect a change.

@cswartzvi (Contributor, Author)

@elijahbenizzy sorry, I got overloaded with work the past few days. I am going to add the documentation either today or tomorrow - I will take a crack at it and let you know if I run into issues. Did you ever figure out why my tests were not running? If not, I can look into that as well.

@elijahbenizzy (Collaborator)

> Did you ever figure out why my tests were not running? If not, I can look into that as well.

Thank you! I have not yet -- if you don't mind, that would be much appreciated; otherwise I'm happy to carve out some time to figure out what's happening (lots of bash scripts...).

@cswartzvi (Contributor, Author)

I was looking into why my tests didn't run when I pushed the initial set of commits: https://app.circleci.com/pipelines/github/DAGWorks-Inc/hamilton/4035/workflows/c7dbcc07-b6df-4fd1-9a38-a76e81b9cb53

I don't have a lot of experience with CircleCI, so please forgive me if I am missing something basic, but I noticed that the check_for_changes job in .circleci/config.yml uses the following form of the git diff command:

git diff --name-only HEAD^ HEAD | grep '^.ci\|^.circleci\|^graph_adapter_tests\|^hamilton\|^plugin_tests\|^tests\|^requirements\|setup' > /dev/null

This command compares the current commit (HEAD) to the previous commit (HEAD^) before piping the result to grep. However, I pushed multiple commits at once, so if my very last commit didn't change one of the grepped directories, wouldn't this check fail? It is my understanding that CircleCI runs on a push and not for each commit - is that correct?

If my hunch is right, I see three solutions:

  1. Do nothing and encourage people to make single-commit pushes - might miss some tests if people forget
  2. Change the git diff command to something like git diff --name-only origin/main...HEAD and check for changes against the main branch - this will surely result in more tests being run within a PR, but it seems safer
  3. Beef up the check for changes with something like https://github.com/emmeowzing/dynamic-continuation-orb

Thoughts? Still working on those docs, BTW.

@elijahbenizzy (Collaborator)

> I was looking into why my tests didn't run when I pushed the initial set of commits [...] If my hunch is right, I see three solutions [...] Thoughts?

I think the second one is probably the cleanest -- better to run more tests than needed than to undertest... I can make the change, or if you're in the mood to, go for it! Any tests we miss will get caught when merging to main (it should be smart enough to diff against main there), but it's nice to catch issues earlier.

As an edge case, we could probably diff against the comparison branch rather than main. That said, at some point we'll probably switch to GitHub Actions as our CI system, so I think the simplest code-only change is the way to go - we shouldn't over-solve this.

@cswartzvi (Contributor, Author)

@elijahbenizzy I would be happy to make the change. I will start with option 2 and then see how hard it is to incorporate that edge case.

@elijahbenizzy (Collaborator) left a comment

Basically there - just a few notes about docs (which I know are a WIP). To get this shipped:

  • Add some more docstrings + add them to the right docs reference (ping me if you want help, but start with this Slack comment)
  • Rebase against the check_for_changes PR after we merge it to main, to ensure it works as intended on this PR

@cswartzvi (Contributor, Author) commented Sep 14, 2024

@elijahbenizzy I rebased and added that missing docstring - looks like changes were detected correctly in CI. How deep do you think I should go with the documentation? I see three potential places it could be added:

Does that sound reasonable?

@elijahbenizzy (Collaborator)

> @elijahbenizzy I rebased and added that missing docstring [...] How deep do you think I should go with the documentation? [...] Does that sound reasonable?

Sorry for the back and forth, this is ready! Only thing, re: docs:

I think what you did is good enough out of those three. If you want to add an example it would obviously be appreciated, but I don't want to slow you down. The place I'd add it is here: https://hamilton.dagworks.io/en/latest/reference/decorators/. This way we have a nice reference along with the docstring.

To do so you can:

  1. Add a file here with a brief description (call it pydantic.rst or something)
  2. Make it reference the relevant decorator (h_pydantic.check_output). Mention that you can also get it via check_output, to make that clear.
  3. Add a newline between this and the next line (.rst is an epic PITA): https://github.com/DAGWorks-Inc/hamilton/pull/1121/files#diff-c0dc4b429a8dd227d562cd69242be34a5b19fc49a48f98b7267dc722767d34a3R30 -- otherwise it won't compile

If this is too much for you or you don't have time, let me know and I can add it in easily after merging. You've already done a ton! Otherwise I want to get this out in the next release (Tuesday if we can!).

Really appreciate it 🫡

@cswartzvi (Contributor, Author) commented Sep 17, 2024

> Sorry for the back and forth, this is ready! Only thing, re: docs: [...] If this is too much for you or you don't have time, let me know and I can add it in easily after merging. [...] Otherwise I want to get this out in the next release (Tuesday if we can!).

No problem, I enjoy helping! I would like to add at least one more doc - I have something queued up, just one quick question. You mentioned adding pydantic.rst to https://hamilton.dagworks.io/en/latest/reference/decorators/ - do you think adding it directly to https://hamilton.dagworks.io/en/latest/reference/decorators/check_output/ might be better? This was my original idea because it doesn't break the meaning of the listed decorator and it fits in nicely with check_output.rst (specifically, I would have to update this paragraph and add a reference). Either way works for me, just let me know.

Edit: I pushed an example of what I am talking about. If that's not what you're looking for, just let me know 😄

@elijahbenizzy (Collaborator)

> do you think adding it directly to https://hamilton.dagworks.io/en/latest/reference/decorators/check_output/ might be better? This was my original idea because it doesn't break the meaning of the listed decorator and it fits in nicely with check_output.rst

Yes, I think that's pretty reasonable - perhaps a better place to put it! The only thing is to add the autoclass (or whatever) there, and make it clear which maps to which. All nits though; this is pretty much good to go as far as I'm concerned!

@elijahbenizzy (Collaborator) left a comment

Looks good - let's do the last docs stuff, then merge! Approving so all I have to do is click merge next.

@cswartzvi (Contributor, Author)

I added the autoclass to the bottom of the page and tightened up a few of the examples. Let me know if it needs anything else!

@elijahbenizzy (Collaborator)

> I added the autoclass to the bottom of the page and tightened up a few of the examples. Let me know if it needs anything else!

Looks great, thank you! Merging!

@elijahbenizzy elijahbenizzy merged commit ee9e4ae into DAGWorks-Inc:main Sep 17, 2024
24 checks passed