This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Data quality next plans POC #149

Draft: wants to merge 24 commits into base: main
Commits (24)
d4d4f7a
Adds initial data quality decorator
elijahbenizzy May 9, 2022
40538f4
Adds a few default data validators
elijahbenizzy May 31, 2022
4668e45
Adds hook for custom validators
elijahbenizzy Jun 1, 2022
75a142c
Adds some (WIP) documentation for data quality
elijahbenizzy Jun 1, 2022
bc4c17e
Adds test to ensure all base default validators are added to the
elijahbenizzy Jun 1, 2022
7ecde0f
Adds support for layering data quality decorators
elijahbenizzy Jun 1, 2022
8ce8399
Adds NansAllowedValidator
elijahbenizzy Jun 2, 2022
b3f6e37
Adds actions for failures.
elijahbenizzy Jun 2, 2022
14b9f73
Adds validation to ensure that all validators with the same arg have the
elijahbenizzy Jun 3, 2022
9478e65
Adds tags for data quality nodes
elijahbenizzy Jun 7, 2022
8308c36
Adds end-to-end tests for data quality
elijahbenizzy Jun 7, 2022
33a19ce
Adds tests to ensure that constants don't change
elijahbenizzy Jun 7, 2022
1d312a4
Adds formatting change for pre-commit
elijahbenizzy Jun 7, 2022
aec2495
Attempt to fix imports
elijahbenizzy Jun 8, 2022
c033706
Changes imports to module-specific rather than individual classes/fun…
elijahbenizzy Jun 14, 2022
5d1d742
Small changes for PR
elijahbenizzy Jun 14, 2022
8c9bbb1
Removes currently unnecessary dependencies/config
elijahbenizzy Jun 28, 2022
c37232c
Removes unnecessary code
elijahbenizzy Jul 2, 2022
d936f0c
Adds pandera integration for data quality
elijahbenizzy Jul 3, 2022
7eaa049
Adds sections to test external integrations in config.yml
elijahbenizzy Jul 3, 2022
b1fc1ef
Moves BaseDefaultValidator to base
elijahbenizzy Jul 4, 2022
8891d72
Proof of concept that we can easily add a profile step.
elijahbenizzy Jul 4, 2022
d4f55aa
Proof of concept that we can override nodes with configs
elijahbenizzy Jul 4, 2022
4e6f081
Adds ability to specify applies_to in data quality decorator.
elijahbenizzy Jul 4, 2022
92 changes: 88 additions & 4 deletions .circleci/config.yml
@@ -26,7 +26,7 @@ jobs:
name: run tests
command: |
. venv/bin/activate
python -m pytest --cov=hamilton tests/
python -m pytest --cov=hamilton tests/ --ignore tests/integrations

build-py37:
docker:
@@ -50,7 +50,7 @@ jobs:
name: run tests
command: |
. venv/bin/activate
python -m pytest --cov=hamilton tests/
python -m pytest --cov=hamilton tests/ --ignore tests/integrations

build-py38:
docker:
@@ -74,7 +74,7 @@ jobs:
name: run tests
command: |
. venv/bin/activate
python -m pytest --cov=hamilton tests/
python -m pytest --cov=hamilton tests/ --ignore tests/integrations

build-py39:
docker:
@@ -98,7 +98,7 @@ jobs:
name: run tests
command: |
. venv/bin/activate
python -m pytest --cov=hamilton tests/
python -m pytest --cov=hamilton tests/ --ignore tests/integrations

pre-commit:
docker:
@@ -202,6 +202,86 @@ jobs:
command: |
. venv/bin/activate
python -m pytest graph_adapter_tests/h_spark
integrations-py36:
docker:
- image: circleci/python:3.6
steps:
- checkout
- run:
name: install all python dependencies for integrations
command: |
python -m venv venv || virtualenv venv
. venv/bin/activate
python --version
pip --version
pip install -e .[pandera] # TODO -- add more as we add more integrations
pip install -r requirements-test.txt
# run tests!
- run:
name: run tests
command: |
. venv/bin/activate
python -m pytest tests/integrations
integrations-py37:
docker:
- image: circleci/python:3.7
steps:
- checkout
- run:
name: install all python dependencies for integrations
command: |
python -m venv venv || virtualenv venv
. venv/bin/activate
python --version
pip --version
pip install -e .[pandera] # TODO -- add more as we add more integrations
pip install -r requirements-test.txt
# run tests!
- run:
name: run tests
command: |
. venv/bin/activate
python -m pytest tests/integrations
integrations-py38:
docker:
- image: circleci/python:3.8
steps:
- checkout
- run:
name: install all python dependencies for integrations
command: |
python -m venv venv || virtualenv venv
. venv/bin/activate
python --version
pip --version
pip install -e .[pandera] # TODO -- add more as we add more integrations
pip install -r requirements-test.txt
# run tests!
- run:
name: run tests
command: |
. venv/bin/activate
python -m pytest tests/integrations
integrations-py39:
docker:
- image: circleci/python:3.9
steps:
- checkout
- run:
name: install all python dependencies for integrations
command: |
python -m venv venv || virtualenv venv
. venv/bin/activate
python --version
pip --version
pip install -e .[pandera] # TODO -- add more as we add more integrations
pip install -r requirements-test.txt
# run tests!
- run:
name: run tests
command: |
. venv/bin/activate
python -m pytest tests/integrations
workflows:
version: 2
unit-test-workflow:
@@ -215,3 +215,7 @@ workflows:
- dask-py37
- ray-py37
- spark-py38
- integrations-py36
- integrations-py37
- integrations-py38
- integrations-py39
85 changes: 85 additions & 0 deletions data_quality.md
@@ -0,0 +1,85 @@
# Data Quality

Hamilton has a simple but powerful data quality capability. This enables you to write functions
that have assertions on their outputs. For example...

```python
import pandas as pd
import numpy as np
from hamilton.function_modifiers import check_output

@check_output(
    datatype=np.int64,
    data_in_range=(0, 100),
    importance="warn",
)
def some_int_data_between_0_and_100() -> pd.Series:
    pass
```

In the above, we run two assertions:

1. That the series has an `np.int64` datatype.
2. That every item in the series is between 0 and 100.

Furthermore, the workflow does not fail when a check fails. Rather, it logs a warning.
More on configuring that later, but you can see it's specified in the `importance` parameter above.

## Design

To add data quality validation, we run an additional computational step in your workflow after the function is computed.
See the comments on the `BaseDataValidationDecorator` class for how it works.
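
As a rough mental model, the injected step looks something like the following. This is a toy illustration only -- not the actual decorator internals, which live in `BaseDataValidationDecorator` -- and the `passes`/`message` attribute names on the validation result are assumptions:

```python
import logging
from typing import Any, Callable, List

logger = logging.getLogger(__name__)


def run_with_validation(fn: Callable[[], Any], validators: List, importance: str) -> Any:
    """Toy sketch: compute the node, run each validator over the result, then act."""
    result = fn()  # the original function's computation
    for validator in validators:
        validation = validator.validate(result)  # the injected validation step
        if not validation.passes:
            if importance == "fail":
                raise ValueError(validation.message)
            logger.warning(validation.message)  # importance == "warn"
    return result  # downstream nodes still consume the original result
```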

## Default Validators

The available default validators are listed in the variable `AVAILABLE_DEFAULT_VALIDATORS`
in `default_validators.py`. To add more, please implement the class in that file and then add it to the list.
There is a test that ensures everything is added to that list.
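
For illustration, a new default validator might look roughly like this. This is a hypothetical sketch: the class name, the `non_negative` argument, and the exact base-class hooks (`applies_to`, `arg`, `name`, `description`, `validate`, and the `ValidationResult` fields) are assumptions to be checked against `base.py` and `default_validators.py`:

```python
from typing import Type

import pandas as pd

from hamilton.data_quality.base import BaseDefaultValidator, ValidationResult


class NonNegativeValidator(BaseDefaultValidator):
    """Hypothetical default validator: checks that a series contains no negative values."""

    def __init__(self, non_negative: bool, importance: str = "warn"):
        super().__init__(importance=importance)
        self.non_negative = non_negative

    @classmethod
    def applies_to(cls, datatype: Type[Type]) -> bool:
        return issubclass(datatype, pd.Series)  # only meaningful for series outputs

    @classmethod
    def arg(cls) -> str:
        return "non_negative"  # the @check_output keyword argument this validator claims

    @classmethod
    def name(cls) -> str:
        return "non_negative_validator"

    def description(self) -> str:
        return "Validates that the series contains no negative values."

    def validate(self, data: pd.Series) -> ValidationResult:
        num_negative = int((data < 0).sum())
        passes = num_negative == 0 if self.non_negative else True
        return ValidationResult(
            passes=passes,
            message=f"{num_negative} negative value(s) found.",
            diagnostics={"num_negative": num_negative},
        )
```

Appending the class to `AVAILABLE_DEFAULT_VALIDATORS` is what would let `@check_output(non_negative=True, ...)` pick it up.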

## Custom Validators

To add a custom validator, you need to implement a subclass of `DataValidator`. You can then use the
`@check_output_custom` decorator to run it on a function. For example:

```python
import pandas as pd

from hamilton.function_modifiers import check_output_custom

# AllPrimeValidator is a user-defined validator -- see the sketch below.
@check_output_custom(AllPrimeValidator(...))
def prime_number_generator(number_of_primes_to_generate: int) -> pd.Series:
    pass
```
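
The validator itself might look something like this. This is a hypothetical sketch -- it assumes `DataValidator` exposes the same `applies_to`/`name`/`description`/`validate` hooks as the default validators, takes an `importance` argument in its constructor, and that `ValidationResult` carries `passes`, `message`, and `diagnostics`:

```python
from typing import Type

import pandas as pd

from hamilton.data_quality.base import DataValidator, ValidationResult


class AllPrimeValidator(DataValidator):
    """Hypothetical validator: checks that every value in a series is prime."""

    def __init__(self, importance: str = "warn"):
        super().__init__(importance=importance)

    @classmethod
    def applies_to(cls, datatype: Type[Type]) -> bool:
        return issubclass(datatype, pd.Series)

    @classmethod
    def name(cls) -> str:
        return "all_prime_validator"

    def description(self) -> str:
        return "Validates that every value in the series is a prime number."

    @staticmethod
    def _is_prime(n: int) -> bool:
        if n < 2:
            return False
        return all(n % i for i in range(2, int(n ** 0.5) + 1))

    def validate(self, data: pd.Series) -> ValidationResult:
        non_primes = data[~data.apply(self._is_prime)]
        return ValidationResult(
            passes=non_primes.empty,
            message=f"{len(non_primes)} non-prime value(s) found.",
            diagnostics={"non_prime_values": non_primes.tolist()},
        )
```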

## Urgency Levels

Currently there are two available urgency levels:

1. "warn"
2. "fail"

They do exactly what you'd expect: "warn" logs the failure to the terminal and continues on, while "fail"
raises an exception in the final node.
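
For example, switching the `importance` level in the earlier function to "fail" makes the workflow halt when either check fails:

```python
import pandas as pd
import numpy as np
from hamilton.function_modifiers import check_output


@check_output(
    datatype=np.int64,
    data_in_range=(0, 100),
    importance="fail",  # raise an exception instead of just logging a warning
)
def some_int_data_between_0_and_100() -> pd.Series:
    pass
```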

Limitations/future work are as follows:

1. Currently the actions are hardcoded. In the future, we will consider adding
special actions for each level that one can customize.
2. One can only disable data quality checks by commenting out the decorator. We intend to allow node-specific overrides.
3. Currently the data quality checks apply to every output of that function. E.g., if the function also uses `extract_columns`,
the checks execute on every column that's extracted (see the sketch below).
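
A hypothetical illustration of limitation (3) -- the function, column names, and decorator ordering here are assumptions, but the behavior follows the description above:

```python
import pandas as pd
from hamilton.function_modifiers import check_output, extract_columns


@extract_columns("height_zero_mean", "weight_zero_mean")  # hypothetical column names
@check_output(data_in_range=(-10.0, 10.0), importance="warn")
def normalized_measurements() -> pd.DataFrame:
    # The range check above runs against each extracted column.
    pass
```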

## Handling the results

We utilize tags to index nodes that represent data quality. All data-quality related tags start with the
prefix `hamilton.data_quality`. Currently there are two:

1. `hamilton.data_quality.contains_dq_results` -- this is a boolean that tells
whether a node outputs data quality results. These nodes get injected when
a function is decorated, and can be queried by the end user.
2. `hamilton.data_quality.source_node` -- this contains the name of the source node
whose data the data quality check points to.

Note that these tags will not be present if the node is not related to data quality --
don't assume they're in every node.

To query, one can simply filter for all the nodes that contain these tags and access the results!
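
For example, one might collect all data quality results like this. A sketch only -- it assumes `my_dag_module` is a hypothetical module of decorated functions, and that `Driver.list_available_variables()` exposes each node's tags:

```python
from hamilton import base, driver

import my_dag_module  # hypothetical module containing the decorated functions

# Use a dict result builder so heterogeneous validation results aren't squashed into a dataframe.
adapter = base.SimplePythonGraphAdapter(base.DictResult())
dr = driver.Driver({}, my_dag_module, adapter=adapter)

# Nodes holding data quality results are indexed by the tag described above.
dq_nodes = [
    var.name
    for var in dr.list_available_variables()
    if var.tags.get("hamilton.data_quality.contains_dq_results", False)
]

# Executing those nodes returns the validation results keyed by node name.
dq_results = dr.execute(dq_nodes)
for node_name, result in dq_results.items():
    print(node_name, result)
```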
27 changes: 27 additions & 0 deletions decorators.md
@@ -248,3 +248,30 @@ desired_outputs = [o.name for o in all_possible_outputs
if 'my_tag_value' == o.tags.get('my_tag_key')]
output = dr.execute(desired_outputs)
```

## @check_output

The `@check_output` decorator enables you to add simple data quality checks to your code.

For example:

```python
import pandas as pd
import numpy as np
from hamilton.function_modifiers import check_output

@check_output(
    datatype=np.int64,
    data_in_range=(0, 100),
)
def some_int_data_between_0_and_100() -> pd.Series:
    pass
```

The `@check_output` decorator takes in arguments, each of which corresponds to one of the default validators.
Each argument tells it to add the corresponding default validator to the list. The above thus creates
two validators: one that checks the datatype of the series, and one that checks whether the data is within a certain range.

Note that you can also specify custom validators using the `@check_output_custom` decorator.

See [data_quality](data_quality.md) for more information on available validators and how to build custom ones.