This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Data quality next plans POC #149

Draft: wants to merge 24 commits into base: main
Commits (24)
d4d4f7a
Adds initial data quality decorator
elijahbenizzy May 9, 2022
40538f4
Adds a few default data validators
elijahbenizzy May 31, 2022
4668e45
Adds hook for custom validators
elijahbenizzy Jun 1, 2022
75a142c
Adds some (WIP) documentation for data quality
elijahbenizzy Jun 1, 2022
bc4c17e
Adds test to ensure all base default validators are added to the
elijahbenizzy Jun 1, 2022
7ecde0f
Adds support for layering data quality decorators
elijahbenizzy Jun 1, 2022
8ce8399
Adds NansAllowedValidator
elijahbenizzy Jun 2, 2022
b3f6e37
Adds actions for failures.
elijahbenizzy Jun 2, 2022
14b9f73
Adds validation to ensure that all validators with the same arg have the
elijahbenizzy Jun 3, 2022
9478e65
Adds tags for data quality nodes
elijahbenizzy Jun 7, 2022
8308c36
Adds end-to-end tests for data quality
elijahbenizzy Jun 7, 2022
33a19ce
Adds tests to ensure that constants don't change
elijahbenizzy Jun 7, 2022
1d312a4
Adds formatting change for pre-commit
elijahbenizzy Jun 7, 2022
aec2495
Attempt to fix imports
elijahbenizzy Jun 8, 2022
c033706
Changes imports to module-specific rather than individual classes/fun…
elijahbenizzy Jun 14, 2022
5d1d742
Small changes for PR
elijahbenizzy Jun 14, 2022
8c9bbb1
Removes currently unnecessary dependencies/config
elijahbenizzy Jun 28, 2022
c37232c
Removes unnecessary code
elijahbenizzy Jul 2, 2022
d936f0c
Adds pandera integration for data quality
elijahbenizzy Jul 3, 2022
7eaa049
Adds sections to test external integrations in config.yml
elijahbenizzy Jul 3, 2022
b1fc1ef
Moves BaseDefaultValidator to base
elijahbenizzy Jul 4, 2022
8891d72
Proof of concept that we can easily add a profile step.
elijahbenizzy Jul 4, 2022
d4f55aa
Proof of concept that we can override nodes with configs
elijahbenizzy Jul 4, 2022
4e6f081
Adds ability to specify applies_to in data quality decorator.
elijahbenizzy Jul 4, 2022
92 changes: 88 additions & 4 deletions .circleci/config.yml
@@ -26,7 +26,7 @@ jobs:
name: run tests
command: |
. venv/bin/activate
python -m pytest --cov=hamilton tests/
python -m pytest --cov=hamilton tests/ --ignore tests/integrations

build-py37:
docker:
@@ -50,7 +50,7 @@ jobs:
name: run tests
command: |
. venv/bin/activate
python -m pytest --cov=hamilton tests/
python -m pytest --cov=hamilton tests/ --ignore tests/integrations

build-py38:
docker:
@@ -74,7 +74,7 @@ jobs:
name: run tests
command: |
. venv/bin/activate
python -m pytest --cov=hamilton tests/
python -m pytest --cov=hamilton tests/ --ignore tests/integrations

build-py39:
docker:
@@ -98,7 +98,7 @@ jobs:
name: run tests
command: |
. venv/bin/activate
python -m pytest --cov=hamilton tests/
python -m pytest --cov=hamilton tests/ --ignore tests/integrations

pre-commit:
docker:
@@ -202,6 +202,86 @@ jobs:
command: |
. venv/bin/activate
python -m pytest graph_adapter_tests/h_spark
integrations-py36:
docker:
- image: circleci/python:3.6
steps:
- checkout
- run:
name: install all python dependencies for integrations
command: |
python -m venv venv || virtualenv venv
. venv/bin/activate
python --version
pip --version
pip install -e .[pandera] # TODO -- add more as we add more integrations
pip install -r requirements-test.txt
# run tests!
- run:
name: run tests
command: |
. venv/bin/activate
python -m pytest tests/integrations
integrations-py37:
docker:
- image: circleci/python:3.7
steps:
- checkout
- run:
name: install all python dependencies for integrations
command: |
python -m venv venv || virtualenv venv
. venv/bin/activate
python --version
pip --version
pip install -e .[pandera] # TODO -- add more as we add more integrations
pip install -r requirements-test.txt
# run tests!
- run:
name: run tests
command: |
. venv/bin/activate
python -m pytest tests/integrations
integrations-py38:
docker:
- image: circleci/python:3.8
steps:
- checkout
- run:
name: install all python dependencies for integrations
command: |
python -m venv venv || virtualenv venv
. venv/bin/activate
python --version
pip --version
pip install -e .[pandera] # TODO -- add more as we add more integrations
pip install -r requirements-test.txt
# run tests!
- run:
name: run tests
command: |
. venv/bin/activate
python -m pytest tests/integrations
integrations-py39:
docker:
- image: circleci/python:3.9
steps:
- checkout
- run:
name: install all python dependencies for integrations
command: |
python -m venv venv || virtualenv venv
. venv/bin/activate
python --version
pip --version
pip install -e .[pandera] # TODO -- add more as we add more integrations
pip install -r requirements-test.txt
# run tests!
- run:
name: run tests
command: |
. venv/bin/activate
python -m pytest tests/integrations
workflows:
version: 2
unit-test-workflow:
@@ -215,3 +215,7 @@ workflows:
- dask-py37
- ray-py37
- spark-py38
- integrations-py36
- integrations-py37
- integrations-py38
- integrations-py39
85 changes: 85 additions & 0 deletions data_quality.md
@@ -0,0 +1,85 @@
# Data Quality

Hamilton has a simple but powerful data quality capability. This enables you to write functions
that have assertions on their outputs. For example...

```python
import pandas as pd
import numpy as np
from hamilton.function_modifiers import check_output

@check_output(
    datatype=np.int64,
    data_in_range=(0, 100),
    importance="warn",
)
def some_int_data_between_0_and_100() -> pd.Series:
    pass
```

In the above, we run two assertions:

1. That the series has an `np.int64` datatype.
2. That every item in the series is between 0 and 100.

Furthermore, the workflow does not fail when a check fails. Rather, it logs a warning.
More on configuring that later, but you can see it's specified in the `importance` parameter above.

## Design

To add data quality validation, we run an additional computational step in your workflow after the function is computed.
See the comments on the `BaseDataValidationDecorator` class for how it works.
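
As a rough mental model, the injected step looks something like the following. This is a toy illustration only -- not the actual decorator internals, which live in `BaseDataValidationDecorator` -- and the `passes`/`message` attribute names on the validation result are assumptions:

```python
import logging
from typing import Any, Callable, List

logger = logging.getLogger(__name__)


def run_with_validation(fn: Callable[[], Any], validators: List, importance: str) -> Any:
    """Toy sketch: compute the node, run each validator over the result, then act."""
    result = fn()  # the original function's computation
    for validator in validators:
        validation = validator.validate(result)  # the injected validation step
        if not validation.passes:
            if importance == "fail":
                raise ValueError(validation.message)
            logger.warning(validation.message)  # importance == "warn"
    return result  # downstream nodes still consume the original result
```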

## Default Validators

The available default validators are listed in the variable `AVAILABLE_DEFAULT_VALIDATORS`
in `default_validators.py`. To add more, please implement the class in that file and then add it to the list.
There is a test that ensures everything is added to that list.
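
For illustration, a new default validator might look roughly like this. This is a hypothetical sketch: the class name, the `non_negative` argument, and the exact base-class hooks (`applies_to`, `arg`, `name`, `description`, `validate`, and the `ValidationResult` fields) are assumptions to be checked against `base.py` and `default_validators.py`:

```python
from typing import Type

import pandas as pd

from hamilton.data_quality.base import BaseDefaultValidator, ValidationResult


class NonNegativeValidator(BaseDefaultValidator):
    """Hypothetical default validator: checks that a series contains no negative values."""

    def __init__(self, non_negative: bool, importance: str = "warn"):
        super().__init__(importance=importance)
        self.non_negative = non_negative

    @classmethod
    def applies_to(cls, datatype: Type[Type]) -> bool:
        return issubclass(datatype, pd.Series)  # only meaningful for series outputs

    @classmethod
    def arg(cls) -> str:
        return "non_negative"  # the @check_output keyword argument this validator claims

    @classmethod
    def name(cls) -> str:
        return "non_negative_validator"

    def description(self) -> str:
        return "Validates that the series contains no negative values."

    def validate(self, data: pd.Series) -> ValidationResult:
        num_negative = int((data < 0).sum())
        passes = num_negative == 0 if self.non_negative else True
        return ValidationResult(
            passes=passes,
            message=f"{num_negative} negative value(s) found.",
            diagnostics={"num_negative": num_negative},
        )
```

Appending the class to `AVAILABLE_DEFAULT_VALIDATORS` is what would let `@check_output(non_negative=True, ...)` pick it up.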

## Custom Validators

To add a custom validator, you need to implement a subclass of `DataValidator`. You can then use the
`@check_output_custom` decorator to run it on a function. For example:

```python
import pandas as pd

from hamilton.function_modifiers import check_output_custom

# AllPrimeValidator is a user-defined validator -- see the sketch below.
@check_output_custom(AllPrimeValidator(...))
def prime_number_generator(number_of_primes_to_generate: int) -> pd.Series:
    pass
```
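
The validator itself might look something like this. This is a hypothetical sketch -- it assumes `DataValidator` exposes the same `applies_to`/`name`/`description`/`validate` hooks as the default validators, takes an `importance` argument in its constructor, and that `ValidationResult` carries `passes`, `message`, and `diagnostics`:

```python
from typing import Type

import pandas as pd

from hamilton.data_quality.base import DataValidator, ValidationResult


class AllPrimeValidator(DataValidator):
    """Hypothetical validator: checks that every value in a series is prime."""

    def __init__(self, importance: str = "warn"):
        super().__init__(importance=importance)

    @classmethod
    def applies_to(cls, datatype: Type[Type]) -> bool:
        return issubclass(datatype, pd.Series)

    @classmethod
    def name(cls) -> str:
        return "all_prime_validator"

    def description(self) -> str:
        return "Validates that every value in the series is a prime number."

    @staticmethod
    def _is_prime(n: int) -> bool:
        if n < 2:
            return False
        return all(n % i for i in range(2, int(n ** 0.5) + 1))

    def validate(self, data: pd.Series) -> ValidationResult:
        non_primes = data[~data.apply(self._is_prime)]
        return ValidationResult(
            passes=non_primes.empty,
            message=f"{len(non_primes)} non-prime value(s) found.",
            diagnostics={"non_prime_values": non_primes.tolist()},
        )
```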

## Urgency Levels

Currently there are two available urgency levels:

1. "warn"
2. "fail"

They do exactly what you'd expect: "warn" logs the failure to the terminal and continues on, while "fail"
raises an exception in the final node.
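
For example, switching the `importance` level in the earlier function to "fail" makes the workflow halt when either check fails:

```python
import pandas as pd
import numpy as np
from hamilton.function_modifiers import check_output


@check_output(
    datatype=np.int64,
    data_in_range=(0, 100),
    importance="fail",  # raise an exception instead of just logging a warning
)
def some_int_data_between_0_and_100() -> pd.Series:
    pass
```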

Limitations/future work are as follows:

1. Currently the actions are hardcoded. In the future, we will consider adding
special actions for each level that one can customize.
2. One can only disable data quality checks by commenting out the decorator. We intend to allow node-specific overrides.
3. Currently the data quality checks apply to every output of that function. E.g., if the function also uses `extract_columns`,
the checks execute on every column that's extracted (see the sketch below).
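
A hypothetical illustration of limitation (3) -- the function, column names, and decorator ordering here are assumptions, but the behavior follows the description above:

```python
import pandas as pd
from hamilton.function_modifiers import check_output, extract_columns


@extract_columns("height_zero_mean", "weight_zero_mean")  # hypothetical column names
@check_output(data_in_range=(-10.0, 10.0), importance="warn")
def normalized_measurements() -> pd.DataFrame:
    # The range check above runs against each extracted column.
    pass
```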

## Handling the results

We utilize tags to index nodes that represent data quality. All data-quality related tags start with the
prefix `hamilton.data_quality`. Currently there are two:

1. `hamilton.data_quality.contains_dq_results` -- this is a boolean that tells
whether a node outputs data quality results. These nodes get injected when
a function is decorated, and can be queried by the end user.
2. `hamilton.data_quality.source_node` -- this contains the name of the source node
whose data the data quality check points to.

Note that these tags will not be present if the node is not related to data quality --
don't assume they're in every node.

To query, one can simply filter for all the nodes that contain these tags and access the results!
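
For example, one might collect all data quality results like this. A sketch only -- it assumes `my_dag_module` is a hypothetical module of decorated functions, and that `Driver.list_available_variables()` exposes each node's tags:

```python
from hamilton import base, driver

import my_dag_module  # hypothetical module containing the decorated functions

# Use a dict result builder so heterogeneous validation results aren't squashed into a dataframe.
adapter = base.SimplePythonGraphAdapter(base.DictResult())
dr = driver.Driver({}, my_dag_module, adapter=adapter)

# Nodes holding data quality results are indexed by the tag described above.
dq_nodes = [
    var.name
    for var in dr.list_available_variables()
    if var.tags.get("hamilton.data_quality.contains_dq_results", False)
]

# Executing those nodes returns the validation results keyed by node name.
dq_results = dr.execute(dq_nodes)
for node_name, result in dq_results.items():
    print(node_name, result)
```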
27 changes: 27 additions & 0 deletions decorators.md
@@ -248,3 +248,30 @@ desired_outputs = [o.name for o in all_possible_outputs
if 'my_tag_value' == o.tags.get('my_tag_key')]
output = dr.execute(desired_outputs)
```

## @check_output

The `@check_output` decorator enables you to add simple data quality checks to your code.

For example:

```python
import pandas as pd
import numpy as np
from hamilton.function_modifiers import check_output

@check_output(
    datatype=np.int64,
    data_in_range=(0, 100),
)
def some_int_data_between_0_and_100() -> pd.Series:
    pass
```

The `@check_output` decorator takes in arguments, each of which corresponds to one of the default validators.
Each argument tells it to add the corresponding default validator to the list. The above thus creates
two validators: one that checks the datatype of the series, and one that checks whether the data is within a certain range.

Note that you can also specify custom validators using the `@check_output_custom` decorator.

See [data_quality](data_quality.md) for more information on available validators and how to build custom ones.