Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pandas Feather Reader and Writer Class #385

Merged
merged 7 commits into from
Sep 27, 2023

Conversation

JoJo10Smith
Copy link
Contributor

@JoJo10Smith JoJo10Smith commented Sep 26, 2023

This PR is in response to #384 where I will be adding additional parameters to the Feather functionality. I have gone ahead and set the Read and Write classes to match the other materializes (PandasFeatherReader and PandasFeatherWriter).

Pandas to_feather: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_feather.html
Additional Parameters: https://arrow.apache.org/docs/python/generated/pyarrow.feather.write_feather.html#pyarrow.feather.write_feather
Pandas read_feather: https://pandas.pydata.org/docs/reference/api/pandas.read_feather.html

Changes

pandas_extensions.py: Added the functions to read and write Feather files
my_script.py: Added the example of Feather materialization and optional to add Hamilton to PATH with directory
notebook.ipynb: Added the example of Feather materialization and optional to add Hamilton to PATH with directory
test_pandas_extensions.py: Added test cases for the read_feather and to_feather functions that are called PandasFeatherReader and PandasFeatherWriter respectively
test_load_from_data.feather: A simple Feather file with example data to test functionality
requirements-test.txt: Added the lz4 compression module which is required for compression

How I tested this

Code passed all the CIrcleCI tests and added examples of how to implement the new functionality. The new DAG looks as follows:

image

Notes

You will now be required to import the lz4 module for compression.
lz4: https://pypi.org/project/lz4/
I added this to the requirements-test.txt and NOT the requirements.txt

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@skrawcz
Copy link
Collaborator

skrawcz commented Sep 26, 2023

hamilton/plugins/pandas_extensions.py:27: in <module>
    from pandas._typing import FilePath, NpDtype, ReadBuffer, StorageOptions, WriteBuffer
E   ImportError: cannot import name 'FilePath' from 'pandas._typing' (/home/circleci/venvs/hamilton-venv/lib/python3.7/site-packages/pandas/_typing.py)

3.7 pandas typing issue.

@JoJo10Smith
Copy link
Contributor Author

Core classes and tests have been added for PandasFeatherReader and the PandasFeatherWriter. Just to confirm @skrawcz I think it would be best to keep the FeatherDataLoader as there might already be some implementations of this class in user's code. I would suggest having both the FeatherDataLoader as well as the new classes and then deprecating the FeatherDataLoader at a later stage.

TODO: add materializer examples.

@JoJo10Smith
Copy link
Contributor Author

@skrawcz Depending on your response this PR could be merged. I opted to not remove the current FeatherDataLoader class as described above. If you would like me to remove it then I can make that change but believe it is best to keep it and deprecate later. I could add a note in the docstring to use the PandasFeatherReader and PandasFeatherWriter instead.

@JoJo10Smith JoJo10Smith changed the title [WIP] Pandas Feather Reader and Writer Class Pandas Feather Reader and Writer Class Sep 26, 2023
@skrawcz
Copy link
Collaborator

skrawcz commented Sep 26, 2023

Core classes and tests have been added for PandasFeatherReader and the PandasFeatherWriter. Just to confirm @skrawcz I think it would be best to keep the FeatherDataLoader as there might already be some implementations of this class in user's code. I would suggest having both the FeatherDataLoader as well as the new classes and then deprecating the FeatherDataLoader at a later stage.

@JoJo10Smith yes that's generally best practice. But luckily that's not required here as path is the only parameter required as input to your implementation and the old one. The way things are designed is that no one "directly" depends on these classes. So you can go ahead and delete the old class, and everyone else's code should still just work!

@skrawcz skrawcz merged commit c7e1137 into DAGWorks-Inc:main Sep 27, 2023
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants