Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: JSON materializer don't support list[dict] #690

Merged
merged 4 commits into from
Feb 13, 2024

Conversation

zilto
Copy link
Collaborator

@zilto zilto commented Feb 11, 2024

At the top level, JSON files are either a mapping { } or a list [ ]. Currently, the default JSON DataSaver and DataLoader only map to a Python dict type and don't support a list[dict] annotation.

For example:

JSON_FILE_CONTENT = [
    {"name": "A"},
    {"name": "B"}
]

# this works
def node(json_data: dict) -> dict:
    return json_data

# this should work but doesn't
def from_to_bug(json_data: list[dict]) -> list[dict]:
    return json_data

# this doesn't work because of `from_`
def from_bug(json_data: list[dict]) -> dict:
    return json_data

# this doesn't work because of `to`
def to_bug(json_data: dict) -> list[dict]:
    return json_data


if __name__ == "__main__":
    from hamilton import driver
    from hamilton.io.materialization import to, from_
    import __main__
    
    dr = driver.Builder().with_modules(__main__).build()
    
    # comment out functions to avoid name/type conflicts
    dr.materialize(
        from_.json(target="json_data", path="data.json"),
        to.json(path="./out.json", id="out__json", dependencies=["from_to_bug"]),
        additional_vars=["from_to_bug"],
    )

Changes

  • Changed the type annotations for JSON DataSaver and DataLoader

How I tested this

  • the test suite runs successfully
  • the above example now works

Notes

  • ❗ one issue with supporting both dict and list[dict] is that the object might be miss-typed and still load/save. For example,
JSON_FILE_CONTENT = [
    {"name": "A"},
    {"name": "B"}
]

# with the fix, the next 2 fucnctions would work
# however, `json_data` can't be `list[dict]` and `dict`

# for the above data,
# this one fails because isinstance(json_data, dict) is False
def from_bug_alert(json_data: list[dict]) -> dict:
    return json_data

# this one works / fails silently because type(json_data) in (dict, list[dict])
def from_bug_silent(json_data: dict) -> list[dict]:
    return json_data
    
# similar logic applies with `to`, but it matters less because nothing
# depends on materializer nodes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@zilto zilto added the bug Something isn't working label Feb 11, 2024
Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm -- maybe this should be a .jsonl format? Its not to spec -- while python json is happy with this, these are usually marked differently.

Fine either way, but we should add a unit test as well for this

@zilto
Copy link
Collaborator Author

zilto commented Feb 12, 2024

.json accepts a single JSON object which can be a mapping { } or an array [ ] at the top level

# data.json
# contains 1 object
[
    {"name": "A"},
    {"name": "B"}
]

.jsonl format uses newline (no comma separator) to store multiple JSON objects in a single file. They are parsed individually

# data.jsonl
# contains 2 objects
{"name": "a", "value": 1}
{"name": "b", "value": 11}

@zilto
Copy link
Collaborator Author

zilto commented Feb 12, 2024

Encountered the issue after downloading a dataset from Kaggle which was stored in a JSON with top-level array. Had to annotate it def from_bug_silent(json_data: dict) -> list[dict]: even though json_data is in reality a list[dict]

zilto added 2 commits February 13, 2024 14:02
The Python json library allows to save tuples and
other collections as JSON. However, they would loaded
as a list. Hence, we limit the type to dict and list[dict]
Copy link
Collaborator

@elijahbenizzy elijahbenizzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! There are parameterizations without more than 1, but I think that's a pattern we should just kill in the rest of it (or add more parameterizations).

@zilto zilto merged commit 2c76db0 into main Feb 13, 2024
22 checks passed
@zilto zilto deleted the fix/json-materializer-typing branch February 13, 2024 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants