feat(ingest): Support for JSONL in s3 source with max_rows support #9921
Conversation
@Adityamalik123 I thought jsonlines support was added a while back in #5725 - was that not working anymore?
@hsheth2 #5725 tries to deserialize
We'll also need some new test cases added to test_schema_inference
metadata-ingestion/src/datahub/ingestion/source/schema_inference/csv_tsv_jsonl.py
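To illustrate the kind of test cases requested for test_schema_inference, here is a minimal, self-contained sketch of schema inference over line-delimited JSON. The function name `infer_jsonl_schema` and the type-name output format are hypothetical; the real schema_inference module in metadata-ingestion has its own inference logic and result types.

```python
import io
import itertools
import json

def infer_jsonl_schema(file, max_rows=None):
    # Hypothetical helper: build a {field: python-type-name} mapping from the
    # first max_rows lines of a JSONL stream. Later rows may introduce new
    # fields; the first observed type for a field wins.
    schema = {}
    for line in itertools.islice(file, max_rows):
        if not line.strip():
            continue
        obj = json.loads(line)
        for key, value in obj.items():
            schema.setdefault(key, type(value).__name__)
    return schema

sample = io.StringIO(
    '{"name": "a", "count": 1}\n'
    '{"name": "b", "count": 2, "extra": true}\n'
)
print(infer_jsonl_schema(sample, max_rows=2))
# {'name': 'str', 'count': 'int', 'extra': 'bool'}
```

A real test would feed a fixture file through the source's inference class and compare against expected SchemaField objects, but the fixture shape (one JSON object per line, with fields appearing mid-stream) is the interesting part to cover.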
looks good - just had one question and one suggestion
file.seek(0)
reader = jsl.Reader(file)
datastore = [obj for obj in reader.iter(type=dict, skip_invalid=True)]
datastore = []
can make this simpler using itertools
- datastore = []
+ datastore = [obj for obj in itertools.islice(reader.iter(type=dict, skip_invalid=True), self.max_rows)]
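The suggested `itertools.islice` pattern caps how many rows are materialized without reading the whole file first. A minimal stdlib-only sketch of the same idea (the PR itself uses the jsonlines reader, aliased `jsl` above; the `read_jsonl` helper here is illustrative):

```python
import io
import itertools
import json

def read_jsonl(file, max_rows=None):
    # islice(iterable, None) yields everything, so max_rows=None means
    # "no limit" - the same behavior the one-liner has when self.max_rows
    # is unset.
    return [json.loads(line) for line in itertools.islice(file, max_rows)]

rows = io.StringIO("\n".join('{"n": %d}' % i for i in range(10)))
print(read_jsonl(rows, max_rows=3))  # [{'n': 0}, {'n': 1}, {'n': 2}]
```

Because `islice` stops pulling from the underlying iterator once the limit is hit, a 10 GB JSONL file with `max_rows=100` only parses 100 lines.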
Thanks for the suggestion; I'll update it.
@@ -377,7 +377,7 @@ def read_file_spark(self, file: str, ext: str) -> Optional[DataFrame]:
                  ignoreLeadingWhiteSpace=True,
                  ignoreTrailingWhiteSpace=True,
              )
-         elif ext.endswith(".json"):
+         elif ext.endswith(".json") or ext.endswith(".jsonl"):
have you tested/checked that spark.read.json() works with jsonl?
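For context on the reviewer's question: Spark's `DataFrameReader.json` defaults to `multiLine=False`, meaning it expects one JSON object per line - which is exactly the JSON Lines format. A stdlib-only sketch of that line-by-line parsing model (no Spark required), to show why a `.jsonl` file matches the reader's default expectations:

```python
import io
import json

# JSON Lines = one complete JSON value per line. Spark's default JSON
# reader (multiLine=False) parses input the same way, record per line.
jsonl = io.StringIO('{"id": 1}\n{"id": 2}\n')
records = [json.loads(line) for line in jsonl if line.strip()]
print(records)  # [{'id': 1}, {'id': 2}]
```

So `spark.read.json()` should handle `.jsonl` content out of the box, whereas a multi-line pretty-printed JSON file would need `multiLine=True`; verifying with an actual `.jsonl` fixture, as the reviewer asks, is still worthwhile.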
Closes #9920: Support for JSONL in s3 source with max_rows support
Checklist