
feat(ingest): Support for JSONL in s3 source with max_rows support #9921

Merged — 5 commits merged into datahub-project:master on Feb 28, 2024

Conversation

Adityamalik123
Contributor

@Adityamalik123 Adityamalik123 commented Feb 26, 2024

Closes #9920 Support for JSONL in s3 source with max_rows support

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a usage guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Feb 26, 2024
@Adityamalik123 Adityamalik123 marked this pull request as draft February 26, 2024 20:34
@Adityamalik123 Adityamalik123 changed the title added support for JSONL Support for JSONL in s3 source with max_rows support Feb 26, 2024
@Adityamalik123 Adityamalik123 marked this pull request as ready for review February 26, 2024 20:42
@hsheth2
Collaborator

hsheth2 commented Feb 27, 2024

@Adityamalik123 I thought jsonlines support was added a while back #5725 - was that not working anymore?

@Adityamalik123
Contributor Author

Adityamalik123 commented Feb 27, 2024

@hsheth2 #5725 only falls back to jsonlines when a .json file fails to decode with the ujson parser; it doesn't handle files with the .jsonl extension at all.
It also doesn't support a max_rows recipe parameter, because the current implementation loads the entire JSON structure into memory to infer the schema, whereas .jsonl files allow predictable memory consumption since each line is a single JSON object.
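The memory argument above can be sketched with only the stdlib (the actual DataHub code uses the `jsonlines` package; `read_jsonl_sample` is a hypothetical helper name for illustration):

```python
import itertools
import json
from io import StringIO

def read_jsonl_sample(file, max_rows):
    """Read at most max_rows objects from a JSON Lines file.

    Memory use is bounded by max_rows: each line is an independent
    JSON object, parsed lazily one at a time.
    """
    rows = (json.loads(line) for line in file if line.strip())
    return list(itertools.islice(rows, max_rows))

# Example: the file holds three objects, but only two are parsed.
data = StringIO('{"a": 1}\n{"a": 2}\n{"a": 3}\n')
sample = read_jsonl_sample(data, max_rows=2)
print(sample)  # → [{'a': 1}, {'a': 2}]
```

By contrast, inferring a schema from a plain .json file requires decoding the whole document before any row limit can apply.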

Collaborator

@hsheth2 hsheth2 left a comment


We'll also need some new test cases added to test_schema_inference

Collaborator

@hsheth2 hsheth2 left a comment


looks good - just had one question and one suggestion

file.seek(0)
reader = jsl.Reader(file)
datastore = [obj for obj in reader.iter(type=dict, skip_invalid=True)]
datastore = []
Collaborator

@hsheth2 hsheth2 Feb 27, 2024


can make this simpler using itertools

Suggested change
datastore = []
datastore = [obj for obj in itertools.islice(reader.iter(type=dict, skip_invalid=True), self.max_rows)]
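For reference, `itertools.islice` truncates any iterator lazily, so the suggested change means the reader never parses past `max_rows` objects. A minimal stand-alone illustration (the `counting_iter` helper is hypothetical, used only to show how far iteration actually runs):

```python
import itertools

def counting_iter(n, seen):
    """Yield 0..n-1, recording each value actually produced."""
    for i in range(n):
        seen.append(i)  # record how far iteration went
        yield i

seen = []
first_two = list(itertools.islice(counting_iter(10, seen), 2))
print(first_two)  # → [0, 1]
print(seen)       # → [0, 1] — items 2..9 were never produced
```

Note that `list(itertools.islice(...))` is equivalent to, and slightly tidier than, the list comprehension in the suggestion.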

Contributor Author

@Adityamalik123 Adityamalik123 Feb 27, 2024


Thanks for the suggestion. I'll update the same.

@@ -377,7 +377,7 @@ def read_file_spark(self, file: str, ext: str) -> Optional[DataFrame]:
ignoreLeadingWhiteSpace=True,
ignoreTrailingWhiteSpace=True,
)
elif ext.endswith(".json"):
elif ext.endswith(".json") or ext.endswith(".jsonl"):
Collaborator


have you tested/checked that spark.read.json() works with jsonl?

Contributor Author

@Adityamalik123 Adityamalik123 Feb 27, 2024


Yes, it works fine without any additional options.
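As an aside on the suffix check in the diff above: `".jsonl"` does not satisfy `ext.endswith(".json")` (the trailing "l" breaks the suffix match), so the explicit second clause is genuinely needed. A quick stdlib check:

```python
def is_json_family(ext: str) -> bool:
    # ".jsonl" does NOT end with ".json", so both checks are required.
    return ext.endswith(".json") or ext.endswith(".jsonl")

print(is_json_family("file.json"))     # → True
print(is_json_family("file.jsonl"))    # → True
print("file.jsonl".endswith(".json"))  # → False
```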

[screenshot omitted: spark.read.json() output on a .jsonl file]

@hsheth2 hsheth2 changed the title Support for JSONL in s3 source with max_rows support feat(ingest): Support for JSONL in s3 source with max_rows support Feb 27, 2024
@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Feb 27, 2024
@treff7es treff7es merged commit 92b1cfa into datahub-project:master Feb 28, 2024
54 checks passed
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.

Successfully merging this pull request may close these issues.

Support for JSONL in s3 source with max_rows support
3 participants