Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Payload Implementation #828

Merged
merged 147 commits into from
Jun 30, 2022
Merged

Payload Implementation #828

merged 147 commits into from
Jun 30, 2022

Conversation

hepengfe
Copy link
Collaborator

@hepengfe hepengfe commented Jun 8, 2022

This PR partially implements #810

Description of changes

  • implement basic adding of Payload in DataPack and DataStore
  • implement Payload generic class with its subclasses for different modalities such as text, audio and image.
  • Payload can holds loading method and load path(assigned in reader), and they can be used together to achieve lazy loading in processor (rather than reader).
  • modified example of audio reader to show how lazy loading works in the new design.

Possible influences of this PR.

  • modularize loading/caching data of different modalities.
  • keep data independent from data ontology classes from the perspective of loading/caching such that we make sure data ontologies are lightweight

Test Conducted

Describe what test cases are included for the PR.

@hepengfe hepengfe self-assigned this Jun 8, 2022
@hepengfe hepengfe added topic: infra Core infrastructure related issues. topic:cv labels Jun 8, 2022
@hepengfe hepengfe added this to the 0.3 stable version milestone Jun 8, 2022
@hepengfe
Copy link
Collaborator Author

hepengfe commented Jun 8, 2022

For audio and image data, we don't want to import third-party package directly in the payload class so I keep a loading interface (an abstract method loading_method) such that users can/must customize the loading method.

I rewrote the AudioReader into SoundfileAudioPayload example here https:/asyml/forte/pull/828/files#diff-0a7626d45acb22318a269b5ece6049d603ad6326f8108cda7a0ed2dbc3978440. In this example, user import third-party package in the loading_method and customize their own payload class.

forte/data/ontology/top.py Show resolved Hide resolved
forte/data/ontology/top.py Outdated Show resolved Hide resolved
forte/data/ontology/top.py Outdated Show resolved Hide resolved
forte/data/ontology/top.py Outdated Show resolved Hide resolved
@hunterhector
Copy link
Member

hunterhector commented Jun 8, 2022

For audio and image data, we don't want to import third-party package directly in the payload class

True.

so I keep a loading interface (an abstract method loading_method) such that users can/must customize the loading method.

I am considering whether this is a good idea since that means users will have to write customized code in the ontology class files (which are generated). This will create a few problems:

  1. Future generated code would overwrite this.
  2. The implementation cannot be passed along with a JSON spec file only.
  3. If forces someone to use one single function to do the parsing, which might be a small problem.

Actually, we don't have to make the ontology this complex, the Payload ontology only needs to store the data, but it doesn't have to know how to parse the data. Like in the text processing part, we put most of the work to the reader.

Imaging the HTMLReader, you can consider HTML texts are special texts with customized loading methods, but we won't put the loading logic in the entry.

I would suggest the following:

  1. define payload entry to store what data will be stored/cached
  2. define a metadata ontology to specify the data formats (like JPEG)
  3. write some readers, which will load the payload depending on the metadata

Note that we are not experts in image formats right now, so for steps 2 and 3, we can create examples first, and not make them core library code. Once we fully master or get comfortable with certain data types (I think it is possible for easy data types like images stored as NumPy matrices), we can move more to the library core.

forte/data/base_pack.py Outdated Show resolved Hide resolved
forte/data/data_pack.py Show resolved Hide resolved
forte/data/readers/audio_reader.py Outdated Show resolved Hide resolved
forte/processors/ir/bert/bert_based_query_creator.py Outdated Show resolved Hide resolved
ft/onto/metric.py Outdated Show resolved Hide resolved
tests/forte/data/audio_annotation_test.py Show resolved Hide resolved
tests/forte/data/audio_annotation_test.py Show resolved Hide resolved
tests/forte/data/audio_annotation_test.py Show resolved Hide resolved
tests/forte/data/readers/audio_reader_test.py Outdated Show resolved Hide resolved
@hepengfe hepengfe merged commit 8b233e4 into asyml:master Jun 30, 2022
@hepengfe hepengfe deleted the payload branch June 30, 2022 00:08
@hunterhector
Copy link
Member

hunterhector commented Jul 4, 2022

I remember clearly I requested to install enum package (https://pypi.org/project/enum34/) for compatibility issues, but this is not really addressed. requirement.txt is not enough, it must be in the setup.py. @mylibrar @hepengfe


# since we cannot pass different modality from generated ontology, and
# we don't want to import base ontology in the header of the file
# we import it here.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a solution. We cannot import items that are generated back to the source code. This doesn't make sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:cv topic: infra Core infrastructure related issues.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve Payload and apply to more data modality Payload for loading/caching image data
3 participants