Is your feature request related to a problem? Please describe.
We want DataPack to support multimedia data representation, including text and audio. This would allow DataPack to handle audio processing tasks (e.g., audio classification, audio separation, and segmentation) and could also drive several multimedia use cases (e.g., automatic speech recognition (ASR)).
Describe the solution you'd like
Support an audio payload in DataPack as a numpy array, with sample_rate, span_unit, seq_length, and other related info added as metadata.
Currently, DataPack uses DataPack._text to store text data, so we could use DataPack._audio to store the audio payload. Correspondingly, we might need to add some basic operations on the audio data, such as DataPack.audio and DataPack.set_audio().
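A minimal sketch of how the proposed `_audio` field and its accessors might look, assuming the waveform is stored as a numpy array of shape `(samples,)` or `(samples, channels)` together with its sample rate. `AudioDataPack` is a hypothetical stand-in written for illustration only; it is not Forte's actual DataPack class, and the validation rules are assumptions.

```python
from typing import Optional

import numpy as np


class AudioDataPack:
    """Hypothetical stand-in for DataPack holding a text and an audio payload."""

    def __init__(self) -> None:
        self._text: str = ""
        self._audio: Optional[np.ndarray] = None  # proposed audio payload
        self._sample_rate: Optional[int] = None

    @property
    def text(self) -> str:
        return self._text

    @property
    def audio(self) -> Optional[np.ndarray]:
        # Read-only accessor for the stored waveform
        return self._audio

    @property
    def sample_rate(self) -> Optional[int]:
        return self._sample_rate

    def set_audio(self, audio: np.ndarray, sample_rate: int) -> None:
        # Store the raw waveform together with the sample rate needed
        # to interpret sample indices as time.
        if not isinstance(audio, np.ndarray):
            raise TypeError("audio must be a numpy array")
        if audio.ndim not in (1, 2):
            raise ValueError("expected shape (samples,) or (samples, channels)")
        if sample_rate <= 0:
            raise ValueError("sample_rate must be positive")
        self._audio = audio
        self._sample_rate = sample_rate


pack = AudioDataPack()
pack.set_audio(np.zeros(16000, dtype=np.float32), sample_rate=16000)
print(pack.audio.shape)  # (16000,)
```

Keeping the sample rate next to the array (rather than only in separate metadata) is one possible design choice; it guarantees the waveform is never stored without the information needed to interpret it.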
The audio data can be a numpy array, a data structure commonly used to store raw waveforms. Python libraries like Librosa and soundfile can load audio files in different formats (wav, flac, mp3, etc.) into a numpy array along with the sample rate. We can add an AudioReader to wrap this loading operation.
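To illustrate what an AudioReader would wrap, here is a sketch that uses the standard-library `wave` module in place of Librosa/soundfile so that it has no third-party audio dependency; it therefore only handles 16-bit PCM WAV. A real reader would more likely delegate to `soundfile.read(path)` or `librosa.load(path, sr=None)`, both of which return a `(waveform, sample_rate)` pair. The class name, method name, and float32 scaling are assumptions for illustration.

```python
import io
import wave
from typing import BinaryIO, Tuple, Union

import numpy as np


class AudioReader:
    """Hypothetical reader: loads a 16-bit PCM WAV file into a numpy array."""

    def read(self, source: Union[str, BinaryIO]) -> Tuple[np.ndarray, int]:
        with wave.open(source, "rb") as wav:
            if wav.getsampwidth() != 2:
                raise ValueError("this sketch only handles 16-bit PCM")
            sample_rate = wav.getframerate()
            n_channels = wav.getnchannels()
            raw = wav.readframes(wav.getnframes())
        # Interleaved little-endian int16 samples -> float32 in [-1.0, 1.0)
        samples = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
        if n_channels > 1:
            samples = samples.reshape(-1, n_channels)  # (frames, channels)
        return samples, sample_rate


# Usage: write a short silent stereo WAV into memory, then read it back.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(2)
    out.setframerate(8000)
    out.writeframes(np.zeros((100, 2), dtype="<i2").tobytes())
buf.seek(0)
audio, sr = AudioReader().read(buf)
print(audio.shape, sr)  # (100, 2) 8000
```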
We also need to store some metadata along with the waveform array. For example, the sample rate is the key information that determines the unit of timestamps. Right now we have a span_unit for text, so we should probably add a similar field for audio to handle different units (e.g., sample, segment, frame). Another example is channel, which specifies the number of channels in the audio (stereo or mono), though this can also be inferred from the shape of DataPack._audio. Other optional metadata includes bit-depth, which indicates the resolution of each audio sample. All of this information can be stored in the Meta class here.
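The metadata fields described above could be grouped roughly as follows. This is a sketch under stated assumptions: `AudioMeta`, `hop_length`, `infer_channels`, and `to_seconds` are hypothetical names, only the "sample" and "frame" units are handled, and a frame is assumed to advance by a fixed `hop_length` samples. It shows how the sample rate fixes the time unit of span offsets and how the channel count can be inferred from the array shape.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class AudioMeta:
    """Hypothetical audio fields that could live alongside DataPack's Meta."""

    sample_rate: int                    # samples per second, per channel
    span_unit: str = "sample"           # unit of span offsets: "sample" or "frame"
    hop_length: int = 1                 # samples per frame when span_unit == "frame"
    num_channels: Optional[int] = None  # 1 = mono, 2 = stereo
    bit_depth: Optional[int] = None     # e.g. 16 or 24 for PCM sources

    def infer_channels(self, audio: np.ndarray) -> int:
        # Assumed shape convention: (samples,) for mono, (samples, channels) otherwise
        self.num_channels = 1 if audio.ndim == 1 else audio.shape[1]
        return self.num_channels

    def to_seconds(self, offset: int) -> float:
        # Convert a span offset expressed in span_unit into seconds
        if self.span_unit == "sample":
            return offset / self.sample_rate
        if self.span_unit == "frame":
            return offset * self.hop_length / self.sample_rate
        raise ValueError(f"unsupported span_unit: {self.span_unit!r}")


meta = AudioMeta(sample_rate=16000, span_unit="frame", hop_length=160)
print(meta.to_seconds(100))  # 1.0
print(meta.infer_channels(np.zeros((16000, 2))))  # 2
```

With a 16 kHz sample rate and a 160-sample hop, frame offset 100 corresponds to 100 × 160 / 16000 = 1.0 s, which is why the sample rate has to travel with any span expressed in samples or frames.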