Add audio support to DataPack (#585)
* Add audio support to DataPack
* Fix PR comments
* Update soundfile dependency and docs
* Update README for extra requirements
* Add soundfile to docs/requirements.txt
* Update README for extra requirements
* Add wikipedia extra req to README

Co-authored-by: Suqi Sun <[email protected]>
Showing 13 changed files with 265 additions and 1 deletion.
@@ -0,0 +1,5 @@
This folder contains data samples that Forte uses to support its test cases.

# List of Data Samples
## audio_reader_test
This directory contains the audio files used by the unit test for `AudioReader` in `forte/tests/forte/data/readers/audio_reader_test.py`. It currently holds two `.flac` files excerpted from [patrickvonplaten/librispeech_asr_dummy](https://huggingface.co/datasets/patrickvonplaten/librispeech_asr_dummy), a Hugging Face dataset for automatic speech recognition.
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,26 @@
# Audio Processing

## Audio DataPack
`DataPack` provides a payload for audio data and a metadata field for its sample rate. You can set both by calling the `set_audio` method:
```python
import numpy as np
from forte.data.data_pack import DataPack

# Placeholder waveform: one second of silence at 16 kHz
sample_rate: int = 16000
audio: np.ndarray = np.zeros(sample_rate)

pack: DataPack = DataPack()
pack.set_audio(audio, sample_rate)
```
The `audio` parameter should be a NumPy array containing the raw waveform, and `sample_rate` should be an integer giving the sample rate in Hz. You can then access these values through `DataPack.audio` and `DataPack.sample_rate`.
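Since `DataPack.audio` holds the raw waveform and `DataPack.sample_rate` its rate, derived quantities such as duration follow directly from the two. The helper below is a hypothetical sketch, not part of Forte's API. It assumes frames lie along the first axis, as in `soundfile.read` output (1-D for mono, `(frames, channels)` for multichannel audio), so `len(audio)` counts frames in both cases.

```python
from collections.abc import Sized


def audio_duration_seconds(audio: Sized, sample_rate: int) -> float:
    """Return the duration of a waveform in seconds.

    Hypothetical helper, not part of Forte. Assumes frames lie along the
    first axis (as in ``soundfile.read`` output), so ``len(audio)`` is the
    number of frames for both mono and multichannel data.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be a positive integer")
    return len(audio) / sample_rate


# With a DataPack (requires forte):
#     duration = audio_duration_seconds(pack.audio, pack.sample_rate)
print(audio_duration_seconds([0.0] * 32000, 16000))  # → 2.0
```

Counting frames rather than total samples keeps the result correct for stereo files, where a `(frames, 2)` array holds twice as many samples as frames.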
## Audio Reader
`AudioReader` reads audio data from the files under a given directory. Set it as the reader of your Forte pipeline whenever you need to process audio files:
```python
from forte.pipeline import Pipeline
from forte.data.readers.audio_reader import AudioReader

Pipeline().set_reader(
    reader=AudioReader(),
    config={"file_ext": ".wav"}
).run(
    "path-to-audio-directory"
)
```
The example above builds a simple pipeline that walks through the specified directory and loads every file with the `.wav` extension (the reader's default extension is `.flac`). `AudioReader` creates one `DataPack` per file, holding that file's audio payload and sample rate.
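Per the reader's source, `AudioReader._collect` delegates file discovery to Forte's `dataset_path_iterator`, passing the configured `file_ext`. The function below is a hypothetical, stdlib-only approximation of that step, assuming a recursive walk that yields every path whose file name ends with the extension; the real helper's ordering and matching rules may differ.

```python
import os
import tempfile
from typing import Iterator


def iter_audio_files(root: str, file_ext: str = ".flac") -> Iterator[str]:
    """Yield paths under ``root`` whose file names end with ``file_ext``.

    Hypothetical approximation of the walk done by Forte's
    ``dataset_path_iterator``; not the library's implementation.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        # Sort names so the order is deterministic across platforms
        for name in sorted(filenames):
            if name.endswith(file_ext):
                yield os.path.join(dirpath, name)


# Usage: with file_ext=".wav", the .flac file is skipped
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.wav", "b.flac", "c.wav"):
        open(os.path.join(tmp, name), "wb").close()
    found = [os.path.basename(p) for p in iter_audio_files(tmp, ".wav")]

print(found)  # → ['a.wav', 'c.wav']
```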
@@ -11,6 +11,7 @@ Welcome to Forte's documentation!

   examples.md
   ontology_generation.md
   audio_processing.md

API
====
@@ -37,3 +37,6 @@ nltk==3.4.5
# FastAPI
fastapi==0.65.2
uvicorn==0.14.0

# soundfile
soundfile>=0.10.3
@@ -0,0 +1,84 @@
# Copyright 2022 The Forte Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The reader that reads audio files into DataPacks.
"""
import os
from typing import Any, Iterator

from forte.data.data_pack import DataPack
from forte.data.data_utils_io import dataset_path_iterator
from forte.data.base_reader import PackReader

__all__ = [
    "AudioReader",
]


class AudioReader(PackReader):
    r""":class:`AudioReader` is designed to read in audio files."""

    try:
        import soundfile  # pylint: disable=import-outside-toplevel
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError(
            "AudioReader requires 'soundfile' package to be installed."
            " You can run 'pip install soundfile' or 'pip install forte"
            "[audio_ext]'. Note that additional steps might apply to Linux"
            " users (refer to "
            "https://pysoundfile.readthedocs.io/en/latest/#installation)."
        ) from e

    def _collect(self, audio_directory) -> Iterator[Any]:  # type: ignore
        r"""Should be called with param ``audio_directory`` which is a path to a
        folder containing audio files.

        Args:
            audio_directory: audio directory containing the files.

        Returns: Iterator over paths to audio files
        """
        return dataset_path_iterator(audio_directory, self.configs.file_ext)

    def _cache_key_function(self, audio_file: str) -> str:
        return os.path.basename(audio_file)

    def _parse_pack(self, file_path: str) -> Iterator[DataPack]:
        pack: DataPack = DataPack()

        # Read in audio data and store in DataPack
        audio, sample_rate = self.soundfile.read(
            file=file_path, **(self.configs.read_kwargs or {})
        )
        pack.set_audio(audio=audio, sample_rate=sample_rate)
        pack.pack_name = file_path

        yield pack

    @classmethod
    def default_configs(cls):
        r"""This defines a basic configuration structure for audio reader.
        Here:

        - file_ext (str): The file extension to find the target audio files
          under a specific directory path. Default value is ".flac".
        - read_kwargs (dict): A dictionary containing all the keyword
          arguments for `soundfile.read` method. For details, refer to
          https://pysoundfile.readthedocs.io/en/latest/#soundfile.read.
          Default value is None.

        Returns: The default configuration of audio reader.
        """
        return {"file_ext": ".flac", "read_kwargs": None}
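`AudioReader` runs `import soundfile` inside the class body, so the module is bound as a class attribute and `_parse_pack` can reach it as `self.soundfile`. The stdlib-only sketch below shows the same mechanism with a hypothetical `Reader` class, using `json` as a stand-in for `soundfile` so that it runs anywhere.

```python
class Reader:
    """Hypothetical reader that imports its dependency in the class body."""

    try:
        # ``json`` stands in for an optional package such as ``soundfile``
        import json  # pylint: disable=import-outside-toplevel
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError(
            "Reader requires the 'json' module to be installed."
        ) from e

    def parse(self, text: str) -> dict:
        # The import bound ``json`` in the class namespace, so it is looked
        # up through the instance like any other class attribute.
        return self.json.loads(text)


print(Reader().parse('{"sample_rate": 16000}'))  # → {'sample_rate': 16000}
```

One consequence of this pattern is that the `ModuleNotFoundError` fires when the class body executes, that is, whenever the defining module is imported, even if no reader is ever constructed. Performing the import in `__init__` and storing the module on `self` would defer the failure until instantiation.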
@@ -0,0 +1,104 @@
# Copyright 2022 The Forte Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Unit tests for AudioReader.
"""
import os
import unittest
from typing import Dict
from torch import argmax
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

from forte.common.configuration import Config
from forte.common.resources import Resources
from forte.common.exception import ProcessFlowException
from forte.data.data_pack import DataPack
from forte.data.readers import AudioReader
from forte.pipeline import Pipeline
from forte.processors.base.pack_processor import PackProcessor


class TestASRProcessor(PackProcessor):
    """
    An audio processor for automatic speech recognition.
    """
    def initialize(self, resources: Resources, configs: Config):
        super().initialize(resources, configs)

        # Initialize tokenizer and model
        pretrained_model: str = "facebook/wav2vec2-base-960h"
        self._tokenizer = Wav2Vec2Processor.from_pretrained(pretrained_model)
        self._model = Wav2Vec2ForCTC.from_pretrained(pretrained_model)

    def _process(self, input_pack: DataPack):
        required_sample_rate: int = 16000
        if input_pack.sample_rate != required_sample_rate:
            raise ProcessFlowException(
                f"A sample rate of {required_sample_rate} Hz is required by the"
                " pretrained model."
            )

        # tokenize
        input_values = self._tokenizer(
            input_pack.audio, return_tensors="pt", padding="longest"
        ).input_values  # Batch size 1

        # take argmax and decode
        transcription = self._tokenizer.batch_decode(
            argmax(self._model(input_values).logits, dim=-1)
        )

        input_pack.set_text(text=transcription[0])


class AudioReaderPipelineTest(unittest.TestCase):
    """
    Test AudioReader by running audio processing pipelines.
    """

    def setUp(self):
        self._test_audio_path: str = os.path.abspath(
            os.path.join(
                os.path.dirname(os.path.abspath(__file__)),
                os.pardir,
                os.pardir,
                os.pardir,
                os.pardir,
                "data_samples/audio_reader_test"
            )
        )

        # Define and configure the pipeline
        self._pipeline = Pipeline[DataPack]()
        self._pipeline.set_reader(AudioReader())
        self._pipeline.add(TestASRProcessor())
        self._pipeline.initialize()

    def test_asr_pipeline(self):
        target_transcription: Dict[str, str] = {
            self._test_audio_path + "/test_audio_0.flac":
                "A MAN SAID TO THE UNIVERSE SIR I EXIST",
            self._test_audio_path + "/test_audio_1.flac": (
                "NOR IS MISTER QUILTER'S MANNER LESS INTERESTING "
                "THAN HIS MATTER"
            )
        }

        # Verify the ASR result of each datapack
        for pack in self._pipeline.process_dataset(self._test_audio_path):
            self.assertEqual(pack.text, target_transcription[pack.pack_name])


if __name__ == "__main__":
    unittest.main()
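`TestASRProcessor` turns model output into text by taking the `argmax` over the vocabulary at every frame and passing the resulting ids to `batch_decode`. For a CTC model such as `Wav2Vec2ForCTC`, this amounts to greedy CTC decoding: merge consecutive repeated ids, drop the blank token, and map the word-delimiter token to a space. The pure-Python sketch below illustrates that procedure on a toy vocabulary. The vocabulary, the use of `<pad>` as the blank, and `|` as the word delimiter are assumptions chosen to mirror Wav2Vec2's character vocabulary; the real `Wav2Vec2Processor.batch_decode` may apply further cleanup not shown here.

```python
from itertools import groupby
from typing import Dict, List

# Toy character vocabulary (assumption): "<pad>" doubles as the CTC blank
# and "|" marks word boundaries.
VOCAB: Dict[int, str] = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "L"}


def greedy_ctc_decode(
    frame_scores: List[List[float]],
    id_to_token: Dict[int, str],
    blank: str = "<pad>",
    delimiter: str = "|",
) -> str:
    """Decode per-frame token scores into text with greedy CTC decoding."""
    # 1. Pick the best-scoring token id at every frame (the argmax step)
    ids = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    # 2. Merge runs of the same id produced by consecutive frames
    merged = [token_id for token_id, _ in groupby(ids)]
    # 3. Drop blanks, then map word delimiters to spaces
    tokens = [id_to_token[i] for i in merged if id_to_token[i] != blank]
    return "".join(" " if t == delimiter else t for t in tokens).strip()


def one_hot(token_id: int, size: int = len(VOCAB)) -> List[float]:
    """Build a score vector that peaks at ``token_id``."""
    return [1.0 if i == token_id else 0.0 for i in range(size)]


# Frames spell H H I I | H I L L <blank> L; the blank keeps the double L.
frames = [one_hot(i) for i in [2, 2, 3, 3, 1, 2, 3, 4, 4, 0, 4]]
print(greedy_ctc_decode(frames, VOCAB))  # → HI HILL
```

The blank token is what lets CTC emit a genuinely repeated character: without the `<blank>` frame between the two `L` runs, step 2 would merge them into a single `L`.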