UMA-ASR

This repository is the official implementation of unimodal aggregation (UMA) for automatic speech recognition (ASR).

It consists of two works:

  1. for non-autoregressive offline ASR: "Unimodal Aggregation for CTC-based Speech Recognition" (ICASSP 2024)
  2. for streaming ASR: "Mamba for Streaming ASR Combined with Unimodal Aggregation" (submitted to ICASSP 2025)

Introduction

For Non-autoregressive Offline ASR

Unimodal aggregation (UMA) is proposed to segment and integrate the feature frames that belong to the same text token, and thus to learn better feature representations for text tokens. The frame-wise features and weights are both derived from an encoder. The feature frames with unimodal weights are then integrated and further processed by a decoder. Connectionist temporal classification (CTC) loss is applied for training. Moreover, integrating self-conditioned CTC into the proposed framework noticeably improves performance further.

The proposed UMA model
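The aggregation step described above can be illustrated with a minimal, dependency-free sketch. It is not the code in this repository: the function names `find_valleys` and `unimodal_aggregate` are hypothetical, and the segmentation rule (a new segment starts at every frame whose weight is a local minimum, a "valley") is an assumption made for illustration. The weights are assumed to be strictly positive, as they would be after a sigmoid.

```python
from typing import List, Sequence


def find_valleys(weights: Sequence[float]) -> List[int]:
    """Return interior indices t whose weight is a local minimum.

    Assumed rule (illustrative only): weights[t] <= both neighbours.
    The first and last frames are never treated as valleys.
    """
    return [
        t
        for t in range(1, len(weights) - 1)
        if weights[t] <= weights[t - 1] and weights[t] <= weights[t + 1]
    ]


def unimodal_aggregate(
    frames: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[List[float]]:
    """Pool frames into one vector per segment using a weighted average.

    frames:  T feature vectors of equal dimension (encoder outputs).
    weights: T strictly positive scalars, one per frame.
    """
    if len(frames) != len(weights):
        raise ValueError("frames and weights must have the same length")
    if not frames:
        return []

    # Segment boundaries: start of sequence, each valley, end of sequence.
    boundaries = [0] + find_valleys(weights) + [len(frames)]
    dim = len(frames[0])

    pooled = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        total = sum(weights[start:end])
        pooled.append(
            [
                sum(weights[t] * frames[t][d] for t in range(start, end)) / total
                for d in range(dim)
            ]
        )
    return pooled


# Five frames with two weight peaks, so two aggregated vectors are expected.
example_frames = [[0.0], [3.0], [4.0], [4.0], [4.0]]
example_weights = [0.5, 1.0, 0.5, 1.0, 0.5]
example_tokens = unimodal_aggregate(example_frames, example_weights)
```

In this sketch the output sequence is shorter than the input: each group of frames under one weight peak collapses to a single vector, which is what a decoder would then process.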

For Streaming ASR

Mamba, a recently proposed state space model, has demonstrated the ability to match or surpass Transformers in various tasks while benefiting from linear complexity. We explore the efficiency of the Mamba encoder for streaming ASR and propose an associated lookahead mechanism for leveraging controllable future information. Additionally, a streaming-style unimodal aggregation (UMA) method is implemented, which automatically detects token activity and triggers token output in a streaming manner, while aggregating feature frames to learn better token representations. Based on UMA, an early termination (ET) method is proposed to further reduce recognition latency.

The proposed Mamba-UMA model
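To make the idea of triggering token output while frames arrive more concrete, here is a hedged sketch of a frame-by-frame aggregator. It is an illustration under assumptions, not the repository's implementation: the class name `StreamingAggregator`, its methods, and the valley-based trigger rule are all hypothetical. Because a valley can only be confirmed once the next frame's weight is known, this sketch emits each aggregated vector one frame after the valley. The lookahead mechanism and early termination are not modelled.

```python
from typing import List, Optional, Sequence, Tuple


class StreamingAggregator:
    """Hypothetical streaming pooler: emits one vector per closed segment."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self._sum = [0.0] * dim          # running sum of weight * frame
        self._total = 0.0                # running sum of weights
        self._last_weight: Optional[float] = None  # weight before the pending frame
        self._pending: Optional[Tuple[Sequence[float], float]] = None

    def _commit(self, frame: Sequence[float], weight: float) -> None:
        """Add a frame to the currently open segment."""
        for d in range(self.dim):
            self._sum[d] += weight * frame[d]
        self._total += weight
        self._last_weight = weight

    def _close(self) -> List[float]:
        """Return the weighted average of the open segment and reset it."""
        pooled = [s / self._total for s in self._sum]
        self._sum = [0.0] * self.dim
        self._total = 0.0
        return pooled

    def push(self, frame: Sequence[float], weight: float) -> Optional[List[float]]:
        """Feed one frame; return an aggregated vector if a segment just closed."""
        emitted = None
        if self._pending is not None:
            p_frame, p_weight = self._pending
            # Assumed trigger: the pending frame is a local minimum (valley).
            is_valley = (
                self._last_weight is not None
                and p_weight <= self._last_weight
                and p_weight <= weight
            )
            if is_valley and self._total > 0.0:
                emitted = self._close()
            self._commit(p_frame, p_weight)
        self._pending = (frame, weight)
        return emitted

    def flush(self) -> Optional[List[float]]:
        """Call at end of utterance to emit the last open segment, if any."""
        if self._pending is not None:
            self._commit(*self._pending)
            self._pending = None
        return self._close() if self._total > 0.0 else None


# Feed frames one at a time and collect vectors as they are triggered.
aggregator = StreamingAggregator(dim=1)
stream_frames = [[0.0], [3.0], [4.0], [4.0], [4.0]]
stream_weights = [0.5, 1.0, 0.5, 1.0, 0.5]
stream_tokens = []
for f, w in zip(stream_frames, stream_weights):
    out = aggregator.push(f, w)
    if out is not None:
        stream_tokens.append(out)
last = aggregator.flush()
if last is not None:
    stream_tokens.append(last)
```

The sketch keeps only a running weighted sum and a single pending frame, so memory use per utterance is constant regardless of length.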

Get started

  1. The proposed method is implemented using ESPnet2, so please make sure you have installed ESPnet successfully.
  2. Roll back ESPnet to the specified version as follows:
    git checkout v.202304
    
  3. Clone the UMA-ASR code:
    git clone https://github.com/Audio-WestlakeU/UMA-ASR
    
  4. Copy the configurations of the recipes in the egs2 folder to the corresponding directory in "espnet/egs2/". At present, experiments have only been conducted on the AISHELL-1, AISHELL-2, and HKUST datasets. If you want to experiment on other Chinese datasets, you can refer to these configurations.
  5. Copy the files in the espnet2 folder to the corresponding folder in "espnet/espnet2", and check that the comment path in the file header matches your path.
  6. To run experiments, follow the usual ESPnet steps. You can apply the UMA method simply by replacing run.sh on the command line with our run_unimodal.sh. For example:
    ./run_unimodal.sh --stage 10 --stop_stage 13
    
    Make sure the bash scripts are executable:
    chmod +x asr_unimodal.sh
    chmod +x run_unimodal.sh
    

Citation

You can cite these papers as follows:

@inproceedings{fang2024unimodal,
  title={Unimodal aggregation for CTC-based speech recognition},
  author={Fang, Ying and Li, Xiaofei},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={10591--10595},
  year={2024},
  organization={IEEE}
}

@article{fang2024mambauma,
    title={Mamba for Streaming ASR Combined with Unimodal Aggregation},
    author={Ying Fang and Xiaofei Li},
    journal={arXiv preprint arXiv:2410.00070},
    year={2024}
}
