ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2Vec2.0 Based ASR

Singh, Vishwanath Pratap; Malato, Federico; Hautamäki, Ville; Sahidullah, Md; Kinnunen, Tomi

Self archived version

published version

Date

2024

Author(s)

Singh, Vishwanath Pratap

Malato, Federico

Hautamäki, Ville

Sahidullah, Md

Kinnunen, Tomi

Unique identifier

10.21437/Interspeech.2024-1403

Citation

Singh, Vishwanath Pratap. Malato, Federico. Hautamäki, Ville. Sahidullah, Md. Kinnunen, Tomi. (2024). ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2Vec2.0 Based ASR. Interspeech, 2885-2889. 10.21437/Interspeech.2024-1403.

Rights

Licensed under

Abstract

While automatic speech recognition (ASR) greatly benefits from data augmentation, the augmentation recipes themselves tend to be heuristic. In this paper, we address one such heuristic, the choice of how much augmented data to use in ASR training, by introducing a reinforcement learning (RL) based dynamic adjustment of the original-to-augmented data ratio (OAR). Unlike the fixed OAR used in conventional data augmentation, our proposed method employs a deep Q-network (DQN) as the RL mechanism to learn the optimal dynamics of OAR throughout wav2vec2.0-based ASR training. We conduct experiments using the LibriSpeech dataset with varying amounts of training data, specifically the 10Min, 1H, 10H, and 100H splits, to evaluate the efficacy of the proposed method under different data conditions. Our proposed method, on average, achieves a relative improvement of 4.96% over the open-source wav2vec2.0 base model on standard LibriSpeech test sets.
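
To make the idea of a learned, dynamic OAR concrete, the following is a minimal sketch and not the authors' implementation. It swaps the paper's DQN for a tabular, epsilon-greedy Q-learning agent to stay short. The action set `OAR_ACTIONS`, the helper `mix_batch`, the class `OarAgent`, and the notions of state and reward are all assumptions made for illustration; the abstract does not specify them.

```python
import random

# Hypothetical discrete OAR choices: the fraction of each batch that is
# drawn from original (non-augmented) data. Not taken from the paper.
OAR_ACTIONS = [0.25, 0.5, 0.75, 1.0]


def mix_batch(original, augmented, oar, batch_size, rng):
    """Sample a batch in which round(oar * batch_size) items are original."""
    n_orig = round(oar * batch_size)
    batch = rng.choices(original, k=n_orig)
    batch += rng.choices(augmented, k=batch_size - n_orig)
    return batch


class OarAgent:
    """Epsilon-greedy Q-learning over OAR choices (tabular stand-in for a DQN)."""

    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon
        self.rng = random.Random(seed)

    def act(self, state):
        # Explore with probability epsilon, otherwise pick the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.q[state]))
        row = self.q[state]
        return row.index(max(row))

    def update(self, state, action, reward, next_state):
        # Standard one-step Q-learning update.
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.lr * (target - self.q[state][action])
```

In a training loop one might, for example, treat a coarse training-stage index as the state, call `act` to choose an entry of `OAR_ACTIONS`, build the next batches with `mix_batch`, and feed the resulting change in validation loss back through `update` as the reward. Those choices are likewise assumptions, not details confirmed by the abstract.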

URI

https://erepo.uef.fi/handle/123456789/33182

Link to the original item

https://doi.org/10.21437/Interspeech.2024-1403

Publisher

ISCA

Collections

Faculty of Science, Forestry and Technology