AMPS: ASR with Multimodal Paraphrase Supervision

Published in NAACL 2025, 2025

Spontaneous and conversational multilingual speech poses significant challenges for state-of-the-art automatic speech recognition (ASR) systems. In this work, we introduce AMPS, a novel technique that enhances a multilingual multimodal ASR system by incorporating paraphrase-based supervision to improve ASR performance in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. Our approach leverages paraphrases of reference transcriptions as additional supervision during training and selectively applies this paraphrase objective to utterances with poor ASR accuracy. By integrating AMPS with the state-of-the-art multimodal model SeamlessM4T, we achieve notable relative reductions in word error rates (WERs) of up to 5%. We also provide a comprehensive analysis of our system using both objective metrics and human evaluations.
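The selective supervision described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the WER threshold, the additive combination of losses, and the function names are all assumptions made for clarity.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # standard Levenshtein dynamic-programming table over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def training_loss(asr_loss: float, paraphrase_loss: float,
                  reference: str, hypothesis: str,
                  wer_threshold: float = 0.3) -> float:
    """Add the paraphrase objective only for poorly recognized utterances.

    wer_threshold is a hypothetical hyperparameter; the paper selects
    utterances by ASR accuracy but the exact criterion may differ.
    """
    if wer(reference, hypothesis) > wer_threshold:
        return asr_loss + paraphrase_loss
    return asr_loss
```

For a well-recognized utterance the loss reduces to the plain ASR objective; for a poorly recognized one, the paraphrase term provides additional supervision.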

Recommended citation: Parulekar, A., Gupta, A., Chattopadhyay, S., & Jyothi, P. (2024). AMPS: ASR with Multimodal Paraphrase Supervision. https://arxiv.org/abs/2411.18368
Download Paper