AMPS: ASR with Multimodal Paraphrase Supervision

Published at NAACL 2025 (Main Conference), 2025

Spontaneous and conversational multilingual speech poses significant challenges for state-of-the-art automatic speech recognition (ASR) systems. In this work, we introduce AMPS, a novel technique that enhances a multilingual multimodal ASR system by incorporating paraphrase-based supervision to improve ASR performance in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. Our approach leverages paraphrases of reference transcriptions as additional supervision during training and selectively applies this paraphrase objective to utterances with poor ASR accuracy. By integrating AMPS with the state-of-the-art multimodal model SeamlessM4T, we achieve notable relative reductions in word error rates (WERs) of up to 5%. We also provide a comprehensive analysis of our system using both objective metrics and human evaluations.
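At a high level, AMPS augments the standard ASR training objective with a paraphrase loss that is applied only to utterances the model currently recognizes poorly. The sketch below is a minimal illustration of one way such a gated combined loss might look; the function name `amps_loss`, the WER threshold, and the weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of AMPS-style selective paraphrase supervision.
# Assumptions (not from the paper): the names below, the threshold of 0.2,
# and the weighting factor `alpha` are hypothetical choices.

import torch

def amps_loss(asr_nll: torch.Tensor,
              paraphrase_nll: torch.Tensor,
              wer: torch.Tensor,
              wer_threshold: float = 0.2,
              alpha: float = 1.0) -> torch.Tensor:
    """Combine per-utterance ASR loss with a paraphrase loss that is
    applied only where the current WER exceeds a threshold.

    asr_nll, paraphrase_nll, wer: shape (batch,) tensors.
    """
    # Gate is 1.0 for poorly recognized utterances, 0.0 otherwise.
    gate = (wer > wer_threshold).float()
    return (asr_nll + alpha * gate * paraphrase_nll).mean()

# Toy usage with dummy per-utterance values.
asr_nll = torch.tensor([2.3, 0.4, 1.8])       # ASR negative log-likelihoods
para_nll = torch.tensor([1.1, 0.9, 1.5])      # paraphrase objective losses
wer = torch.tensor([0.45, 0.05, 0.30])        # per-utterance word error rates
print(amps_loss(asr_nll, para_nll, wer))
```

Under this reading, well-recognized utterances are trained with the ASR objective alone, while harder ones receive the extra paraphrase signal, which is where the selective application described in the abstract comes in.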

Recommended citation: Abhishek Gupta, Amruta Parulekar, Sameep Chattopadhyay, and Preethi Jyothi. 2025. AMPS: ASR with Multimodal Paraphrase Supervision. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 404–413, Albuquerque, New Mexico. Association for Computational Linguistics.
Download Paper | Download Slides