A Visual-Acoustic Modeling Framework for Robust Dysarthric Speech Recognition Using Synthetic Visual Augmentation and Transfer Learning
Abstract
Dysarthria is a motor speech disorder that impairs an individual's control of the speech muscles, seriously limiting their ability to communicate and to interact with digital devices. Automatic Speech Recognition (ASR) systems have improved dramatically, yet they remain limited on dysarthric speech, particularly in severe cases where speakers cannot produce phonemes consistently; progress is further hindered by scarce training data and unreliable phoneme labelling. We present Speech Vision (SV), a visual-acoustic modelling framework for dysarthria-targeted ASR. Rather than relying on audio alone, SV transforms speech into visual spectrogram representations and trains deep neural networks to recognize the shape of each phoneme rather than its acoustic variability when spoken, shifting the problem away from the traditional acoustic phoneme modelling that struggles with the central challenges of dysarthric speech. To address data scarcity, SV applies visual data augmentation, synthesizing dysarthric spectrograms with Generative Adversarial Networks (GANs) and time-frequency distortions. In addition, transfer learning adapts models pre-trained on healthy speech to dysarthric speech, improving robustness and generalization. We evaluate SV against existing systems, DeepSpeech, DysarthricGAN-ASR, and Transfer-ASR, on the UA-Speech dataset. SV improved recognition accuracy for 67% of the speakers, by an average of 18.5%, and significantly reduced the average Word Error Rate (WER), particularly for severe dysarthria. By combining visual learning, synthetic augmentation, and transfer learning in a single pipeline, SV offers a new approach to dysarthric ASR and improves the accessibility of ASR for speech-impaired populations.
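To make the visual front end and the distortion-based augmentation concrete, the following is a minimal sketch, assuming a Python environment with librosa and numpy; the function names, masking parameters, and mean-fill strategy are illustrative assumptions rather than the authors' implementation, and the GAN-based spectrogram synthesis is omitted.

```python
# Minimal sketch (not the authors' implementation) of a spectrogram front end
# plus simple time-frequency distortion augmentation for dysarthric speech.
# Assumptions: librosa and numpy are available; masking sizes are illustrative.
import numpy as np
import librosa


def audio_to_log_mel(path, sr=16000, n_mels=80):
    """Convert a speech recording into a log-mel spectrogram 'image'."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)


def time_frequency_distort(spec, max_freq_mask=12, max_time_mask=30, rng=None):
    """Apply SpecAugment-style masking as a stand-in for the time-frequency
    distortions used to augment scarce dysarthric training data."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    # Mask a random band of mel-frequency channels.
    f = int(rng.integers(0, max_freq_mask))
    f0 = int(rng.integers(0, max(1, n_mels - f)))
    out[f0:f0 + f, :] = out.mean()
    # Mask a random span of time frames.
    t = int(rng.integers(0, max_time_mask))
    t0 = int(rng.integers(0, max(1, n_frames - t)))
    out[:, t0:t0 + t] = out.mean()
    return out
```

In such a pipeline, each utterance would be converted once with audio_to_log_mel and then expanded into several distorted copies via time_frequency_distort before being fed, alongside GAN-synthesized spectrograms, to a network pre-trained on healthy speech.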