A Visual-Acoustic Modeling Framework for Robust Dysarthric Speech Recognition Using Synthetic Visual Augmentation and Transfer Learning
Abstract
Dysarthria is a motor speech disorder that impairs an individual's control of the speech muscles, seriously limiting their ability to communicate and to interact with digital devices. Automatic Speech Recognition (ASR) systems have improved dramatically, yet they remain limited on dysarthric speech, particularly in severe cases where speakers cannot produce phonemes consistently; progress is further hindered by scarce training data and unreliable phoneme labelling. We present Speech Vision (SV), a visual-acoustic modelling framework for dysarthria-targeted ASR. Rather than relying on audio alone, SV transforms speech into visual spectrogram representations and trains deep neural networks to recognize the shape of each phoneme rather than its acoustic variability when spoken, shifting the problem away from the traditional acoustic phoneme modelling that struggles with the central challenges of dysarthric speech. To address data scarcity, SV applies visual data augmentation, synthesizing dysarthric spectrograms with Generative Adversarial Networks (GANs) and time-frequency distortions. In addition, transfer learning adapts models pre-trained on healthy speech to dysarthric speech, improving robustness and generalization. We evaluate SV against existing systems, DeepSpeech, DysarthricGAN-ASR, and Transfer-ASR, on the UA-Speech dataset. SV improved recognition accuracy for 67% of the speakers, by an average of 18.5%, and significantly reduced the average Word Error Rate (WER), particularly for severe dysarthria. By combining visual learning, synthetic augmentation, and transfer learning in a single pipeline, SV offers a new approach to dysarthric ASR and improves the accessibility of ASR for speech-impaired populations.
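To make the visual front end and the distortion-based augmentation concrete, the following is a minimal sketch, assuming a Python environment with librosa and numpy; the function names, masking parameters, and mean-fill strategy are illustrative assumptions rather than the authors' implementation, and the GAN-based spectrogram synthesis is omitted.

```python
# Minimal sketch (not the authors' implementation) of a spectrogram front end
# plus simple time-frequency distortion augmentation for dysarthric speech.
# Assumptions: librosa and numpy are available; masking sizes are illustrative.
import numpy as np
import librosa


def audio_to_log_mel(path, sr=16000, n_mels=80):
    """Convert a speech recording into a log-mel spectrogram 'image'."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)


def time_frequency_distort(spec, max_freq_mask=12, max_time_mask=30, rng=None):
    """Apply SpecAugment-style masking as a stand-in for the time-frequency
    distortions used to augment scarce dysarthric training data."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    # Mask a random band of mel-frequency channels.
    f = int(rng.integers(0, max_freq_mask))
    f0 = int(rng.integers(0, max(1, n_mels - f)))
    out[f0:f0 + f, :] = out.mean()
    # Mask a random span of time frames.
    t = int(rng.integers(0, max_time_mask))
    t0 = int(rng.integers(0, max(1, n_frames - t)))
    out[:, t0:t0 + t] = out.mean()
    return out
```

In such a pipeline, each utterance would be converted once with audio_to_log_mel and then expanded into several distorted copies via time_frequency_distort before being fed, alongside GAN-synthesized spectrograms, to a network pre-trained on healthy speech.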