How to Automatic Speech Recognition(ASR)? - VB

Learn how to build Automatic Speech Recognition (ASR) models using an encoder-decoder architecture with convolutional and recurrent neural networks, and how to use Hugging Face and wav2vec 2.0 for pre-trained models and demos.

Key takeaways

Automatic Speech Recognition (ASR) is the process of converting spoken words into text while accounting for contextual and phonetic variation.
The waveform representation of speech is broken down into 10-millisecond snippets, from which features such as Mel-frequency cepstral coefficients (MFCCs) are extracted.
An encoder-decoder architecture, comprising convolutional and recurrent neural networks, is used to process the speech.
Connectionist temporal classification (CTC) is used to match input and output sequences.
The model can learn to recognize words and phrases despite variations in pronunciation and dialect.
Hugging Face and wav2vec 2.0 are two prominent resources for ASR, providing pre-trained models and demos.
The billion-dollar question in ASR is how to reconcile multiple alignments and variations in human speech.
Applications of ASR include voice assistants, messaging, and transcribing phone calls.
The speaker invites feedback and questions, offering to share the demo code and links to further information.
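
To make the 10-millisecond snippets mentioned in the takeaways concrete, here is a minimal sketch in plain Python. The function name `frame_waveform` and the 16 kHz sample rate are assumptions for illustration, not taken from the talk. It cuts the waveform into non-overlapping frames only; production pipelines commonly use overlapping windows, and the MFCC computation itself is not shown.

```python
def frame_waveform(samples, sample_rate, frame_ms=10):
    """Split a sequence of audio samples into consecutive, non-overlapping frames.

    Hypothetical helper for illustration: each frame covers `frame_ms`
    milliseconds of audio. A trailing partial frame is dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("sample_rate and frame_ms must give at least one sample per frame")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of silence at an assumed 16 kHz sample rate.
frames = frame_waveform([0.0] * 16000, sample_rate=16000)
print(len(frames), len(frames[0]))  # 100 160
```

Each of the resulting frames would then be passed to a feature extractor before reaching the encoder.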
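
The way CTC reconciles multiple alignments can be sketched with its collapsing rule: merge consecutive repeated labels, then drop the blank symbol. The example below is a simplified greedy illustration, assuming the model has already picked one label per frame. The name `ctc_greedy_collapse` and the use of `_` as the blank are illustrative choices, not the talk's code.

```python
BLANK = "_"  # assumed blank symbol for this sketch


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame label sequence into an output string.

    Applies the CTC collapsing rule: merge consecutive duplicates,
    then remove blanks. A blank between two identical labels keeps
    them as separate characters (needed for the double "l" in "hello").
    """
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)


# Two different frame-level alignments collapse to the same transcript.
print(ctc_greedy_collapse(list("hh_e_ll_ll_oo")))  # hello
print(ctc_greedy_collapse(list("h_eel_l_o")))      # hello
```

During training, CTC sums the probability over all alignments that collapse to the target text, which is why a slower or faster pronunciation of the same word can map to the same output.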
