How to Automatic Speech Recognition (ASR)? - VB

Learn how to build Automatic Speech Recognition (ASR) models using an encoder-decoder architecture built from convolutional and recurrent neural networks, and how to leverage Hugging Face and wav2vec 2.0 for pre-trained models and demos.

Key takeaways
  • Automatic Speech Recognition (ASR) is the process of converting spoken words into text while accounting for contextual and phonetic variation.
  • The waveform representation of speech is broken down into 10-millisecond snippets, from which features such as MFCCs (mel-frequency cepstral coefficients) are extracted.
  • An encoder-decoder architecture, comprising convolutional and recurrent neural networks, is used to process the speech.
  • Connectionist temporal classification (CTC) is used to match input and output sequences.
  • The model can learn to recognize words and phrases despite variations in pronunciation and dialect.
  • Hugging Face and wav2vec 2.0 are two prominent resources for ASR, providing pre-trained models and demos.
  • The billion-dollar question in ASR is how to reconcile multiple alignments and variations in human speech.
  • Applications of ASR include voice assistants, messaging, and transcribing phone calls.
  • The speaker invites feedback and questions, offering to share the demo code and links to further information.
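
To make the 10-millisecond snippet idea from the takeaways concrete, here is a minimal sketch in plain Python. It is not the speaker's demo code. The function names, the 16 kHz sample rate, and the synthetic sine-wave input are all assumptions made for illustration, and log energy stands in for a real feature such as MFCCs, which need a filter bank and a cosine transform that are not shown here.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=10):
    """Split samples into consecutive, non-overlapping frames of frame_ms milliseconds.

    Any trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def frame_log_energy(frame):
    """A deliberately simple per-frame feature (assumed stand-in for MFCCs)."""
    return math.log(sum(x * x for x in frame) + 1e-10)


# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]

frames = frame_signal(signal, sample_rate)
features = [frame_log_energy(f) for f in frames]

print(len(frames), "frames of", len(frames[0]), "samples each")
```

Real pipelines commonly use overlapping windows (for example a 25 ms window moved in 10 ms steps) rather than the non-overlapping frames used here; the sketch keeps them disjoint only to stay short.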
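
The takeaways on CTC and on reconciling multiple alignments can be illustrated with the standard CTC collapsing rule: merge repeated symbols, then drop the blank symbol. The sketch below is a hedged illustration rather than the talk's implementation. The blank character `_`, the toy vocabulary, and the per-frame probabilities are made up, and greedy decoding is just one simple way to read out a CTC-trained model.

```python
BLANK = "_"  # assumed blank symbol; real models use a reserved token index


def ctc_collapse(alignment, blank=BLANK):
    """Map a frame-level alignment to its output text.

    Repeated symbols are merged first, then blanks are removed, so a blank
    between two identical symbols keeps both of them.
    """
    out = []
    prev = None
    for sym in alignment:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)


def greedy_decode(frame_probs, vocab, blank=BLANK):
    """Pick the most likely symbol in each frame, then collapse the result."""
    best = [vocab[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    return ctc_collapse(best, blank)


# Two different frame-level alignments that describe the same word.
print(ctc_collapse("hh_e_ll_lo"))
print(ctc_collapse("_h_eel_l_oo"))

# Hypothetical per-frame probabilities over a toy vocabulary.
vocab = [BLANK, "a", "b"]
frame_probs = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
]
decoded = greedy_decode(frame_probs, vocab)
print(decoded)
```

Because many alignments collapse to the same text, CTC training sums the probability of all of them instead of requiring one hand-labelled alignment per utterance, which is how it matches input and output sequences of different lengths.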