Today, there is no shortage of architectures for building a Speech-to-Text (STT) engine, and there is no longer any need to follow strict and convoluted recipes. The choice of architecture has implications for product, timeline, and level of investment.

Speech-to-Text, Automatic Speech Recognition (ASR), and Large-Vocabulary Speech Recognition (LVSR) are interchangeable terms.

What is End-to-End Speech Recognition?

An end-to-end (deep learning) model accepts raw data and produces labels directly. An end-to-end ASR system receives raw audio and produces text as output.

In practice, most "end-to-end" speech recognition architectures accept low-level features rather than raw PCM. Spectrograms and (biology-inspired) filterbank features are the most common. Some end-to-end systems optionally fuse an external Language Model (LM).
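To make "low-level feature" concrete, here is a minimal sketch of a magnitude spectrogram in plain Python. The frame length, hop size, Hann window, and synthetic test tone are all assumptions chosen for illustration. Real front ends use an FFT library and usually apply a filterbank and a log on top, which this sketch leaves out.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Apply a Hann window to one frame and return DFT magnitudes for bins 0..N/2."""
    n = len(frame)
    windowed = [x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    # Naive O(N^2) DFT: fine for a demo, far too slow for production.
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def spectrogram(samples, frame_len=64, hop=32):
    """Return one magnitude spectrum per frame (time on the outer axis)."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, frame_len, hop)]


# Synthetic input (an assumption for illustration): a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
FRAME_LEN = 64
tone = [math.sin(2.0 * math.pi * 1000.0 * t / SAMPLE_RATE) for t in range(256)]

spec = spectrogram(tone, frame_len=FRAME_LEN, hop=32)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(spec), "frames,", len(spec[0]), "bins, peak at", peak_hz, "Hz")
```

Each row of `spec` is what the network would see for one short slice of audio, instead of the raw samples themselves.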

What is Hybrid Speech-to-Text?

We represent spoken sounds with phonemes. The International Phonetic Alphabet (IPA) and ARPABET are examples of phonetic transcription codes.

The translation between phonemes and text (and vice versa) is not one-to-one, especially in languages such as English.
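A small sketch can show why the mapping is not one-to-one. The lexicon below is a handful of illustrative ARPABET-style entries (stress markers omitted), not an excerpt from any particular dictionary file, and the variable names are made up for this example.

```python
from collections import defaultdict

# Illustrative entries only: (word, pronunciation) pairs in ARPABET-style symbols.
LEXICON = [
    ("read", ("R", "IY", "D")),  # present tense
    ("read", ("R", "EH", "D")),  # past tense
    ("red", ("R", "EH", "D")),
    ("two", ("T", "UW")),
    ("too", ("T", "UW")),
]

word_to_prons = defaultdict(set)
pron_to_words = defaultdict(set)
for word, pron in LEXICON:
    word_to_prons[word].add(pron)
    pron_to_words[pron].add(word)

# One spelling, several pronunciations.
heteronyms = sorted(w for w, prons in word_to_prons.items() if len(prons) > 1)
# One pronunciation, several spellings.
homophones = sorted(sorted(ws) for ws in pron_to_words.values() if len(ws) > 1)

print("heteronyms:", heteronyms)
print("homophones:", homophones)
```

Both directions are ambiguous: "read" has two pronunciations, while the phoneme sequence R EH D could be written as "read" or "red".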

Hybrid systems break the problem into two steps: recognizing sounds, and then transducing the sequence of sounds into words. A Deep Neural Network (DNN) is the tool of choice for phoneme recognition. The transduction of phonemes to text uses a graph called a Weighted Finite-State Transducer (WFST).
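The second step can be sketched with a toy decoder. This is a deliberately simplified stand-in, not a real WFST: it takes an already-recognized phoneme sequence and finds the cheapest way to split it into words from a tiny lexicon. The lexicon, the costs (which play the role of negative log-probabilities), and the function name `decode` are all hypothetical.

```python
import math

# Hypothetical lexicon: pronunciation -> [(word, cost)]. The costs are made up.
LEXICON = {
    ("R", "EH", "D"): [("red", 1.0), ("read", 2.0)],
    ("R", "IY", "D"): [("read", 1.0)],
    ("T", "UW"): [("two", 0.7), ("too", 1.2), ("to", 1.5)],
    ("B", "UH", "K", "S"): [("books", 1.0)],
}


def decode(phonemes):
    """Return (total_cost, words) for the cheapest segmentation into lexicon words.

    Costs add along a path and the minimum is taken over paths.
    Returns (inf, []) if the sequence cannot be covered by the lexicon.
    """
    n = len(phonemes)
    # best[i] holds the cheapest way to explain the first i phonemes.
    best = [(math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            prev_cost, prev_words = best[start]
            if prev_cost == math.inf:
                continue
            for word, cost in LEXICON.get(tuple(phonemes[start:end]), []):
                total = prev_cost + cost
                if total < best[end][0]:
                    best[end] = (total, prev_words + [word])
    return best[n]


cost, words = decode(["T", "UW", "R", "EH", "D", "B", "UH", "K", "S"])
print(words, cost)
```

In this sketch the choice between "two", "too", and "to" comes down to fixed per-word costs. A production decoder also folds a language model into the graph so that context helps pick among homophones.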

State of Play

Commercially available systems are hybrid systems. Why? Partly for legacy reasons, and partly because it is much easier to get the information a production system requires from hybrid models. Practical applications of ASR need more than just transcription. They need word timestamps, word-level confidence, and alternative transcriptions. Last but not least, they require a method for rapid and economical adaptation.
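As a rough illustration of what "more than just transcription" means, the sketch below shows one possible shape for a production ASR result. The class and field names are hypothetical and do not correspond to any specific vendor's API. The timestamps and confidence values are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    text: str
    start_sec: float   # word timestamp: start
    end_sec: float     # word timestamp: end
    confidence: float  # word-level confidence in [0, 1]


@dataclass
class Hypothesis:
    words: List[Word]

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)


@dataclass
class Result:
    # Alternative transcriptions, best first.
    alternatives: List[Hypothesis] = field(default_factory=list)

    def low_confidence_words(self, threshold: float = 0.5) -> List[Word]:
        """Words in the top alternative that a downstream app may want to flag."""
        if not self.alternatives:
            return []
        return [w for w in self.alternatives[0].words if w.confidence < threshold]


# Invented values, for illustration only.
result = Result(alternatives=[
    Hypothesis([Word("two", 0.00, 0.21, 0.91), Word("red", 0.21, 0.48, 0.42),
                Word("books", 0.48, 0.95, 0.88)]),
    Hypothesis([Word("too", 0.00, 0.21, 0.55), Word("read", 0.21, 0.48, 0.40),
                Word("books", 0.48, 0.95, 0.88)]),
])
flagged = [w.text for w in result.low_confidence_words()]
print(result.alternatives[0].text, "| flagged:", flagged)
```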

The Trend

The trend is towards end-to-end models.

  • They are much easier to build. Kaldi is the crown jewel of hybrid systems and is hard to understand. People know how to use it as a black box, but only a handful understand the algorithms and their implementations.
  • Numerous research papers from Big Tech report very strong results using end-to-end models.

The math behind hybrid systems is also much harder to grasp.

The Decision

There is no straightforward answer when choosing between a hybrid and an end-to-end architecture. The decision depends on goals and resources.

  • How much data do you have? End-to-end systems work well given many thousands of hours of labelled audio.
  • How fast does your data change? For example, new words may keep showing up in the data. If it is changing fast, you need to keep retraining. Otherwise, you run into a data mismatch problem. Hybrid systems are easier to retrain because you can adapt the acoustic and language models separately.
  • Do you have in-house speech recognition expertise? If not, building a hybrid system can be impractical. End-to-end systems reduce development time significantly.