Today, there is no shortage of architectures for building a Speech-to-Text
(STT
) engine. No need to follow strict
and convoluted recipes anymore. The choice of architecture has implications for product, timeline, and level of investments.
Speech-to-Text
, Automatic Speech Recognition
(ASR
), and Large-Vocabulary Speech Recognition
(LVSR
) are interchangeable terms.
What is End-to-End Speech Recognition?
An End-to-end (deep learning) model accepts raw data and produces the labels directly. An end-to-end ASR receives raw audio and produces text as output.
Most "end-to-end" speech recognition architectures accept a low-level feature rather than raw PCM. Spectrogram
and
(biology-inspired) Filterbanks
are the most common. Some end-to-end systems (optionally) fuse an external
Language Model
(LM
).
What is Hybrid Speech-to-Text?
We represent spoken sounds with Phonemes
. International Phonetic Alphabet
(IPA
) and ARPABET
are examples of
phonetic transcription codes.
The translation between phonemes and text (and vice versa) is not one-to-one, especially in languages such as English.
The hybrid systems break the problem into recognizing sounds and then transducing a sequence of sounds to words. A
Deep Neural Network
(DNN
) is the tool of choice for Phoneme Recognition
. The transduction of phonemes to text requires
using a graph called Weighted Finite State Transducer
(WFST
).
State of Play
Commercially available systems are hybrid systems. Why? Legacy reasons and (or) that it is much easier to get the information
a production system requires from hybrid models. Practical applications of ASR need more than just transcription. They need
Word Timestamps
, Word-Level Confidence
, and Alternative Transcriptions
. Last but not least, they require a method
for rapid and economic adaptation.
The Trend
The trend is towards end-to-end models.
- They are much easier to build.
Kaldi
is the crown jewel of hybrid systems and is hard to understand. People know how to use it as a black box, but only a handful understand the algorithms and their implementations. - Numerous research papers from Big Tech achieve unbelievably great results using end-to-end models.
The math of hybrid systems is just much harder to grasp.
The Decision
There is no straightforward answer for choosing between hybrid and end-to-end architecture. The decision depends on goals and resources.
- How much data do you have? End-to-end systems work great given many thousands of hours of labelled audio.
- How fast does your data change? e.g. new words showing up in data. If it is changing fast, you need to keep retraining. Otherwise, you get into a data mismatch problem. Hybrid systems are easier to retrain as you can adapt acoustic and language models separately.
- Do you have in-house voice recognition expertise? If not, building a hybrid system can be impractical. End-to-end systems reduce development time significantly.