OpenAI’s open-source speech-to-text model Whisper became one of the most popular transcription engines in less than a year. OpenAI trained Whisper on 680,000 hours of multilingual data collected from the web. Although OpenAI didn’t open-source the training dataset, many open-source speech corpora are available for developers to train or test speech-to-text models. OpenSLR is a good starting point for finding speech corpora. Here’s a list of the most popular open speech-to-text datasets:

LibriSpeech by Panayotov et al.:

LibriSpeech is one of the most popular open-source speech-to-text datasets, if not the most popular one. It consists of 1,000 hours of English speech suitable for training and evaluating speech recognition systems. It is derived from the LibriVox project audiobooks.

Multilingual LibriSpeech by Pratap et al.:

Multilingual LibriSpeech is another open-source dataset that researchers built from LibriVox audiobooks. It consists of 44.5K hours of English and 6K hours of speech in seven other languages: German, Dutch, Spanish, French, Italian, Portuguese, and Polish.

Common Voice by Mozilla:

Common Voice is a crowdsourcing project initiated by Mozilla in 2017. Volunteers donate and validate recordings to create a free and open-source speech corpus, so that researchers can train or test speech recognition software even if they don’t have Big Tech resources. The dataset contains ~28K recorded hours in 112 languages, of which ~19K hours have been validated.

TED-LIUM by Laboratoire d’Informatique de l’Université du Maine (LIUM):

The TED-LIUM corpus is derived from TED talks and was created by researchers at Laboratoire d’Informatique de l’Université du Maine. Hence, the name TED-LIUM. The initial release of TED-LIUM consists of 118 hours of speech with text transcripts and is 20 GB in size. The 3rd release of TED-LIUM is 51 GB and has 452 hours of audio. All releases are published under Creative Commons BY-NC-ND 3.0, which does not permit commercial use.

Podcast Dataset by Spotify:

Spotify created and open-sourced the dataset using ~200,000 podcast episodes with a median duration of ~31 minutes. The dataset contains professional and self-hosted podcasts in English and Portuguese. The 2 TB dataset pairs each audio file with a text transcript. Researchers can apply for access to the dataset and may use it for non-commercial research purposes only.

Train and Test

Researchers in enterprises and academia use open-source datasets to train or test speech-to-text models. Enterprises that have industry-specific terminology, have privacy concerns, or operate in highly regulated industries frequently attempt to build their own models. A custom speech-to-text model that accurately captures the terminology and keeps data private is a great idea, but the execution can be costly. There are other ways to improve speech-to-text accuracy, and using on-device speech recognition is the easiest way to ensure speech-to-text data privacy.

Picovoice open-sourced its internal speech-to-text benchmark framework and published WER (Word Error Rate) results on the open-source datasets LibriSpeech, Common Voice, and TED-LIUM to give developers a head start in evaluating speech-to-text engines. While WER is a numerical value that can be compared easily, there are nuances to understand before drawing any conclusions. Plus, there are other factors one should consider before choosing a speech-to-text engine. Picovoice Speech-to-Text is ideal for enterprises looking for custom, private, and accurate engines with zero latency!
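
For readers new to the metric, WER is commonly defined as the number of word-level substitutions, deletions, and insertions needed to turn the engine's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that definition only. It is not the code from Picovoice's benchmark framework, and the function name word_error_rate and the choice to lowercase and split on whitespace are assumptions made for this example. Normalization choices like these are one of the nuances that can change the resulting number.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: word-level edit distance / number of reference words.

    Assumption for this sketch: text is normalized only by lowercasing and
    splitting on whitespace. Real benchmarks may also strip punctuation,
    expand numbers, and so on, which changes the score.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words, so it is not a percentage of "words wrong" in the strict sense.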
