Open Datasets for Hotword and Keyword Spotting

Keyword spotting detects keywords and phrases within audio streams. It is called wake word, wake-up word, hotword, trigger word, or voice activation when used to activate or control software. It is called monitoring, moderation, or voice search when used to detect keywords or phrases in audio streams such as phone calls, broadcasts, social media, or voice chat. Detected keywords and phrases provide insights to adjust the urgency of calls, track brand awareness and competition, or filter profanity.

Picovoice uses open-source datasets to create transparent and reproducible benchmark frameworks that help developers find the best speech-to-text, noise suppression, keyword and phrase search, natural language understanding, wake word, and voice activity detection software. While building a reproducible benchmark framework to showcase the performance of Picovoice’s keyword spotting engine, Porcupine Wake Word, we had difficulty finding open-source datasets. Hence, we indexed the most popular open-source keyword spotting datasets.

Open-source Dataset for Alexa:

Picovoice created and open-sourced the Alexa dataset, which consists of 329 recordings. Audio files are crowdsourced and published on Picovoice’s GitHub. The dataset is under Apache 2.0, allowing commercial use and modifications.

Open-source Dataset for Computer:

Picovoice’s open-source Computer dataset consists of 411 recordings. Audio files are crowdsourced from distinct speakers and published on Picovoice’s GitHub page. Mycroft also has an open-source Computer dataset with 59 recordings on its GitHub.

Open-source Dataset for Jarvis:

Picovoice’s open-source Jarvis dataset is crowdsourced from distinct speakers and includes 384 recordings. The dataset is published on Picovoice’s GitHub under Apache 2.0, allowing commercial use and modifications.

Open-source Dataset for Smart Mirror:

The open-source Smart Mirror dataset has 369 recordings crowdsourced from distinct speakers. Picovoice open-sourced the dataset under Apache 2.0.

Open-source Dataset for Snowboy:

Snowboy is a keyword-spotting engine developed by Kitt.ai. Kitt.ai stopped maintaining Snowboy in 2020 after being acquired by Baidu. However, Snowboy was popular in 2018, when we first published the benchmark, so Picovoice included it in its Open-source Wake Word Benchmark and crowdsourced recordings of the wake word Snowboy from distinct speakers. The open-source Snowboy dataset, with 401 recordings, is on Picovoice’s GitHub.

Open-source Dataset for View Glass:

The open-source View Glass dataset consists of 399 recordings crowdsourced from distinct speakers. Picovoice open-sourced the dataset under Apache 2.0.

Open-source Dataset for Hey Snips:

Hey Snips is the wake word of Snips, a smart speaker and voice assistant company acquired by Sonos. Snips open-sourced the dataset for its wake word, Hey Snips, which consists of 11,000 recordings. After the acquisition, the dataset now belongs to Sonos. Sonos shares the dataset for academic and research purposes only, after evaluating each request.

Open-source Dataset for Amelia:

The Mycroft community crowdsourced the open-source Amelia dataset. The dataset has 223 recordings released into the public domain.

Open-source Speech Commands Dataset:

The Speech Commands Dataset was created by the TensorFlow and AIY teams. It includes common words such as Yes and No, digits, and directions. The dataset has 65,000 short utterances donated by thousands of different people.

Hey Snapdragon Keyword Dataset:

Qualcomm’s Hey Snapdragon Keyword Dataset has 4,270 recordings across four English keywords: Hey Android, Hey Snapdragon, Hi Galaxy, and Hi Lumina. The dataset is intended solely for research purposes.

Keyword spotting is a small piece of technology, so it is common to assume that building an engine is easy. Building one that is both efficient and accurate is not. A good keyword-spotting engine should be:

efficient and use-case specific, hence no use of speech-to-text
accurate, measured by the false rejection rate (FRR) and false acceptance rate (FAR)
on the device, not in the cloud
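
As a rough illustration of the accuracy criterion, the sketch below computes a false rejection rate and a false alarm figure from raw benchmark counts. It is a minimal sketch and not Picovoice's benchmark code: the BenchmarkCounts structure, its field names, and the sample numbers are hypothetical, and reporting false acceptances as false alarms per hour of keyword-free audio is one common convention assumed here.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkCounts:
    """Raw counts from one benchmark run (hypothetical structure)."""
    keyword_utterances: int   # test clips that contain the keyword
    detected_keywords: int    # keyword clips the engine correctly detected
    false_alarms: int         # detections triggered by keyword-free audio
    background_hours: float   # hours of keyword-free audio played


def false_rejection_rate(counts: BenchmarkCounts) -> float:
    """Fraction of keyword utterances the engine missed."""
    if counts.keyword_utterances <= 0:
        raise ValueError("need at least one keyword utterance")
    missed = counts.keyword_utterances - counts.detected_keywords
    return missed / counts.keyword_utterances


def false_alarms_per_hour(counts: BenchmarkCounts) -> float:
    """False detections per hour of keyword-free audio (assumed convention)."""
    if counts.background_hours <= 0:
        raise ValueError("need a positive amount of background audio")
    return counts.false_alarms / counts.background_hours


if __name__ == "__main__":
    # Made-up results for illustration only: 329 keyword clips,
    # 310 detected, 2 false alarms over 10 hours of background audio.
    counts = BenchmarkCounts(
        keyword_utterances=329,
        detected_keywords=310,
        false_alarms=2,
        background_hours=10.0,
    )
    print(f"FRR: {false_rejection_rate(counts):.4f}")
    print(f"False alarms per hour: {false_alarms_per_hour(counts):.2f}")
```

The two numbers trade off against each other: raising an engine's detection sensitivity usually lowers the false rejection rate and raises the false alarm rate, so engines are best compared at a matched false alarm level.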

Picovoice’s Porcupine offers enterprises and their users the best keyword-spotting experience with unmatched accuracy. Developers can instantly train custom keywords on the no-code Picovoice Console and get Porcupine Wake Word running in minutes. Picovoice Consulting fine-tunes models for even higher accuracy or better performance and ports Porcupine to other platforms for more flexibility, all without requesting training data.
