Open Datasets for Natural Language Understanding

Picovoice uses open-source datasets to create transparent and reproducible benchmark frameworks that help developers find the best speech-to-text, noise suppression, keyword and phrase search, natural language understanding, wake word, and voice activity detection software. Open-source datasets can be hard to find, so the Picovoice team has listed the best-known open-source speech corpora for natural language understanding.

Barista by Picovoice:

The Barista dataset consists of 619 utterances crowdsourced from more than 50 unique speakers ordering coffee in English. Example utterances are “Can I get a light roast 12 oz coffee?” and “Make me a medium cappuccino with sugar.” The dataset is available on Picovoice’s GitHub page. It is published under the Apache 2.0 license and can be used for commercial purposes.

Picovoice built a simple barista web demo to help visualize how a voice-controlled coffee maker can work.

SmartLights by Snips:

The SmartLights dataset covers six intents for turning lights on or off and changing their brightness or color, with a vocabulary of approximately 400 words in English. After Sonos acquired Snips, ownership of the dataset was transferred to Sonos. Sonos requires developers and researchers to fill out a form to request access. The SmartLights dataset cannot be used for commercial purposes.

Check out Picovoice’s Smart Lighting web demo to try the variety of utterances that can control the lights.

SmartSpeakers by Snips:

SmartSpeakers consists of two datasets for controlling a smart speaker, one in English and one in French. Example intents are NextSong, PreviousSong, ResumeMusic, VolumeDown, VolumeUp, and VolumeSet. As with SmartLights by Snips, distribution of the SmartSpeakers datasets is controlled by Sonos, which grants access to developers and researchers after reviewing their applications. The SmartSpeakers datasets cannot be used for commercial purposes.

Speech Commands Dataset by Fluent:

The Fluent Speech Commands dataset consists of simple voice assistant commands with 30,043 utterances from 97 speakers in English. Example utterances are “put on the music” or “turn up the heat in the kitchen.” Fluent has released the dataset for academic research only. The dataset and license details are available on Fluent’s Google Group.

Test or Train

Machine learning developers need data for both training and testing. If you’re interested in testing, i.e., choosing the best natural language understanding engine for your application, try Picovoice’s reproducible natural language understanding benchmark framework. It gives enterprises a scientific basis for comparing alternatives and making informed decisions. If you’re interested in training a model from scratch, the next step is to choose a framework: TensorFlow, PyTorch, or one built in-house. This can be overwhelming. It’s not uncommon for enterprises to re-evaluate the pros and cons of building, using open source, and buying at this stage.
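For the testing path, the core measurement is often intent accuracy: the share of labeled utterances for which an engine returns the expected intent. The sketch below is a minimal illustration of that idea, not Picovoice’s benchmark framework. The `intent_accuracy` helper, the `keyword_engine` stand-in, and the example utterances are all hypothetical; only the intent names are taken from the SmartSpeakers description above. It works on text for brevity, whereas a benchmark on the speech corpora listed here would feed audio to the engine.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical labeled example: (utterance text, expected intent name).
LabeledUtterance = Tuple[str, str]


def intent_accuracy(
    infer: Callable[[str], Optional[str]],
    examples: Iterable[LabeledUtterance],
) -> float:
    """Return the fraction of examples whose predicted intent matches the label."""
    total = 0
    correct = 0
    for text, expected in examples:
        total += 1
        if infer(text) == expected:
            correct += 1
    if total == 0:
        raise ValueError("no examples provided")
    return correct / total


def keyword_engine(text: str) -> Optional[str]:
    """Toy stand-in for an NLU engine (an assumption, not a real product API)."""
    lowered = text.lower()
    if "volume up" in lowered or "louder" in lowered:
        return "VolumeUp"
    if "next" in lowered:
        return "NextSong"
    return None  # the engine did not understand the utterance


# Made-up utterances for illustration; they do not come from any dataset.
examples = [
    ("play the next song", "NextSong"),
    ("make it louder", "VolumeUp"),
    ("go back one track", "PreviousSong"),
    ("skip to the next track", "NextSong"),
]

accuracy = intent_accuracy(keyword_engine, examples)
print(f"intent accuracy: {accuracy:.2f}")
```

Running the same labeled set through each candidate engine and comparing the resulting scores is what makes a comparison reproducible: the data and the metric stay fixed, and only the engine changes.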

Picovoice offers enterprises the same control over their data and product that building in-house would, without the development complexity and maintenance. Design voice user interfaces specific to your application on the no-code Picovoice Console. If you need help or further customization, engage with Picovoice Consulting.
