Accuracy is the performance result of models. It’s measured by the ratio of correct predictions over the total number of predictions. The higher the accuracy is, the better a model performs.
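The ratio described above can be sketched in a few lines of Python. The function name and the sample labels are hypothetical and only serve as illustration.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that equal their reference label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one prediction is required")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Hypothetical example: 4 of the 5 predictions match the labels.
predicted = ["on", "off", "on", "on", "off"]
expected = ["on", "off", "off", "on", "off"]
print(accuracy(predicted, expected))
```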
Acoustic Echo Cancellation:
Acoustic echo cancellation (AEC) is a front-end audio solution that filters unwanted sounds, such as echoes or reverberation, to improve speech input. AEC may be required for voice applications when voice is generated from a far end, such as a loudspeaker, to improve experience and accuracy.
Acoustic processing in speech deals with the extraction of information from different acoustic signals. It is used to retrieve and generate phonetic information.
Alexa is Amazon’s wake word and digital assistant technology that was released in 2014. Alexa is capable of voice interactions and real-time information gathering by interacting with the cloud. While Alexa is the best-known voice assistant among end-users, it’s also been a controversial name due to privacy issues.
Alexa Skills are like applications that third-party developers build for Alexa. Amazon offers Alexa Skills Kit (ASK) to enable developers to build Alexa skills. Skills are published in Alexa Skills Stores after a certification process. Alexa Skills are equivalent to Google Actions for Alexa-empowered products.
Artificial intelligence (AI) is an interdisciplinary field of computer science and statistics. It enables machines to solve problems by simulating and mimicking human intelligence and actions. Voice AI is a subfield of artificial intelligence.
Automatic Speech Recognition (ASR):
Automatic Speech Recognition (ASR) focuses on converting spoken language, i.e. unstructured voice, to structured text. ASR is also known as Speech-to-Text or Open-domain Large Vocabulary Speech Recognition. Picovoice offers two Speech-to-Text engines: Leopard converts audio recordings to text, and Cheetah converts speech to text in real time.
The dictionary definition of babble is the sound of people talking without meaning. Babble noise in speech processing refers to the noise that constantly changes as people carry on conversations. It’s one of the most difficult challenges researchers address while working on speech enhancement and it affects speech intelligibility significantly.
Benchmark is a tool to evaluate the relative performances of hardware or software products on tasks by running standard tests and experiments.
While building Picovoice, we noted there was a need for a scientific tool to evaluate voice recognition engines and started publishing open-source benchmarks:
- Wake Word Benchmark (KWS & hotword)
- Speech-to-Intent Benchmark (VUI & NLU)
- Speech-to-Text Benchmark (ASR & STT)
- Noise Suppression Benchmark (speech enhancement)
- Speech-to-Index Benchmark (phonetic search)
- Voice Activity Detection Benchmark (VAD)
Branded Wake Word:
Branded wake words are wake words trained with brand or product names. For example, Alexa is Amazon’s branded wake word, Hey Siri is Apple’s, and Porcupine is Picovoice’s. Enterprises can train branded wake words with Porcupine Wake Word. When you’re ready to train, don’t forget to check out our tips for choosing a wake word.
Bixby is Samsung’s virtual assistant, similar to Alexa or Okay Google. It was released in 2017 to replace S Voice.
Built-in slots are used by natural language understanding (NLU) engines to help developers write expressions faster while developing voice products. Built-in slots are pre-defined slots to handle common requirements such as letters, numbers and ordinal numbers. Picovoice’s Speech-to-Intent engine, Rhino also offers built-in slots with a 'pv.' prefix to distinguish them from custom slots. Don’t forget to check out the Rhino cheat sheet for details.
Cheetah Speech-to-Text is the first and only commercially available and supported streaming on-device speech-to-text engine. Cheetah processes voice data locally on the device without sending it to a 3rd party cloud.
Picovoice Console, or the Console for short, is a self-service and cloud-based platform to design, develop and train voice AI models. The Console has a type-and-train interface. Thus, no machine learning or coding experience is required to use the Console. Anyone with an email address can sign up and train voice models.
Cobra Voice Activity Detection:
Cobra Voice Activity Detection is a voice activity detection engine, or VAD for short. It detects human voice and distinguishes it from other audio inputs and noises. Picovoice initially developed Cobra as an internal tool and, given the market demand, made it publicly available.
A context consists of a set of intents and intent details, i.e. expressions and slots, within a domain of interest. For example, a context for a "smart lighting system" is built by using intents (turn on, turn off), slots (room: living room, bedroom, kitchen) and expressions (turn on the “room” lights). Check out the cheat sheet to learn how to build contexts with Rhino.
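A minimal sketch of how the smart lighting context above could be represented as data, and how one expression expands over a slot. The dictionary layout, the `$room` placeholder, and the `expand` helper are assumptions made for illustration; they are not Rhino's context format or syntax.

```python
import re
from itertools import product

# Hypothetical in-memory layout for the "smart lighting system" context.
context = {
    "intents": {
        "turnOn": ["turn on the $room lights"],
        "turnOff": ["turn off the $room lights"],
    },
    "slots": {"room": ["living room", "bedroom", "kitchen"]},
}


def expand(expression, slots):
    """List every concrete phrase an expression covers by filling each $slot."""
    names = re.findall(r"\$(\w+)", expression)
    phrases = []
    for values in product(*(slots[name] for name in names)):
        phrase = expression
        for name, value in zip(names, values):
            phrase = phrase.replace("$" + name, value, 1)
        phrases.append(phrase)
    return phrases


print(expand(context["intents"]["turnOn"][0], context["slots"]))
```

Keeping slots separate from expressions is what lets both intents reuse the same list of rooms.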
Cortana is Microsoft’s voice assistant similar to Alexa and Google Assistant. It was launched in 2014. Microsoft ended Cortana support for various platforms including iOS, Android and its own Surface Headphones and removed them from the marketplaces in 2021.
In computing, cross-platform, multi-platform or platform-independent refers to software that is developed to work across various computing platforms. Design once, deploy anywhere is also used to refer to it.
Picovoice technology is hardware- and platform-agnostic. Anyone can enjoy speech recognition on Android, iOS, Linux, macOS, Windows, and modern web browsers such as Chrome, Firefox, and Safari, as well as on Raspberry Pi, BeagleBone, Arm Cortex-M, Arduino, and NVIDIA Jetson. Don’t forget to check out other Picovoice platform features.
Deep learning, also known as deep structured learning or hierarchical learning, is a type of machine learning and artificial intelligence. Deep learning mimics how the human brain works and gains certain types of knowledge. There are different architectures and frameworks which can affect the performance of deep learning models.
DeepSpeech or Mozilla DeepSpeech is one of the best-known and most accurate free and open-source (FOSS) speech-to-text engines. It was developed by Mozilla using TensorFlow and is based on Baidu’s Deep Speech research. Mozilla no longer maintains DeepSpeech. See other free and open-source transcription engines.
Design thinking, also known as user-centred design, is a process to develop a product or service by empathizing with users and prioritizing their needs and pain points. It’s an iterative process that includes observation, reframing problems, ideation, and testing to make sure the solution is what end users want. We listed five tips to apply design thinking principles while building voice user interfaces.
Eagle Speaker Recognition:
Eavesdrop is a verb used for secretly listening to a conversation. It’s been associated with smart speakers such as Amazon Echo, Google Home and Apple Homepod after consumers learned that their conversations were recorded without consent. Privacy is a concern for any voice project. Enterprises should do thorough research and select privacy-focused speech recognition solutions.
Echo (Amazon Echo):
Edge computing refers to an architecture where computing or storing data is done at or near the source. On-device processing is a form of edge computing. Edge computing brings the computation near the data, while cloud computing brings the data near the computation. Cloud and edge computing offer different advantages. Learn more about edge computing, cloud repatriation and running models on the edge, on-prem and in the cloud.
Edge Voice AI:
Edge voice AI refers to the technology of processing voice data locally on the device. Voice recognition on the edge does not send voice data to the cloud to process it and eliminates cloud-related costs. See the benefits of Edge Voice AI.
End-pointing in speech recognition focuses on understanding when the user is done speaking to a machine. “End-of-utterance detection”, “end-of-query detection” or “end-of-turn” detection can also be used to define this challenge.
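One simple way to approach end-pointing is to wait for a run of silence after speech. The sketch below assumes a voice activity detector has already labeled each audio frame as speech or non-speech; the function name and the `min_trailing_silence` parameter are hypothetical, and production engines may use more elaborate logic.

```python
def find_endpoint(voiced_frames, min_trailing_silence=3):
    """Return the index of the frame where the utterance is judged complete, or None.

    voiced_frames: per-frame booleans (True = speech) from some voice activity detector.
    The utterance ends once speech has been heard and is then followed by
    min_trailing_silence consecutive non-speech frames.
    """
    heard_speech = False
    silence_run = 0
    for index, voiced in enumerate(voiced_frames):
        if voiced:
            heard_speech = True
            silence_run = 0  # a short pause inside the utterance is forgiven
        elif heard_speech:
            silence_run += 1
            if silence_run >= min_trailing_silence:
                return index
    return None


frames = [False, True, True, False, True, False, False, False, True]
print(find_endpoint(frames))
```

A larger `min_trailing_silence` tolerates longer pauses mid-sentence but makes the system respond later.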
Expressions refer to voice inputs by humans, i.e. how humans express themselves. They’re also known as spoken utterances or just utterances. In Natural Language Understanding, intents are composed of a collection of expressions. When a user's utterance matches any expression within an intent, the intent is detected. For example, "make coffee" and "make me a coffee" are expressions, each of which signals the "Make Coffee" intent. Check out the Rhino syntax cheat sheet to learn more.
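The matching rule above can be illustrated with a toy, text-based lookup. This is only a sketch: the `intents` dictionary and `detect_intent` function are hypothetical, and a real engine does not necessarily compare text strings like this.

```python
# Hypothetical intent definition: one intent, two expressions.
intents = {
    "makeCoffee": ["make coffee", "make me a coffee"],
}


def detect_intent(utterance, intents):
    """Return the first intent with an expression equal to the normalized utterance."""
    normalized = " ".join(utterance.lower().split())
    for intent, expressions in intents.items():
        if normalized in expressions:
            return intent
    return None


print(detect_intent("Make me a coffee", intents))
print(detect_intent("make tea", intents))
```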
False Acceptance Rate (FAR):
A false accept or false positive indicates the presence of a condition when it’s not there. For example, researchers found out that sentences such as “I can spare” and “I don’t like the cold” activate the Google smart speaker by mistake. A system's false acceptance rate (FAR) is the number of false acceptances divided by the total number of attempts. The total attempts consist of False Positive, False Negative, True Positive and True Negative.
False Rejection Rate (FRR):
A false rejection or false negative indicates the absence of a condition when it is actually present. For example, when you say “Alexa” to activate a smart speaker, if it misses then it’s considered a false reject. A system's false rejection rate (FRR) is the ratio of the number of false rejections divided by the total number of attempts. The total attempts consist of False Positive, False Negative, True Positive and True Negative.
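The two rates can be computed side by side. This sketch follows the definitions given in this glossary, where the denominator is the total number of attempts; note that some references instead divide false accepts by the negative attempts only and false rejects by the positive attempts only. The function name and sample data are hypothetical.

```python
def far_frr(outcomes):
    """Compute (FAR, FRR) from (expected, detected) boolean pairs, one per attempt."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("at least one attempt is required")
    false_accepts = sum(1 for expected, detected in outcomes if detected and not expected)
    false_rejects = sum(1 for expected, detected in outcomes if expected and not detected)
    return false_accepts / total, false_rejects / total


attempts = [
    (True, True),    # true positive: wake word spoken and detected
    (True, False),   # false negative: wake word spoken but missed
    (False, False),  # true negative
    (False, True),   # false positive: triggered without the wake word
    (False, False),  # true negative
]
far, frr = far_frr(attempts)
print(far, frr)
```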
Far-Field Speech Recognition:
Far-field speech recognition happens when there is a distance between the source and the computer. For example, smart speakers process voice data from far, hence far-field speech recognition. Mobile phones process it when the source is close, hence near-field speech recognition.
GDPR stands for the General Data Protection Regulation, and it is a regulation on data protection and privacy in the European Union and the European Economic Area. GDPR considers any identifiable data, including voice, as personal data. Learn more on GDPR and voice AI and how to ensure the privacy of voice data.
Google Actions are like applications that third-party developers build for Google Assistant. Google offers Actions Builders and Actions SDKs to enable developers to build Google Actions. Google Actions are equivalent to Alexa skills for Google Assistant. Google is sunsetting Actions for Google Assistant by June 2023.
Google Assistant is Google’s virtual assistant released in 2016 to replace Google Now. It’s Google's version of Alexa by Amazon, Siri by Apple, Cortana by Microsoft and Bixby by Samsung.
Google Nest is the brand name for Google’s smart home products including Google Home smart speakers, streaming devices and Nest thermostats. Google Assistant-powered Google Home smart speakers such as Google Home, Google Home Hub, and Google Home Mini were rebranded under Google Nest in 2018.
Grapheme-to-phoneme, also known as G2P, refers to the task of converting letters (grapheme sequence) to their pronunciations (phoneme sequence).
HIPAA stands for the Health Insurance Portability and Accountability Act. It’s a US federal law that governs the privacy and security of personal health information.
A homophone is a word that has the same sound, i.e. pronunciation, as another word but a different meaning or spelling. For example, to, too, and two are homophones. Recognizing homophones correctly is one of the widely known challenges in speech recognition.
HomePod is a smart speaker that was released by Apple in 2018 and discontinued in 2021. HomePod mini, which was released in 2020, replaced HomePod.
A hotword or hot word is a special phrase that activates dormant applications or devices. It’s also known as the wake word, a special application of KWS. Don’t forget to check out Picovoice’s guide on selecting a custom hotword.
Hotword detection refers to the technology that detects a hotword to trigger an action. It’s an application of always listening commands and is also known as wake word detection. Learn about the differences between always listening commands and hotword detection in detail.
In-car entertainment (ICE) or In-vehicle infotainment (IVI) is a term used for a combination of hardware and software to provide control and entertainment to drivers. These systems include touch-free voice control, touch screens, touch-sensitive panels, or steering wheel controls.
An intent in speech recognition focuses on the general meaning of utterances. For example, when a user says “get me a large americano”, “I want a large americano” or “can you please make me a large americano” the user intends to order a cup of coffee. "Intent" is one of the most commonly used NLU terms.
Interactive Voice Response (IVR):
Interactive Voice Response (IVR) is a voice control application. It's an automated system that enables callers to interact with the host system through pre-recorded voice responses via a telephone keypad or speech recognition. For example, to reset their password, a user may receive a pre-recorded voice prompt to dial 4 or any other number, or a prompt to direct them to tell what they want to achieve. When a user says “reset my password” or dials 4, their call gets routed to a menu or a specialist. IVRs are used to minimize the number of agents and offer a better service to the users.
Kaldi is one of the best-known free and open-source (FOSS) automatic speech recognition (ASR) toolkits. Kaldi is widely used, especially by researchers and scientists. Learn more about other free and open-source transcription engines.
Keyword Spotting (KWS) in speech recognition focuses on detecting a keyphrase within a stream of audio. Products that leverage KWS are always listening to recognize the specific keyphrase.
Koala Noise Suppression:
Koala Noise Suppression is Picovoice’s Speech Enhancement engine. It processes speech data on the device without sending it to any 3rd party cloud, making it a perfect choice for real-time noise cancellation.
In computing, latency refers to the delay in passing data from one point to another. From a user experience point of view, the delay between a user request and a product’s response to that request is perceived as friction. Latency in speech recognition hinders experience significantly for applications such as AR/VR. For other mission-critical applications such as voice-activated surgical robots, even milliseconds of delay could cause fatal results similar to autonomous devices. For industrial voice assistants, latency causes fluctuations in productivity. Edge Voice AI eliminates network latency. Learn more about the Edge Voice AI benefits.
Lemmatization refers to removing affixes based on morphological analysis of words to return the base or dictionary form of a word, which is known as the lemma.
Leopard Speech-to-Text is a local automatic speech recognition (ASR) engine that converts speech to text. Leopard processes voice data locally on the device without sending it to a 3rd party cloud. It outperforms competing cloud-based or offline speech-to-text solutions by wide margins.
Lexicon refers to knowledge of words. The Economist found out that most adult native speakers know 20,000-35,000 words. For Picovoice products, it’s above 200,000 words in English.
Local Speech Recognition:
Local speech recognition refers to the technology of processing voice data on the device where it is generated or stored. Trained AI models are deployed on the device to process voice data locally without sending the data to the cloud. Since voice data is not sent to the cloud, cloud or connectivity-related costs do not occur. Better performance and user experience are achieved by eliminating latency and connectivity issues and minimizing power consumption. Local speech recognition also offers cost-effectiveness at scale since voice commands are not charged per API call.
Microphone Array Beamforming:
Microphone array beamforming is an audio front-end application. Beamforming processes signals from multiple omnidirectional sources (i.e. microphones) to focus on the most prominent sound (i.e. user’s voice) and disregard the other sounds (i.e. noises).
Morphological segmentation is an NLP method that breaks words into meaning-bearing morphemes, the smallest grammatical units of speech. The three morphemes of "unbreakable" are 1. un (signifies not), 2. break (root), and 3. able (signifies an ability).
Natural Language Generation (NLG):
Natural Language Generation (NLG) is a subfield of Natural Language Processing (NLP). NLG deals with generating a text response in natural language. Read more about the differences among NLP, NLG and NLU.
Natural Language Processing (NLP):
Natural Language Processing (NLP) is a field that combines linguistics, computer science, and artificial intelligence. It deals with the interactions between computers and humans via natural language inputs in the form of text or audio. Read more about the differences among NLP, NLG and NLU.
Natural Language Understanding (NLU):
Natural Language Understanding (NLU) is a subfield of Natural Language Processing (NLP). NLU deals with machine reading and comprehension to understand the intent, i.e. the meaning of what it reads. Read more about the differences among NLU, NLP and NLG.
Artificial Neural Networks (ANN), or neural networks for short, refer to artificial networks of neurons or nodes for solving artificial intelligence (AI) problems. Artificial neural networks are inspired by biological neural networks which constitute the brain. There are different architectures and frameworks which can affect the performance of neural networks.
A no-code software development platform, or no-code development for short, enables both developers and non-programmers to create applications through graphical user interfaces without writing code as in traditional computer programming. Picovoice’s Shepherd enables no-code voice AI for MCUs.
Noise suppression is like a filter to remove distracting ambient noises such as keyboard typing or fan noise to create a better experience. Koala Noise Suppression is the only high-quality, real-time, cross-platform, and production-ready noise cancellation software available to any developer.
No-input error is a type of error that occurs when an automatic speech recognition (ASR) system doesn’t detect the speech input, although it exists.
No-match error is a type of error that occurs when an automatic speech recognition (ASR) system cannot match the speech input with the responses that it expects or knows.
Octopus is Picovoice’s Speech-to-Index Engine that indexes speech directly without relying on a text representation. Octopus's acoustic-only approach boosts accuracy by removing the out-of-vocabulary limitation and eliminating the problem of the competing hypothesis. It’s the best Speech-to-Text Alternative for Search. It outperforms speech-to-text-based solutions by wide margins when it comes to finding phrases in audio files, media asset management, legal e-discovery, dialogue search, or social media listening.
On-device Voice Processing:
On-device speech recognition, on-device voice recognition, on-device voice processing, on-device voice AI or Edge Voice AI refers to performing inference with models directly within the platform such as a mobile app, web browser, or an MCU without sending voice data to the cloud. On-device voice processing eliminates cloud-related costs and offers improved experience and better performance.
Open-domain Large Vocabulary Speech Recognition:
Open-domain Large Vocabulary Speech Recognition refers to speech-to-text or automatic speech recognition. Open-domain Large Vocabulary Speech Recognition is required when a voice use case is not limited to a given domain or confined within a fixed set of commands. It’s useful when there is an inherent interest in capturing the transcription, such as meeting transcription, note-taking, and voice typing. Check out Picovoice’s strategy guide to select the best technology for your use case.
Orca Text-to-Speech is Picovoice’s voice generator. It converts written text into spoken audio output without network latency or jeopardizing user privacy.
A phoneme is a unit of sound. As words are written with letters, they are pronounced (spoken) with phonemes. For example, the word "speech" has six letters and four phonemes (s-p-E-ch). The International Phonetic Alphabet uses phonemes.
Phonetic search refers to keyword or query search within audio files. It can be thought of as the “Ctrl/Cmd + F” function for text-based content. However, since voice data is not structured, it’s not easy to perform a similar task. One approach to structuring voice data is to transform voice into text by using an STT engine and then index the text data. Another is to index the audio directly. Octopus, Picovoice’s Phonetic Search Engine, uses the second approach and hence achieves high accuracy even with pronouns. Learn more about phonetic search or phonetic search applications.
Porcupine Wake Word:
Porcupine Wake Word trains and understands custom wake words and always-listening commands. Porcupine enables always-listening commands and offers a truly hands-free experience by replacing the need for a push-to-talk (PTT) button for rapid interactions without any friction. Don’t forget to check out the open-source wake word benchmark and evaluate the performance of wake word detection engines. If you don’t have test data, use open datasets for keyword spotting.
Push-to-talk, also known as press-to-transmit, is a method of initiating voice transmission by switching from the reception mode. Push-to-talk buttons were first used for radio devices such as walkie-talkies, then adopted by other form factors such as mobile phones or smart speakers. Some multi-modal products may have soft push-to-talk buttons, instead of physical ones. In the last few years, in order to offer a truly hands-free experience to the users, push-to-talk buttons have been replaced by wake words.
The real-time factor (RTF) is the most common metric used to measure the speed of automatic speech recognition (ASR) solutions. It is calculated by dividing the time taken to transcribe the audio by the duration of the audio. For example, an RTF value of 0.5 means that the time spent to transcribe the file is half of the length of the audio file. In other words, transcribing an hour-long file takes 30 minutes. You can test Leopard’s RTF by yourself using the Free Plan.
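The RTF arithmetic above is a one-line division. The sketch below also shows one way to time a transcription call; `measure_rtf` and the `transcribe` callable it receives are hypothetical stand-ins for whatever engine is being tested, not part of any specific SDK.

```python
import time


def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(transcribe, audio, audio_seconds):
    """Time one call to a caller-supplied transcribe function and return its RTF."""
    start = time.perf_counter()
    transcribe(audio)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


# The example from the text: an hour-long file transcribed in 30 minutes.
print(real_time_factor(30 * 60, 60 * 60))
```

An RTF below 1.0 means the engine transcribes faster than the audio plays.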
Reverberation in acoustics is the persistence of sound, like an echo, after a sound is produced. Reverberation is created when reflected sound builds up and then decays as it is absorbed by the air and by surfaces in the space, such as furniture.
Rhino Speech-to-Intent is a context-aware SLU Engine. It’s the best Speech-to-Text Alternative (STTA) for voice assistants. It directly infers intent from spoken commands within a given context of interest, in real-time whereas most SLU engines infer the intent from text. By eliminating the need for text representation, Rhino responds faster and more accurately.
ROC (Receiver Operating Characteristic):
A receiver operating characteristic (ROC) curve is used to evaluate the accuracy of binary classifiers, such as wake word detection engines. It plots true positive rates (TPR) against false positive rates (FPR) at various sensitivity values. The larger the area under the curve, the better the accuracy of the product. See how a ROC curve is used to benchmark wake word engines. A variant of ROC, the Detection Error Trade-off (DET) curve, is used to evaluate the performance of speaker recognition engines.
SDK stands for software development kit. It is a set of tools in one installable package, provided by the vendors of hardware or software.
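A small sketch of how ROC points and the area under the curve could be computed from classifier scores. It assumes each test clip has a detection score and a ground-truth label, sweeps the decision threshold over the observed scores, and integrates with the trapezoidal rule; the function names and sample data are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]  # a threshold above every score accepts nothing
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores; label 1 means the clip really contains the wake word.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
print(area_under_curve(points))
```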
Search by Voice:
Search by voice or search with voice allows users to search queries by using voice input instead of typing. Phonetic search can replace type search directly or complement it. Check out Picovoice’s Google Chrome Extension demo for a hands-free Google Search experience.
Sensitivity is a value between 0.0 and 1.0. A value of 0.0 suppresses all audio, while a value of 1.0 suppresses none. A higher sensitivity value gives a lower miss rate at the expense of a higher false alarm rate. One should pick a sensitivity value that suits the application's requirements.
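The trade-off can be illustrated with a toy model. The sketch assumes, purely for illustration, that a detection fires when a clip's score is at least `1 - sensitivity`; how a real engine maps sensitivity to its internal threshold is not specified here, and the function name and data are hypothetical.

```python
def miss_and_false_alarm_rates(scores, labels, sensitivity):
    """Return (miss rate, false alarm rate) at one sensitivity value.

    Assumption: a detection fires when score >= 1 - sensitivity.
    """
    threshold = 1.0 - sensitivity
    positives = sum(labels)
    negatives = len(labels) - positives
    misses = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    false_alarms = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    return misses / positives, false_alarms / negatives


# Hypothetical scores; label 1 means the clip really contains the keyword.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(miss_and_false_alarm_rates(scores, labels, 0.25))  # stricter: more misses
print(miss_and_false_alarm_rates(scores, labels, 0.75))  # looser: more false alarms
```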
Shepherd No-code Platform:
Picovoice Shepherd is the first no-code platform for building voice interfaces on microcontrollers. Picovoice Shepherd accelerates prototyping, mitigates technical risks, and shortens time-to-market. Paired with Picovoice Console, users can deploy custom voice models onto microcontrollers instantly.
Siri is Apple’s digital assistant technology and wake word (Hey Siri) that was released in 2011. Siri is capable of voice interactions and real-time information gathering by interacting with the cloud.
A slot or an entity is a set of specific pieces of information from an utterance to help machines understand the intent. Slots are accessible from various intents in the same model. For example, a slot can be different locations in a house such as the living room, kitchen or bedroom. Slots could be accessed from different intents such as “turn on lights” and “turn off lights” in a model trained for a smart home application. Slot is one of the most commonly used NLU terms. To make the development process easier, Picovoice also offers built-in slots. Don’t forget to check out the Rhino syntax cheat sheet for details.
Speaker Recognition is the technology that is used to identify and verify speakers based on their distinguishable voice characteristics. Speaker Recognition is a complex technology, and there are several factors one should consider before choosing a speaker recognition engine. Certain use cases require language-agnostic and text-independent speaker recognition engines, like Eagle Speaker Recognition.
A speech corpus, or spoken corpus, is a large database of audio files and text transcriptions. A corpus can be based on any written or spoken data including legal documents, interviews, and social media. Speech corpora are crucial to train and test voice AI models. Check out the best-known open-source speech corpora for speech-to-text, for natural language understanding, and for keyword spotting.
Speech Intelligibility shows the percentage of speech that a listener can understand. Various factors, such as the articulation of a speaker or background noises, affect Speech Intelligibility.
Speech Quality is a metric used to measure speech enhancement.
Speech Recognition is a common name for technology and methodologies that convert unstructured voice data into structured text. While speech data cannot be recognized by computers directly, trained AI algorithms recognize and transform them into text for human-computer interactions or analysis. Despite the common myth, running speech recognition software does not require powerful computers, even a web browser, Raspberry Pi, or an MCU would work.
Speech-to-Text (STT) is a common name for technology and methodologies that convert unstructured voice data into structured text. It’s also known as Automatic Speech Recognition (ASR) and Open-domain Large Vocabulary Speech Recognition. Using hybrid models or end-to-end speech-to-text models affects the performance of the voice products.
Picovoice offers two speech-to-text engines: Leopard Speech-to-Text for recordings and Cheetah for real-time streaming. Both Leopard and Cheetah are on-device speech-to-text engines that process voice data locally, resulting in private, accurate and fast experiences with zero network latency.
Speech-to-Intent is a term coined by Picovoice after developing Rhino. It’s the best Speech-to-Text alternative to develop use-case-specific voice assistants. Rhino infers intents directly from speech and applies modern SLU principles. The name refers to the direct conversion of speech to intent without the text in between.
Speech-to-Index indexes speech directly without relying on a text representation. It is the best Speech-to-Text alternative to search. Picovoice coined this term after developing Octopus. Text-based indexing and search algorithms do not work accurately for audio search applications. Thus, Octopus was initially developed as a response to this market demand.
Text-to-Speech (TTS), also known as voice generation, is a technology that converts text to artificially produced speech. Orca Text-to-Speech is Picovoice’s text-to-speech engine that converts written text into spoken audio output.
A trigger word is another term used for a wake word, wake-up word, hotword or triggering word, which is a special form of KWS. Trigger word refers to a special phrase that is recognized by an application and initiates, i.e. triggers, a process. It’s mostly used to activate a dormant application, hence it triggers an application to listen to further commands. There are not many vendors offering trigger word engines. Porcupine Wake Word is Picovoice’s trigger word engine and developers’ favorite!
True negative or true rejection indicates the rejection of a condition when it is not there.
True positive or true acceptance indicates the acceptance of a condition when it is there.
An utterance means a spoken word or statement. It is a continuous piece of speech beginning and ending with a clear pause. Utterance is one of the most commonly used NLU terms.
User-centered design is a process to develop a product or service by keeping users at the core of the development process. Picovoice encourages organizations to follow user-centered design principles while developing voice products by offering an easy-to-use console that enables an iterative development process. We listed five tips to apply design thinking principles while building voice user interfaces and got five tips from conversation design expert Erika Hall.
Voice activation allows users to activate applications simply by talking, i.e. using their voice, instead of a touchscreen or buttons. It could be done via a wake word engine (Porcupine) or a voice activity detection engine (Cobra). Learn more about voice activation and how to enable it.
Voice Activity Detection:
Voice activity detection, or VAD for short, is the technology used to detect the presence of a human voice and distinguish it from other sounds or noises. Check the What’s Voice Activity Detection article to learn more, or the voice activity detection benchmark to compare Cobra with webRTC VAD. If you’re ready to build, check this Python tutorial or start with your favourite SDK.
Voice biometrics is the technology that identifies specific markers within audio data. It’s like an audio version of a fingerprint that is unique to the person’s identity. Voice Biometrics is also known as Voice Identification, Speaker Recognition and Voice Verification. Try Eagle Speaker Recognition, Picovoice’s voice biometrics engine!
Voice User Interface (VUI):
Voice User Interface (VUI) is an interface, just like a Graphical User Interface (GUI), that allows users to interact with machines with voice. VUI responses can be triggering an action, retrieving information or a process to complete an end-to-end task. We listed the challenges of building VUIs on mobile, five tips to apply design thinking principles while building voice user interfaces and got five tips from conversation design expert Erika Hall.
Wake word, wake-up word, or hotword is a module that is used to trigger an action. Most of the time, wake words are taken for granted, as they are short and simple voice commands. However, a good wake word detection engine should be use-case specific, accurate, and run on the device.
Wake Word Detection:
Wake word detection refers to the task of recognizing an utterance of a special phrase that activates a device or an application. Wake word detection is an application of always listening commands and KWS. Porcupine is Picovoice’s wake word engine. Do not forget to check out the open-source wake word benchmark before starting to build.
Wake-up word or WuW in short is another term for wake word, hotword or trigger word and is a special application of KWS. Learn about Keyword Spotting and the differences between these terminologies in detail.
webRTC VAD (Voice Activity Detection):
webRTC VAD is a module within Google’s free and open-source webRTC initiative. webRTC provides real-time communication capabilities via APIs, and the voice activity detection module is used to classify whether an audio stream consists of voice data or not. webRTC VAD is mainly for telecommunication applications. Check out webRTC VAD’s performance compared to Cobra VAD.
Word Error Rate:
The word error rate (WER) is the most common metric to evaluate speech recognition engines. WER is calculated by dividing the sum of the errors by the total number of reference words. In other words, WER shows how close the real output (i.e. transcription by an ASR) is to the intended output (i.e. the original text). We prepared a beginners’ guide and things to know about WER and how to improve speech-to-text accuracy.
Don’t forget to compare Leopard’s WER vs. others.
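The WER calculation described above can be sketched with a word-level edit distance, where the errors are the substitutions, deletions, and insertions needed to turn the reference into the transcription. The function name and sample sentences are hypothetical, and the sketch skips the text normalization (casing, punctuation, numerals) that benchmarks usually apply first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One deleted word ("the") and one substituted word ("lights" -> "light")
# against a five-word reference.
print(word_error_rate("turn on the kitchen lights", "turn on kitchen light"))
```

Because insertions count as errors, WER can exceed 1.0 when the transcription is much longer than the reference.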