Accuracy is a measure of model performance. It’s the ratio of correct predictions to the total number of predictions. The higher the accuracy, the better the model performs.
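As a minimal sketch of the ratio described above, the following Python function computes accuracy from a list of predictions and a list of expected labels. The function name and the sample data are hypothetical and only serve as an illustration.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match the expected labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one prediction is required")
    # Count the positions where the prediction equals the expected label.
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical data: 4 of the 5 predictions are correct.
predicted = ["yes", "no", "yes", "yes", "no"]
expected = ["yes", "no", "no", "yes", "no"]
print(accuracy(predicted, expected))  # 0.8
```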
Acoustic Echo Cancellation:
Acoustic echo cancellation (AEC) is a front-end audio solution that filters unwanted sounds such as echoes or reverberation and improves speech input. AEC may be required for voice applications when voice is generated from a far end, such as a loudspeaker, to improve experience and accuracy.
Acoustic processing in speech deals with the extraction of information from different acoustic signals. It is used to retrieve and generate phonetic information.
Alexa is Amazon’s wake word and digital assistant technology, released in 2014. Alexa is capable of voice interactions and real-time information gathering by interacting with the cloud. While Alexa is the best-known voice assistant among end-users, it has also been a controversial name due to privacy issues.
Alexa Skills are like applications that third-party developers build for Alexa. Amazon offers Alexa Skills Kit (ASK) to enable developers to build Alexa skills. Skills are published in Alexa Skills Stores after a certification process. Alexa Skills are equivalent to Google Actions for Alexa-empowered products.
Artificial intelligence (AI) is an interdisciplinary field of computer science and statistics. It enables machines to solve problems by simulating and mimicking human intelligence and actions. Voice AI is a subfield of artificial intelligence.
Automatic Speech Recognition (ASR):
Automatic Speech Recognition (ASR) focuses on converting spoken language to text, i.e. converting unstructured voice to structured text. ASR is also known as Speech-to-Text or Open-domain Large Vocabulary Speech Recognition. Picovoice offers two Speech-to-Text engines: Leopard converts audio recordings to text, and Cheetah converts speech to text in real time.
Benchmark is a tool to evaluate the relative performances of hardware or software products on tasks by running standard tests and experiments.
While building Picovoice, we noted there was a need for a scientific tool to evaluate voice recognition engines and started publishing open-source benchmarks:
- Wake Word Benchmark (KWS & hotword)
- Speech-to-Intent Benchmark (VUI & NLU)
- Speech-to-Text Benchmark (ASR & STT)
- Speech-to-Index Benchmark (voice search)
- Voice Activity Detection Benchmark (VAD)
Branded Wake Word:
Branded wake words are wake words trained with brand or product names. For example, Alexa is Amazon’s branded wake word, Hey Siri is Apple’s and Porcupine is Picovoice’s. Enterprises can train branded wake words with Porcupine Wake Word. When you’re ready to train, don’t forget to check out our tips for choosing a wake word.
Bixby is Samsung’s virtual assistant, similar to Alexa or Okay Google. It was released in 2017 to replace S Voice.
Built-in slots are used by natural language understanding (NLU) engines to help developers write expressions faster while developing voice products. Built-in slots are pre-defined slots that handle common requirements such as letters, numbers and ordinal numbers. Picovoice’s Speech-to-Intent engine, Rhino, also offers built-in slots with the 'pv.' prefix to distinguish them from custom slots. Don’t forget to check out the Rhino cheat sheet for details.
Cheetah Speech-to-Text is the first and only commercially available and supported streaming on-device speech-to-text engine. Cheetah processes voice data locally on the device without sending it to a 3rd party cloud.
Picovoice Console, or the Console for short, is a self-service, cloud-based platform to design, develop and train voice AI models. The Console has a type-and-train interface, so no machine learning or coding experience is required to use it. Anyone with an email address can sign up and train voice models.
Cobra Voice Activity Detection:
Cobra Voice Activity Detection is a voice activity detection engine, or VAD for short. It detects human voice and distinguishes it from other audio inputs and noises. Picovoice initially developed Cobra as an internal tool and, given the market demand, made it publicly available.
A context consists of a set of intents and intent details, i.e. expressions and slots, within a domain of interest. For example, a context for a "smart lighting system" is built by using intents (turn on, turn off), slots (room: living room, bedroom, kitchen) and expressions (turn on the “room” lights). Check out the cheat sheet to learn how to build contexts with Rhino.
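To make the relationship between intents, slots and expressions more concrete, here is a minimal Python sketch of the smart lighting context as a data structure, with a toy text matcher. Everything in it is hypothetical: the dictionary layout, the `{room}` placeholder notation and the `infer` function are illustrative only and are not the Rhino context syntax. The sketch also matches text, whereas Rhino infers intent directly from speech.

```python
import re

# Hypothetical representation of a "smart lighting system" context.
context = {
    "intents": {
        "turnOn": ["turn on the {room} lights"],
        "turnOff": ["turn off the {room} lights"],
    },
    "slots": {
        "room": ["living room", "bedroom", "kitchen"],
    },
}


def infer(utterance, context):
    """Return (intent, slot values) for the first expression that matches."""
    text = utterance.lower().strip()
    for intent, expressions in context["intents"].items():
        for expression in expressions:
            # Turn the expression into a regular expression in which each
            # slot placeholder becomes a named group of its allowed values.
            pattern = re.escape(expression)
            for slot, values in context["slots"].items():
                alternatives = "|".join(re.escape(v) for v in values)
                pattern = pattern.replace(
                    re.escape("{" + slot + "}"), f"(?P<{slot}>{alternatives})"
                )
            match = re.fullmatch(pattern, text)
            if match:
                return intent, match.groupdict()
    return None, {}


print(infer("Turn on the living room lights", context))
print(infer("play some music", context))
```

The first call matches the "turnOn" intent with the room slot set to "living room". The second call falls outside the context, so no intent is returned.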
Cortana is Microsoft’s voice assistant similar to Alexa and Google Assistant. It was launched in 2014. Microsoft ended Cortana support for various platforms including iOS, Android and its own Surface Headphones and removed them from the marketplaces in 2021.
In computing, cross-platform, multi-platform or platform-independent refers to software that is developed to work across various computing platforms. Design once, deploy anywhere is also used to refer to it. Picovoice technology is hardware and platform-agnostic. Anyone can enjoy speech recognition on Android, iOS, Linux, macOS, Windows, and modern web browsers such as Chrome, Firefox and Safari, as well as on Raspberry Pi, BeagleBone, Arm Cortex-M, Arduino, and NVIDIA Jetson.
Deep learning, also known as deep structured learning or hierarchical learning, is a type of machine learning and artificial intelligence. Deep learning mimics how the human brain works and gains certain types of knowledge. Learning can be supervised, semi-supervised or unsupervised.
DeepSpeech, or Mozilla DeepSpeech, is one of the best-known and most accurate free and open-source (FOSS) speech-to-text engines. It was developed by Mozilla using TensorFlow and is based on Baidu's Deep Speech research. Mozilla no longer maintains DeepSpeech. See other free and open-source transcription engines.
Design thinking, also known as user-centred design, is a process to develop a product or service by empathizing with users and prioritizing their needs and pain points. It’s an iterative process that includes observation, reframing problems, ideation, and testing to make sure the solution is what end users want. We listed five tips to apply design thinking principles while building voice user interfaces.
Design once, deploy anywhere:
Design once, deploy anywhere or design once, deploy everywhere refers to software that is developed to work across various computing platforms. It’s also known as cross-platform.
Eavesdrop is a verb used for secretly listening to a conversation. It’s been associated with smart speakers such as Amazon Echo, Google Home and Apple HomePod after consumers learned that their conversations were recorded without consent. In 2020, a former Amazon executive revealed that he’d switch off his smart speaker when he had private conversations.
Echo (Amazon Echo):
Amazon Echo is a smart speaker released by Amazon in 2014. Echo has since become the brand name of Amazon's smart speakers.
Edge computing refers to an architecture where computing or storing data is done at or near the source. On-device processing is also a form of edge computing. Edge computing brings computing close to the data, while cloud computing brings the data to the computing. Cloud and edge computing offer different advantages. Learn more about edge computing, cloud repatriation and running models on the edge, on-prem and in the cloud.
Edge Voice AI:
Edge voice AI refers to the technology of processing voice data locally on the device. Voice recognition on the edge does not require internet connectivity to process voice data and eliminates cloud-related costs. See the benefits of Edge Voice AI.
End-pointing in speech recognition focuses on understanding when the user is done speaking to a machine. “End-of-utterance detection”, “end-of-query detection” or “end-of-turn detection” can also be used to define this challenge. Both early and late endpoints hinder user experience. An early endpoint cuts a user off during a speech pause, such as taking a breath. A late endpoint waits too long after a user is done speaking, or keeps listening when there is background noise, including other speakers.
Modern speech recognition technology addresses this challenge. For example, Rhino allows voice command recognition even in the presence of overlapping speech. Both Rhino and Cheetah allow adjusting the endpoint duration, so developers can decide what works best for their use case.
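The trade-off between early and late endpoints can be sketched with a simple silence timer. The Python example below is a simplified illustration, not the Rhino or Cheetah implementation: it assumes a hypothetical list of per-frame voiced/unvoiced flags (for example from a voice activity detector) and declares the endpoint once the trailing silence reaches a configurable duration. The function and parameter names are assumptions.

```python
def find_endpoint(voiced_flags, frame_ms, endpoint_ms):
    """Return the index of the frame where the utterance is considered
    finished, or None if the endpoint is never reached.

    voiced_flags: one boolean per audio frame (True means speech detected).
    frame_ms:     duration of a frame in milliseconds.
    endpoint_ms:  silence required after speech to declare the endpoint.
    """
    # Number of consecutive silent frames needed, rounded up.
    needed = (endpoint_ms + frame_ms - 1) // frame_ms
    silent_frames = 0
    heard_speech = False
    for index, voiced in enumerate(voiced_flags):
        if voiced:
            heard_speech = True
            silent_frames = 0
        elif heard_speech:
            # Only count silence that follows speech.
            silent_frames += 1
            if silent_frames >= needed:
                return index
    return None


# Hypothetical stream: speech, a one-frame pause (a breath), more speech,
# then four frames of silence.
flags = [False, True, True, False, True, False, False, False, False]

print(find_endpoint(flags, frame_ms=100, endpoint_ms=100))  # 3
print(find_endpoint(flags, frame_ms=100, endpoint_ms=300))  # 7
print(find_endpoint(flags, frame_ms=100, endpoint_ms=500))  # None
```

With a 100 ms endpoint the user is cut off during the short pause (an early endpoint). With 300 ms the endpoint lands after the user has actually finished. With 500 ms the sketch is still waiting when the stream ends (a late endpoint).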
Expressions refer to voice inputs by humans, i.e. how humans express themselves. They’re also known as spoken utterances or just utterances. In Natural Language Understanding, intents are composed of a collection of expressions. When a user's utterance matches any expression within an intent, the intent is detected. For example, "make coffee" and "make me a coffee" could be expressions, each of which signals the "Make Coffee" intent. Check out the Rhino syntax cheat sheet to learn more.
False Acceptance Rate (FAR):
A false accept or false positive indicates the presence of a condition when it’s not there. For example, researchers found that sentences such as “I can spare” and “I don’t like the cold” activate the Google smart speaker by mistake. Falsely activating dormant devices or applications is considered a false acceptance. A system's false acceptance rate (FAR) is the number of false acceptances divided by the total number of attempts. The total attempts consist of False Positives, False Negatives, True Positives and True Negatives.
False Rejection Rate (FRR):
A false rejection or false negative indicates the absence of a condition when it is actually present. For example, when you say “Alexa” to activate a smart speaker and it misses, that’s considered a false reject. A system's false rejection rate (FRR) is the number of false rejections divided by the total number of attempts. The total attempts consist of False Positives, False Negatives, True Positives and True Negatives.
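The two rates can be computed from the four outcome counts. The Python sketch below follows the definitions given in this glossary, where both rates are divided by the total number of attempts. Note that some references instead divide false acceptances by the number of negative attempts only, and false rejections by the number of positive attempts only. The function name and the counts are hypothetical.

```python
def error_rates(true_pos, false_pos, true_neg, false_neg):
    """Return (FAR, FRR) as defined in this glossary: false acceptances and
    false rejections, each divided by the total number of attempts."""
    total = true_pos + false_pos + true_neg + false_neg
    if total == 0:
        raise ValueError("at least one attempt is required")
    far = false_pos / total
    frr = false_neg / total
    return far, frr


# Hypothetical test run with 200 attempts in total.
far, frr = error_rates(true_pos=90, false_pos=5, true_neg=100, false_neg=5)
print(far, frr)  # 0.025 0.025
```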
Far-Field Speech Recognition:
Far-field speech recognition happens when there is a distance between the source and the computer. For example, smart speakers process voice data from afar (far-field speech recognition), while mobile phones process it when the source is close (near-field speech recognition). Far-field speech recognition depends on many factors including the distance, ambient noise level, reverberation (echo), quality of the microphone, and audio frontend used (if any).
GDPR stands for the General Data Protection Regulation. It is a regulation on data protection and privacy in the European Union and the European Economic Area. GDPR considers any identifiable data, including voice, as personal data. Learn more on GDPR and voice AI and how to ensure the privacy of voice data.
Google Actions are like applications that third-party developers build for Google Assistant. Google offers Actions Builders and Actions SDKs to enable developers to build Google Actions. Google Actions are equivalent to Alexa skills for Google Assistant. Google is sunsetting Actions for Google Assistant by June 2023.
Google Assistant is Google’s virtual assistant released in 2016 to replace Google Now. It’s Google's version of Alexa by Amazon, Siri by Apple, Cortana by Microsoft and Bixby by Samsung.
Google Nest is the brand name for Google’s smart home products including Google Home smart speakers, streaming devices and Nest thermostats. Google Assistant-powered Google Home smart speakers such as Google Home, Google Home Hub, and Google Home Mini were rebranded under Google Nest in 2019.
Grapheme-to-phoneme, also known as G2P, refers to the task of converting letters (grapheme sequence) to their pronunciations (phoneme sequence).
HIPAA stands for the Health Insurance Portability and Accountability Act. It’s a US federal law that governs the privacy and security of personal health information.
A homophone is a noun used to describe words with the same sound, i.e. pronunciation, but different meanings or spellings. For example, to, too, and two are homophones. Recognizing homophones correctly is one of the widely known challenges in speech recognition.
HomePod is a smart speaker released by Apple in 2018 and discontinued in 2021. It was replaced by the HomePod mini, which was released in 2020.
A hotword or hot word is a special phrase that activates dormant applications or devices. It’s also known as the wake word, a special application of KWS. Don’t forget to check out Picovoice’s guide on selecting a custom hotword.
Hotword detection refers to the technology that detects a hotword to trigger an action. It’s an application of always listening commands and is also known as wake word detection. Learn about the differences between always listening commands and hotword detection in detail.
Etymologically, hyper means excessive. Hyper-customization, or hyper-personalization, refers to the flexibility and ability to tailor products or services to individual users' wants and needs. Hyper-customization is one of the features that differentiate Picovoice from other vendors. See the Why Picovoice section to learn more.
In-car entertainment (ICE) or In-vehicle infotainment (IVI) is a term used for a combination of hardware and software to provide control and entertainment to drivers. These systems include touch-free voice control, touch screens, touch-sensitive panels, or steering wheel controls.
An intent in speech recognition focuses on the general meaning of utterances. For example, when a user says “get me a large americano”, “I want a large americano” or “can you please make me a large americano” the user intends to order a cup of coffee. "Intent" is one of the most commonly used NLU terms.
Interactive Voice Response (IVR):
Interactive Voice Response (IVR) is a voice control application. It's an automated system that enables callers to interact with the host system through pre-recorded voice responses via a telephone keypad or speech recognition. For example, to reset their password, a user may receive a pre-recorded voice prompt to dial 4 or any other number, or a prompt to direct them to tell what they want to achieve. When a user says “reset my password” or dials 4, their call gets routed to a menu or a specialist. IVRs are used to minimize the number of agents and offer a better service to the users.
Kaldi is a well-known free and open-source (FOSS) automatic speech recognition (ASR) toolkit. Despite not leveraging deep learning, Kaldi is still widely used, especially by researchers and scientists. Learn more about other free and open-source transcription engines.
Keyword Spotting (KWS) in speech recognition focuses on detecting a keyphrase within a stream of audio. Products that leverage KWS are always listening in order to recognize the specific keyphrase.
KWS enables voice activation and wake word detection. Porcupine Wake Word uses Keyword Spotting. Learn more about Keyword Spotting in voice recognition.
In computing, latency refers to the delay in passing data from one point to another. From a user experience point of view, the delay between a user request and a product’s response to that request is perceived as friction. In speech recognition, latency hinders the experience significantly for applications such as AR/VR, where voice is the only input. For mission-critical applications such as voice-activated surgical robots, even milliseconds of delay could cause fatal results, similar to autonomous devices. For industrial voice assistants, latency causes fluctuations in productivity. Edge Voice AI eliminates network latency. Learn more about the Edge Voice AI benefits.
Lemmatization refers to removing affixes based on morphological analysis of words to return the base or dictionary form of a word, which is known as the lemma.
Leopard Speech-to-Text is a local automatic speech recognition (ASR) engine that converts speech to text. Leopard processes voice data locally on the device without sending it to a 3rd party cloud. It outperforms competing cloud-based or offline speech-to-text solutions by wide margins.
Lexicon refers to knowledge of words. The Economist found out that most adult native speakers know 20,000-35,000 words. For Picovoice products, it’s above 200,000 words in English.
Local Speech Recognition:
Local speech recognition refers to the technology of processing voice data on the device where the voice data is generated or stored. Trained AI models are deployed on the device to process voice data locally without sending the data to the cloud. Since no internet connectivity is required to process voice data, cloud-related costs, including hidden or connectivity-related ones, do not occur. Better performance and user experience are achieved by eliminating latency and connectivity issues and minimizing power consumption. Local speech recognition is also cost-effective at scale since voice commands are not charged per API call.
Microphone Array Beamforming:
Microphone array beamforming is an audio front-end application. Beamforming processes signals from multiple omnidirectional sources (i.e. microphones) to focus on the most prominent sound (i.e. the user’s voice) and disregard the other sounds (i.e. noises).
Morphological segmentation is an NLP method that breaks words into meaning-bearing morphemes, the smallest grammatical units of speech. The three morphemes of "unbreakable" are 1. un (signifies not), 2. break (root), and 3. able (signifies an ability).
Multimodality refers to multimodal interfaces that offer users multiple interaction points. For example, a smart speaker can be activated by touching it or by using a wake word. Another multimodal example is voice typing: while using dictation on your mobile phone, you can see what’s written and select or edit the written text via touch or voice command. For multimodal applications to succeed, the design of the GUI and the VUI should go hand in hand.
Natural Language Generation (NLG):
Natural Language Generation (NLG) is a subfield of Natural Language Processing (NLP). NLG deals with generating a text response in natural language. Read more about the differences among NLP, NLG and NLU.
Natural Language Processing (NLP):
Natural Language Processing (NLP) is a field that combines linguistics, computer science, and artificial intelligence. It deals with the interactions between computers and humans via natural language inputs in the form of text or audio. Read more about the differences among NLP, NLG and NLU.
Natural Language Understanding (NLU):
Natural Language Understanding (NLU) is a subfield of Natural Language Processing (NLP). NLU deals with machine reading and comprehension to understand the intent, i.e. the meaning of what it reads. Read more about the differences among NLU, NLP and NLG.
Artificial Neural Networks (ANN), or neural networks for short, are artificial networks of neurons or nodes for solving artificial intelligence (AI) problems. Artificial neural networks are inspired by the biological neural networks that constitute the brain. Neural networks are trained by processing examples that contain a known “input” and “result”. Neural networks “learn” by forming probability-weighted associations between inputs and results. After determining the error, i.e. the difference between the processed output of the network (e.g. a prediction) and a target output, the network adjusts its weighted associations accordingly.
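The "determine the error, then adjust the weights" loop can be sketched with the smallest possible network: a single linear neuron. The Python example below is an illustration under simplifying assumptions, not how production speech models are trained. The function name, the learning rate and the training data (inputs paired with results that follow result = 2 × input) are hypothetical.

```python
def train_step(weights, bias, inputs, target, learning_rate):
    """Run one example through a single linear neuron and adjust its weights."""
    # Processed output of the network (the prediction).
    prediction = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Error: difference between the prediction and the target output.
    error = prediction - target
    # Adjust each weight in proportion to the error and to its input.
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * error
    return new_weights, new_bias, error


# Hypothetical examples with a known "input" and "result".
examples = [([1.0], 2.0), ([2.0], 4.0), ([3.0], 6.0)]

weights, bias = [0.0], 0.0
for _ in range(2000):
    for inputs, target in examples:
        weights, bias, _ = train_step(weights, bias, inputs, target, learning_rate=0.05)

# After training, the weight is close to 2 and the bias is close to 0.
print(round(weights[0], 3), round(bias, 3))
```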
A no-code software development platform, or no-code development for short, enables both developers and non-programmers to create applications through graphical user interfaces without writing code as in traditional computer programming. Picovoice’s Shepherd enables no-code voice AI for MCUs.
Noise suppression is like a filter to remove distracting ambient noises such as keyboard typing or fan noise to create a better experience.
No-input error is a type of error that occurs when an automatic speech recognition (ASR) system doesn’t detect the speech input although it exists.
No-match error is a type of error that occurs when an automatic speech recognition (ASR) system cannot match the speech input with the responses that it expects or knows.
Octopus is Picovoice’s Speech-to-Index Engine that indexes speech directly without relying on a text representation. Octopus's acoustic-only approach boosts accuracy by removing the out-of-vocabulary limitation and eliminating the problem of the competing hypothesis. For voice search applications, Octopus outperforms speech-to-text based solutions by wide margins. An open-source benchmarking framework is available for Octopus and ASR alternatives.
On-device Voice Processing:
On-device speech recognition, on-device voice recognition, on-device voice processing, on-device voice AI or Edge Voice AI refers to performing inference with models directly within the platform such as a mobile app, web browser or an MCU without sending voice data to the cloud. On-device voice processing eliminates cloud-related costs and offers improved experience and better performance.
Open-domain Large Vocabulary Speech Recognition:
Open-domain Large Vocabulary Speech Recognition refers to speech-to-text or automatic speech recognition. Open-domain Large Vocabulary Speech Recognition is required when a voice use case is not limited to a given domain or confined within a fixed set of commands. It’s useful when there is an inherent interest in capturing the transcription, such as meeting transcription, note-taking, and voice typing. Check out Picovoice’s strategy guide to select the best technology for your use case.
A phoneme is a unit of sound. As words are written with letters, they are pronounced (spoken) with phonemes. For example, the word "speech" has six letters and four phonemes (s-p-E-ch).
Porcupine Wake Word:
Porcupine Wake Word trains and understands custom wake words and always-listening commands. Porcupine offers a truly hands-free experience by replacing the need for a push-to-talk (PTT) button, enabling rapid interactions without any friction. Don’t forget to check out the open-source wake word benchmark.
Prototyping is a process where teams implement ideas into tangible forms at varying degrees of fidelity to capture concepts and test them on users. Prototypes enable teams to refine ideas and products to release the right products to the right audience. Picovoice Console enables users to design custom voice AI models that are immediately available to be tested on the web. It helps with the voice prototyping process to build voice user interfaces that users want.
Push-to-talk, also known as press-to-transmit, is a method of initiating voice transmission by switching from the reception mode. Push-to-talk buttons were first used for radio devices such as walkie-talkies, then adopted by other form factors such as mobile phones or smart speakers. Some multi-modal products may have soft push-to-talk buttons, instead of physical ones. In the last few years, in order to offer a truly hands-free experience to the users, push-to-talk buttons have been replaced by wake words.
The real-time factor (RTF) is the most common metric used to measure the speed of automatic speech recognition (ASR) solutions. It is calculated by dividing the time taken to transcribe the audio by the duration of the audio. For example, an RTF value of 0.5 means that the time spent to transcribe the file is half of the length of the audio file. In other words, transcribing an hour-long file takes 30 minutes. You can test Leopard’s RTF by yourself using the Free Plan.
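The calculation above can be written as a short Python sketch. The `measure_rtf` helper is hypothetical: it times any callable passed in as `transcribe` and is not tied to a specific speech-to-text engine or SDK.

```python
import time


def real_time_factor(processing_sec, audio_sec):
    """RTF = time taken to transcribe / duration of the audio."""
    if audio_sec <= 0:
        raise ValueError("audio duration must be positive")
    return processing_sec / audio_sec


def measure_rtf(transcribe, audio, audio_sec):
    """Time a transcription call and return its real-time factor.

    transcribe: any callable that processes the audio (hypothetical stand-in).
    """
    start = time.perf_counter()
    transcribe(audio)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_sec)


# An hour-long file (3600 s) transcribed in 30 minutes (1800 s).
print(real_time_factor(processing_sec=1800, audio_sec=3600))  # 0.5
```

An RTF below 1.0 means the engine processes audio faster than real time.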
Resource efficiency is a term for maximum output given minimal resources. Resource-efficient is used for products that consume minimal resources. Picovoice products consume minimal resources and prolong battery life. See the Why Picovoice section to learn more.
Reverberation in acoustics is the persistence of sound, like an echo, after a sound is produced. Reverberation is created when reflected sound builds up and then decays as it is absorbed by the air and by surfaces in the space, such as furniture.
Rhino Speech-to-Intent is a context-aware spoken language understanding (SLU) engine. It infers intent directly from spoken commands within a given context of interest, in real time, whereas most SLU engines infer the intent from text. By eliminating the need for a text representation, Rhino responds faster and more accurately.
ROC (Receiver Operating Characteristic):
A ROC curve is used to evaluate the accuracy of binary classifiers, such as wake word detection engines. A receiver operating characteristic (ROC) curve plots true positive rates (TPR) against false positive rates (FPR) at various sensitivity values. The larger the area under the curve, the better the accuracy of the product. See how a ROC curve is used to benchmark wake word engines.
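As a rough sketch of how such a curve is built, the Python example below sweeps a detection threshold over hypothetical detector scores, computes an (FPR, TPR) point at each threshold and estimates the area under the curve with the trapezoidal rule. The scores, labels, thresholds and function names are all assumptions. A score threshold stands in for the sensitivity setting here: lowering the threshold plays the role of raising the sensitivity.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, TPR) point per threshold.

    scores: detector confidence for each audio sample.
    labels: 1 if the sample really contains the wake word, otherwise 0.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    points = []
    for threshold in thresholds:
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        false_pos = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the curve, with points sorted by FPR."""
    ordered = sorted(points)
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(ordered, ordered[1:])
    )


# Hypothetical detector scores and ground-truth labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
thresholds = [1.1, 0.85, 0.75, 0.5, 0.35, 0.2, 0.0]

points = roc_points(scores, labels, thresholds)
print(points)
print(round(area_under_curve(points), 3))  # 0.889
```

A perfect detector would reach an area of 1.0, while random guessing gives about 0.5.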
SDK stands for software development kit. It is a set of tools in one installable package and provided by the vendors of hardware or software.
Search by Voice:
Search by voice or search with voice allows users to search queries by using voice input instead of typing. Voice search can replace typed search directly or complement it in multimodal interfaces. Check out Picovoice’s Google Chrome Extension demo for a hands-free Google Search experience.
Sensitivity is a value between 0.0 and 1.0. A value of 0.0 suppresses all audio, while a value of 1.0 suppresses none. A higher sensitivity value gives a lower miss rate at the expense of a higher false alarm rate. One should pick a sensitivity value that suits the application's requirements.
Shepherd No-code Platform:
Picovoice Shepherd is the first no-code platform for building voice interfaces on microcontrollers. Picovoice Shepherd accelerates prototyping, mitigates technical risks, and shortens time-to-market. Paired with Picovoice Console, users can deploy custom voice models onto microcontrollers instantly.
Siri is Apple’s digital assistant technology and wake word (Hey Siri), released in 2011. Siri is capable of voice interactions and real-time information gathering by interacting with the cloud.
A slot, or an entity, is a set of specific pieces of information from an utterance that helps machines understand the intent. Slots are accessible from various intents in the same model. For example, a slot can be the different locations in a house, such as the living room, kitchen or bedroom. This slot could be accessed from different intents such as “turn on lights” and “turn off lights” in a model trained for a smart home application. Slot is one of the most commonly used NLU terms. To make the development process easier, Picovoice also offers built-in slots. Don’t forget to check out the Rhino syntax cheat sheet for details.
A speech corpus, or spoken corpus, is a large database of audio files and text transcriptions. A corpus can be based on any written or spoken data including legal documents, interviews and social media. (Speech) corpus is one of the most commonly used NLU terms.
Speech recognition is a common name for technology and methodologies that convert unstructured voice data into structured text. While speech data cannot be recognized by computers directly, trained AI algorithms recognize and transform them into text for human-computer interactions or analysis.
Speech-to-Text (STT) is a common name for technology and methodologies that convert unstructured voice data into structured text. It’s also known as Automatic Speech Recognition (ASR) and Open-domain Large Vocabulary Speech Recognition. Picovoice offers two speech-to-text engines: Leopard Speech-to-Text for recordings and Cheetah for real-time streaming.
Speech-to-Intent is a term coined by Picovoice after developing Rhino. Rhino infers intents directly from speech and applies modern SLU principles. The name refers to the direct conversion of speech to intent without text in between.
Speech-to-Index indexes speech directly without relying on a text representation. Picovoice coined this term after developing Octopus. Text-based indexing and search algorithms do not work accurately for voice search applications. Thus, Octopus was initially developed as a response to this market demand.
Stemming is a method used in NLP and NLU. It reduces a word to its word stem. In simple terms, it chops off the ends of words and removes derivational affixes (suffixes and prefixes).
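The "chopping off the ends of words" idea can be shown with a deliberately naive Python sketch. It only strips a few common English suffixes and is not an established algorithm such as the Porter stemmer. The suffix list, the minimum stem length and the function name are assumptions chosen for illustration.

```python
# Hypothetical, deliberately short suffix list.
SUFFIXES = ("ing", "ed", "ly", "s")


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for word in ["playing", "played", "plays", "quickly", "sing"]:
    print(word, "->", naive_stem(word))
```

The first three words all reduce to "play", and "quickly" reduces to "quick". The word "sing" is left untouched because stripping "ing" would leave too short a stem, which hints at why practical stemmers need more careful rules.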
Text-to-Speech (TTS), also known as voice generation, is a technology that converts text to artificially produced speech.
A trigger word is another term for a wake word, wake-up word, hotword or triggering word, which is a special form of KWS. Trigger word refers to a special phrase that is recognized by an application and initiates, i.e. triggers, a process. It’s mostly used to activate a dormant application, hence it triggers an application to listen to further commands.
True negative or true rejection indicates the rejection of a condition when it is not there.
True positive or true acceptance indicates the acceptance of a condition when it is there.
An utterance means a spoken word or statement. It is a continuous piece of speech beginning and ending with a clear pause. Utterance is one of the most commonly used NLU terms.
User-centered design is a process to develop a product or service by keeping users at the core of the development process. Picovoice encourages organizations to follow user-centered design principles while developing voice products by offering an easy-to-use console that enables an iterative development process. We listed five tips to apply design thinking principles while building voice user interfaces and got five tips from conversation design expert Erika Hall.
Voice activation allows users to activate applications simply by talking, i.e. using their voice, instead of a touchscreen or buttons. It can be done via a wake word (Porcupine) or voice activity detection (Cobra). Learn more about voice activation and how to enable it.
Voice Activity Detection:
Voice activity detection, or VAD for short, is the technology used to detect the presence of a human voice and distinguish it from other sounds or noises. Check the What’s Voice Activity Detection article to learn more, or the voice activity detection benchmark to compare Cobra with webRTC VAD. If you’re ready to build, check this Python tutorial or start with your favourite SDK.
Voice biometrics is the technology that identifies specific markers within audio data. It’s like an audio version of a fingerprint that is unique to the person’s identity. Voice Biometrics is used for Voice Identification, Speaker Recognition and Voice Verification.
Voice search refers to keyword or query search within audio files. It can be thought of as the audio counterpart of the “ctrl/cmd + F” function for text-based content. However, since voice data is not structured, it’s not easy to perform a similar task. One approach to structuring voice data is to transform voice into text by using an STT engine and then index the text data. Another is to index speech directly. Octopus, Picovoice’s Voice Search Engine, uses the second approach and hence achieves high accuracy even with pronouns. Learn more about voice search or voice search applications.
Voice User Interface (VUI):
Voice User Interface (VUI) is an interface, just like a Graphical User Interface (GUI), that allows users to interact with machines by voice. A VUI response can be triggering an action, retrieving information or running a process to complete an end-to-end task. We listed the challenges of building VUIs on mobile, five tips to apply design thinking principles while building voice user interfaces and got five tips from conversation design expert Erika Hall.
Wake word, wake-up word, or hotword is a module that is used to trigger an action. Most of the time, wake words are taken for granted, as they are short and simple voice commands. Do not forget to check out Picovoice’s guide on selecting a wake word and train your wake word with Porcupine when you’re ready, even if you want to run on an MCU or in a web browser.
Wake Word Detection:
Wake word detection refers to the task of recognizing a special phrase that activates a device or an application. The famous wake words “Alexa”, “Hey Siri” and “OK Google” are all detected via wake word detection technology. Wake word detection is an application of always-listening commands and KWS. Porcupine is Picovoice’s wake word engine (see Porcupine Wake Word). Do not forget to check out the open-source wake word benchmark before starting to build.
Wake-up word, or WuW for short, is another term for wake word, hotword or trigger word and is a special application of KWS. Learn about Keyword Spotting and the differences between these terminologies in detail.
webRTC VAD (Voice Activity Detection):
webRTC VAD is a module within Google’s free and open-source webRTC initiative. webRTC provides real-time communication capabilities via APIs, and the voice activity detection module is used to classify whether an audio stream contains voice data or not. webRTC VAD is mainly for telecommunication applications. Check out webRTC VAD’s performance compared to Cobra VAD.
Word Error Rate:
The word error rate (WER) is the most common metric to evaluate speech recognition engines. WER is calculated by dividing the sum of the errors by the total number of reference words. In other words, WER shows how close the real output (i.e. transcription by an ASR) is to the intended output (i.e. the original text). We prepared a beginners’ guide and things to know about WER and how to improve speech-to-text accuracy.
Don’t forget to compare Leopard’s WER vs. others.
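The WER calculation can be sketched in Python by counting errors as the word-level edit distance between the reference and the transcription, i.e. the minimum number of substitutions, deletions and insertions, and dividing by the number of reference words. This is a minimal illustration that skips text normalization such as casing and punctuation. The function name and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion (reference word missed)
                    current[j - 1] + 1,      # insertion (extra hypothesis word)
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


reference = "turn on the living room lights"
hypothesis = "turn on living room light"
# One deletion ("the") and one substitution ("lights" -> "light"): 2 / 6.
print(round(word_error_rate(reference, hypothesis), 3))  # 0.333
```

Because insertions count as errors, WER can exceed 1.0 when the transcription contains many extra words.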