Glossary
A
Accuracy:
Accuracy is a measure of model performance. It is the ratio of correct predictions to the total number of predictions. The higher the accuracy, the better the model performs.
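As a simple illustration (a generic sketch, not tied to any particular engine):

```python
def accuracy(num_correct: int, num_total: int) -> float:
    """Ratio of correct predictions to the total number of predictions."""
    return num_correct / num_total

# e.g., 92 correct predictions out of 100 -> 0.92 (92% accuracy)
print(accuracy(92, 100))
```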
Acoustic Echo Cancellation:
Acoustic echo cancellation (AEC) is a front-end audio solution that filters unwanted sounds, such as echoes and reverberation, to improve speech input. Acoustic echo cancellation may be required for voice applications when voice is generated from a far end, such as a loudspeaker, to improve experience and accuracy.
Acoustic Processing:
Acoustic processing in speech deals with the extraction of information from different acoustic signals. It is used to retrieve and generate phonetic information.
Alexa:
Alexa is Amazon’s wake word and digital assistant technology, released in 2014. Alexa is capable of voice interactions and real-time information gathering by interacting with the cloud. While Alexa is the best-known voice assistant among end-users, it’s also been a controversial name due to privacy issues.
Alexa Skill:
Alexa Skills are like applications that third-party developers build for Alexa. Amazon offers the Alexa Skills Kit (ASK) to enable developers to build Alexa Skills. Skills are published in the Alexa Skills Store after a certification process. Alexa Skills are the Alexa equivalent of Google Actions.
Artificial Intelligence:
Artificial intelligence (AI) is an interdisciplinary field of computer science and statistics. It enables machines to solve problems by simulating and mimicking human intelligence and actions. Voice AI is a subfield of artificial intelligence.
Automatic Speech Recognition (ASR):
Automatic Speech Recognition (ASR) focuses on converting spoken language to text, i.e., turning unstructured voice into structured text. ASR is also known as Speech-to-Text or Open-domain Large Vocabulary Speech Recognition. Picovoice offers two Speech-to-Text engines: Leopard converts audio recordings to text, and Cheetah converts real-time speech to text.
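For illustration, here is a minimal sketch of transcribing a recording with Leopard’s Python SDK; the AccessKey and file path are placeholders, and the snippet assumes the pvleopard package’s documented API shape:

```python
import pvleopard

# '${ACCESS_KEY}' is a placeholder for a Picovoice Console AccessKey.
leopard = pvleopard.create(access_key='${ACCESS_KEY}')

# Transcribe a pre-recorded audio file (the path is illustrative).
transcript, words = leopard.process_file('recording.wav')
print(transcript)

leopard.delete()
```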
B
Babble Noise:
The dictionary definition of babble is the sound of people talking without meaning. Babble noise in speech processing refers to the noise that constantly changes as people carry on conversations. It’s one of the most difficult challenges researchers address while working on speech enhancement and it affects speech intelligibility significantly.
Benchmark:
A benchmark is a tool to evaluate the relative performance of hardware or software products by running standard tests and experiments.
While building Picovoice, we noted there was a need for a scientific tool to evaluate voice recognition engines and started publishing open-source benchmarks:
- Open-source Wake Word Benchmark (KWS & hotword)
- Open-source Speech-to-Intent Benchmark (VUI & NLU)
- Open-source Speech-to-Text Benchmark (ASR & STT)
- Open-source Noise Suppression Benchmark (speech enhancement)
- Open-source Speech-to-Index Benchmark (phonetic search)
- Open-source Voice Activity Detection Benchmark (VAD)
- Open-source Speaker Recognition Benchmark
- Open-source Speaker Diarization Benchmark
- Open-source Text-to-Speech Benchmark
- Open-source LLM Quantization Benchmark
Bit:
Short for ‘binary digit’, a bit is the smallest unit of data processed and stored by computers.
Bit Depth:
Bit depth is the number of bits used to represent each audio sample; the higher the bit depth, the greater the dynamic range a recording can capture. In imaging, bit depth is the number of bits used to indicate the color of each pixel, where a higher bit depth allows more unique colors in an image's palette.
Bit Rate:
Bit rate is the number of bits processed or transmitted per second. Common units of measurement are bps (bits per second), Kbps (kilobits per second), and Mbps (megabits per second).
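For example, the bit rate of uncompressed PCM audio follows directly from the sample rate, bit depth, and channel count:

```python
def pcm_bit_rate(sample_rate_hz: int, bit_depth: int, num_channels: int) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * num_channels

# 16 kHz, 16-bit, mono speech -> 256,000 bps (256 kbps)
print(pcm_bit_rate(16_000, 16, 1))
```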
Branded Wake Word:
Branded wake words are wake words trained on brand or product names. For example, Alexa (Amazon), Hey Siri (Apple), and Porcupine (Picovoice) are branded wake words. Enterprises can train branded wake words with Porcupine Wake Word. When you’re ready to train, don’t forget to check out our tips for choosing a wake word.
Bixby:
Bixby is Samsung’s virtual assistant, similar to Alexa or Google Assistant. It was released in 2017 to replace S Voice.
Built-in Slots:
Built-in slots are used by natural language understanding (NLU) engines to help developers write expressions faster while developing voice products. Built-in slots are pre-defined slots that handle common requirements such as letters, numbers, and ordinal numbers. Picovoice’s Speech-to-Intent engine, Rhino, also offers built-in slots with a 'pv.' prefix to distinguish them from custom slots. Don’t forget to check out the Rhino cheat sheet for details.
C
Cheetah Speech-to-Text:
Cheetah Speech-to-Text is the first and only commercially available and supported streaming on-device speech-to-text engine. Cheetah processes voice data locally on the device without sending it to a 3rd party cloud.
Console:
Picovoice Console, or the Console for short, is a self-service and cloud-based platform to design, develop, and train voice AI models. The Console has a type-and-train interface. Thus, no machine learning or coding experience is required to use the Console. Anyone with an email address can sign up and train voice models.
Cobra Voice Activity Detection:
Cobra Voice Activity Detection is a voice activity detection engine, or VAD for short. It detects human voice and distinguishes it from other audio inputs and noises. Picovoice initially developed Cobra as an internal tool and, given market demand, made it publicly available.
Context:
A context consists of a set of intents and intent details, i.e. expressions and slots, within a domain of interest. For example, a context for a "smart lighting system" is built by using intents (turn on, turn off), slots (room: living room, bedroom, kitchen) and expressions (turn on the “room” lights). Check out the cheat sheet to learn how to build contexts with Rhino.
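For illustration, here is a minimal sketch of running such a smart lighting context with Rhino’s Python SDK; the AccessKey and the `smart_lighting.rhn` context file are placeholders, and the snippet assumes the pvrhino package’s documented API shape:

```python
import pvrhino

# The AccessKey and context file path are placeholders.
rhino = pvrhino.create(
    access_key='${ACCESS_KEY}',
    context_path='smart_lighting.rhn')

def on_audio_frame(pcm):
    # Feed frames of 16-bit PCM; process() returns True once inference is finalized.
    if rhino.process(pcm):
        inference = rhino.get_inference()
        if inference.is_understood:
            # e.g., intent='turnOn', slots={'room': 'living room'}
            print(inference.intent, inference.slots)
```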
Cortana:
Cortana is Microsoft’s voice assistant, similar to Alexa and Google Assistant. It was launched in 2014. Microsoft ended Cortana support for various platforms, including iOS, Android, and its own Surface Headphones, and removed the apps from marketplaces in 2021.
Cross-platform:
In computing, cross-platform, multi-platform, or platform-independent refers to software developed to work across various computing platforms. ‘Design once, deploy anywhere’ also describes this approach.
Picovoice technology is hardware- and platform-agnostic. Anyone can enjoy speech recognition on Android, iOS, Linux, macOS, Windows, and modern web browsers such as Chrome, Firefox, and Safari, as well as on Raspberry Pi, Arm Cortex-M, and Arduino. Don’t forget to check out other Picovoice platform features.
D
Deep Learning:
Deep learning, also known as deep structured learning or hierarchical learning, is a type of machine learning and artificial intelligence. Deep learning mimics how the human brain works and gains certain types of knowledge. There are different architectures and frameworks which can affect the performance of deep learning models.
DeepSpeech:
DeepSpeech, or Mozilla DeepSpeech, is one of the best-known free and open-source (FOSS) speech-to-text engines. It was developed by Mozilla using TensorFlow, based on Baidu’s Deep Speech research. Mozilla no longer maintains DeepSpeech. See other free and open-source transcription engines.
Design Thinking:
Design thinking, also known as user-centered design, is a process of developing a product or service by empathizing with users and prioritizing their needs and pain points. It’s an iterative process that includes observation, reframing problems, ideation, and testing to make sure the solution is what end users want. We listed five tips to apply design thinking principles while building voice user interfaces.
E
Eagle Speaker Recognition:
Eagle Speaker Recognition is Picovoice’s speaker identification and verification engine. It is language-agnostic and text-independent, enabling several use cases, including personalization.
Eavesdrop:
Eavesdrop is a verb meaning to listen secretly to a conversation. It’s been associated with smart speakers such as Amazon Echo, Google Home, and Apple HomePod after consumers learned that their conversations were recorded without consent. Privacy is a concern for any voice project. Enterprises should do thorough research and select privacy-focused speech recognition solutions.
Echo (Amazon Echo):
A smart speaker released by Amazon in 2014. It has since become the brand name of Amazon smart speakers. It’s more commonly known by the name of its voice assistant, Alexa.
Edge Computing:
Edge computing refers to an architecture where computing or storing data is done at or near the source. On-device processing also refers to edge computing. Edge computing brings the compute near the data, while cloud computing brings the data near the compute. Both cloud and edge computing offer different advantages. Learn more about edge computing, cloud repatriation, and running models on the edge, on-prem, and in the cloud.
Edge Voice AI:
Edge voice AI refers to the technology of processing voice data locally on the device. Voice recognition on the edge does not send voice data to the cloud to process it and eliminates cloud-related costs. See the benefits of Edge Voice AI.
End-pointing:
End-pointing in speech recognition focuses on understanding when the user is done speaking to a machine. “End-of-utterance detection”, “end-of-query detection” or “end-of-turn” detection can also be used to define this challenge.
Both Rhino and Cheetah allow adjusting endpoint duration, so developers can decide what works best for their use case.
Embedded Systems:
An embedded system is a combination of hardware and software used to execute computer-related tasks. Embedded systems can work independently or can be integrated as a part of larger systems.
Expression:
Expressions refer to voice inputs by humans, i.e. how humans express themselves. They’re also known as spoken utterances or just utterances. In Natural Language Understanding, intents are composed of a collection of expressions. When a user's utterance matches any expression within an intent, the intent is detected. For example, "Make coffee" or "Make me a coffee" could be an expression, each of which signals the "Make Coffee" intent. Check out the Rhino syntax cheat sheet to learn more.
F
False Acceptance Rate (FAR):
A false accept or false positive indicates the presence of a condition when it’s not there. For example, researchers found that sentences such as “I can spare” and “I don’t like the cold” activate the Google smart speaker by mistake. A system's false acceptance rate (FAR) is the number of false acceptances divided by the total number of attempts. The total attempts consist of false positives, false negatives, true positives, and true negatives.
The False Acceptance Rate is used in evaluating the performance of Speaker Recognition, Voice Activity Detection, and Wake Word Detection software.
False Rejection Rate (FRR):
A false rejection or false negative indicates the absence of a condition when it is actually present. For example, when you say “Alexa” to activate a smart speaker and it misses the phrase, it’s considered a false reject. A system's false rejection rate (FRR) is the number of false rejections divided by the total number of attempts. The total attempts consist of false positives, false negatives, true positives, and true negatives.
The False Rejection Rate is used in evaluating the performance of Speaker Recognition, Voice Activity Detection, and Wake Word Detection software.
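Following the definitions above, both rates can be computed from the four outcome counts:

```python
def false_acceptance_rate(fp: int, fn: int, tp: int, tn: int) -> float:
    """False acceptances divided by the total number of attempts."""
    return fp / (fp + fn + tp + tn)

def false_rejection_rate(fp: int, fn: int, tp: int, tn: int) -> float:
    """False rejections divided by the total number of attempts."""
    return fn / (fp + fn + tp + tn)

# e.g., 2 false accepts and 5 false rejects in 1,000 attempts
print(false_acceptance_rate(fp=2, fn=5, tp=300, tn=693))  # 0.002
print(false_rejection_rate(fp=2, fn=5, tp=300, tn=693))   # 0.005
```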
Far-Field Speech Recognition:
Far-field speech recognition happens when there is a significant distance between the speaker and the microphone. For example, smart speakers process voice from afar, hence far-field speech recognition. Mobile phones process voice from a source that is close, hence near-field speech recognition.
Far-field speech recognition depends on many factors including the distance, ambient noise level, reverberation, quality of the microphone, and the use of audio frontend, such as beamforming.
G
GDPR:
GDPR stands for the General Data Protection Regulation, and it is a regulation on data protection and privacy in the European Union and the European Economic Area. GDPR considers any identifiable data, including voice as personal data. Learn more on GDPR and voice AI and how to ensure the privacy of voice data.
GenAI:
Generative artificial intelligence (GenAI) refers to AI systems capable of producing diverse content such as text, images, videos, and more in response to specific prompts. These AI generators, such as ChatGPT and DALL-E 2, produce narratives, images, and even short bits of code based on user-provided natural language.
Read our engineering article Inference Engine for X-bit Quantized LLMs to learn the differences between GenAI, LLMs, and Transformers in detail.
Google Actions:
Google Actions are like applications that third-party developers build for Google Assistant. Google offers Actions Builders and Actions SDKs to enable developers to build Google Actions. Google Actions are equivalent to Alexa skills for Google Assistant. Google is sunsetting Actions for Google Assistant by June 2023.
Google Assistant:
Google Assistant is Google’s virtual assistant released in 2016 to replace Google Now. It’s Google's version of Alexa by Amazon, Siri by Apple, Cortana by Microsoft and Bixby by Samsung.
Google Nest:
Google Nest is the brand name for Google’s smart home products, including Google Home smart speakers, streaming devices, and Nest thermostats. Google Assistant-powered smart speakers such as Google Home, Google Home Hub, and Google Home Mini were rebranded under Google Nest in 2019.
Grapheme-to-phoneme:
Grapheme-to-phoneme, also known as G2P, refers to the task of converting letters (grapheme sequence) to their pronunciations (phoneme sequence).
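As a toy illustration, the simplest G2P approach is a lexicon lookup; production systems also predict pronunciations for out-of-vocabulary words with trained models. The entries below are illustrative, loosely following ARPAbet-style symbols:

```python
# Toy pronunciation lexicon (illustrative entries).
LEXICON = {
    'speech': ['S', 'P', 'IY', 'CH'],
    'wake':   ['W', 'EY', 'K'],
    'word':   ['W', 'ER', 'D'],
}

def grapheme_to_phoneme(word: str) -> list:
    """Map a word (grapheme sequence) to its phoneme sequence."""
    return LEXICON.get(word.lower(), [])

print(grapheme_to_phoneme('speech'))  # ['S', 'P', 'IY', 'CH']
```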
H
HIPAA:
HIPAA stands for the Health Insurance Portability and Accountability Act. It’s a US federal law that governs the privacy and security of protected health information (PHI).
Homophone:
A homophone is a word that shares the same sound, i.e., pronunciation, with another word but differs in meaning or spelling. For example, to, too, and two are homophones. Recognizing homophones correctly is one of the widely known challenges in speech recognition.
HomePod:
A smart speaker released by Apple in 2018 and discontinued in 2021. The HomePod Mini, released in 2020, replaced it.
Hotword:
A hotword or hot word is a special phrase that activates dormant applications or devices. It’s also known as the wake word, a special application of KWS. Don’t forget to check out Picovoice’s guide on selecting a custom hotword.
Hotword Detection:
Hotword detection refers to the technology that detects a hotword to trigger an action. It’s an application of always-listening commands and is also known as wake word detection. Learn about the differences between always-listening commands and hotword detection in detail.
I
In-car Entertainment:
In-car entertainment (ICE) or In-vehicle infotainment (IVI) is a term used for a combination of hardware and software to provide control and entertainment to drivers. These systems include touch-free voice control, touch screens, touch-sensitive panels, or steering wheel controls.
Inference Engine:
An inference engine is a necessary component of an expert system. Its main function is to infer new information based on a predefined set of standards and rules.
Intent:
An intent in speech recognition focuses on the general meaning of utterances. For example, when a user says “Get me a large americano”, “I want a large americano” or “Can you please make me a large americano” the user intends to order a cup of coffee. "Intent" is one of the most commonly used NLU terms.
Interactive Voice Response (IVR):
Interactive Voice Response (IVR) is a voice control application. It's an automated system that enables callers to interact with the host system through pre-recorded voice responses via a telephone keypad or speech recognition. For example, to reset their password, a user may receive a pre-recorded voice prompt to dial 4 or any other number, or a prompt to direct them to tell what they want to achieve. When a user says “reset my password” or dials 4, their call gets routed to a menu or a specialist. IVRs are used to minimize the number of agents and offer a better service to the users.
K
Kaldi:
Kaldi is one of the best-known free and open-source (FOSS) automatic speech recognition (ASR) toolkits. Kaldi is widely used, especially by researchers and scientists. Learn more about other free and open-source transcription engines.
Keyword Spotting:
Keyword Spotting (KWS) in speech recognition focuses on detecting a key phrase within a stream of audio. Products leveraging KWS listen continuously to recognize the specific key phrase.
KWS enables voice activation and wake word detection. Porcupine Wake Word uses Keyword Spotting. Learn more about Keyword Spotting in voice recognition.
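For illustration, here is a minimal sketch of keyword spotting with Porcupine’s Python SDK; the AccessKey is a placeholder, and the snippet assumes the pvporcupine package’s documented API shape:

```python
import pvporcupine

porcupine = pvporcupine.create(
    access_key='${ACCESS_KEY}',  # placeholder for a Picovoice Console AccessKey
    keywords=['porcupine'])      # one of the built-in keywords

def on_audio_frame(pcm):
    # `pcm` is a frame of 16-bit samples of length porcupine.frame_length.
    keyword_index = porcupine.process(pcm)
    if keyword_index >= 0:
        print('keyword detected')
```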
Koala Noise Suppression:
Koala Noise Suppression is Picovoice’s Speech Enhancement engine. It processes speech data on the device without sending it to any 3rd party cloud, making it a perfect choice for real-time noise cancellation.
L
Large Language Model:
Large Language Models, LLMs for short, are artificial intelligence (AI) models trained on massive amounts of data to generate human-like language, understand context, and perform various natural language processing (NLP) tasks. LLMs use complex algorithms and neural networks to learn patterns and relationships in language, enabling them to generate coherent text that is often indistinguishable from human-written content.
Read more on the difference between Transformer, LLM, and Generative AI.
Latency:
In computing, latency refers to the delay for data to pass from one point to another. From a user experience point of view, the delay between a user request and a product’s response is perceived as friction. Latency in speech recognition hinders the experience significantly for applications such as AR/VR. For mission-critical applications such as voice-activated surgical robots or autonomous devices, even milliseconds of delay could cause fatal results. For industrial voice assistants, latency causes fluctuations in productivity. Edge Voice AI eliminates latency; learn more about the Edge Voice AI benefits.
Lemmatization:
Lemmatization refers to removing affixes based on morphological analysis of words to return the base or dictionary form of a word, which is known as the lemma.
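For example, with NLTK’s WordNet lemmatizer (requires the 'wordnet' corpus via nltk.download):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))  # 'run'
print(lemmatizer.lemmatize('better', pos='a'))   # 'good'
```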
Leopard Speech-to-Text:
Leopard Speech-to-Text is a local automatic speech recognition (ASR) engine that converts speech to text. Leopard processes voice data locally on the device without sending it to a 3rd-party cloud. It outperforms competing cloud-based and offline speech-to-text solutions by wide margins.
Lexicon:
Lexicon refers to knowledge of words. The Economist found that most adult native speakers know 20,000-35,000 words. The lexicons of Picovoice products exceed 200,000 words in English.
LLM Compression Techniques:
As Large Language Models (LLMs) become more complex, they also grow in size. This causes issues regarding latency and usage in smaller devices. LLM compression techniques minimize the size of LLMs and their computing requirements to improve the overall user experience.
Learn more about LLM quantization and Picovoice’s approach: X-bit quantization.
Local Large Language Model:
A Local Large Language Model is a Large Language Model (LLM) that is deployed and run on a local device, such as a computer, smartphone, or edge device, rather than on a remote cloud server. Local LLMs can be integrated into any software, web-based, mobile, or desktop applications.
Running LLMs locally allows for faster processing, reduced latency, and increased security and compliance as sensitive data does not need to be transmitted to the cloud. Local LLMs are particularly useful for applications that require real-time language processing, such as voice assistants, language translation apps, and chatbots.
Learn more about the benefits of edge computing. Also, don’t forget to check picoLLM, the world’s only end-to-end Local LLM platform!
Local Speech Recognition:
Local speech recognition refers to the technology of processing voice data on-device, locally where the voice data is generated or stored. Trained AI models are deployed on the device to process voice data locally without sending the data to the cloud. Since voice data is not sent to the cloud, cloud- or connectivity-related costs do not occur. Better performance and user experience are achieved by eliminating latency and connectivity issues and minimizing power consumption. Local speech recognition also offers cost-effectiveness at scale since voice commands are not charged per API call.
Look Up Table (LUT):
A data structure that maps input values to matching output values. It is usually made up of two primary parts: the input value and the related output value. LUTs are frequently used to efficiently obtain and manipulate data based on established mappings in rule-based systems and data processing applications.
Look Up Table operations are involved in Non-Uniform Quantization.
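As a minimal sketch, a lookup table can map quantized codes back to real values, as in non-uniform quantization; the centroid values below are illustrative:

```python
# Illustrative 2-bit non-uniform quantization: each code maps to a learned
# centroid rather than to an evenly spaced value.
DEQUANT_LUT = {
    0b00: -0.92,
    0b01: -0.21,
    0b10:  0.18,
    0b11:  0.88,
}

def dequantize(codes):
    """Map quantized codes back to real values via the lookup table."""
    return [DEQUANT_LUT[code] for code in codes]

print(dequantize([0b00, 0b11, 0b10]))  # [-0.92, 0.88, 0.18]
```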
M
Massive Multitask Language Understanding (MMLU):
A widely used benchmark for evaluating models in artificial intelligence (AI) and natural language processing (NLP). MMLU tests a single model across a variety of language tasks, ranging from simple text categorization to complex question answering and translation, measuring how effectively AI systems comprehend and communicate in human language.
Check out the open-source LLM Compression Benchmark to see how picoLLM and GPTQ perform on MMLU.
Microphone Array Beamforming:
Microphone array beamforming is an audio front-end application. Beamforming processes signals from multiple omnidirectional sources (i.e., microphones) to focus on the most prominent sound (i.e., the user’s voice) and disregard the other sounds (i.e., noise).
Morphological Segmentation:
Morphological segmentation is an NLP method that breaks words into meaning-bearing morphemes, the smallest grammatical units of speech. The three morphemes of "unbreakable" are un- (signifies not), break (the root), and -able (signifies an ability).
N
Natural Language Generation (NLG):
Natural Language Generation (NLG) is a subfield of Natural Language Processing (NLP). NLG deals with generating a text response in natural language. Read more about the differences among NLP, NLG, and NLU.
Natural Language Processing (NLP):
Natural Language Processing (NLP) is a field combining linguistics, computer science, and artificial intelligence that deals with the interactions between computers and humans via natural language inputs of text or audio. Read more about the differences among NLP, NLG, and NLU.
Natural Language Understanding (NLU):
Natural Language Understanding (NLU) is a subfield of Natural Language Processing (NLP). NLU deals with machine reading and comprehension to understand the intent, i.e., the meaning of what it reads. Read more about the differences among NLU, NLP, and NLG.
Neural Networks:
Artificial Neural Networks (ANN), or neural networks for short, refer to artificial networks of neurons or nodes for solving artificial intelligence (AI) problems. Artificial neural networks are inspired by the biological neural networks that constitute the brain. There are different architectures and frameworks which can affect the performance of neural networks.
No-code Development:
A no-code software development platform, or no-code development for short, enables both developers and non-programmers to create applications through graphical user interfaces without writing code as in traditional computer programming. Picovoice’s Shepherd enables no-code voice AI for MCUs.
Noise Suppression:
Noise suppression is like a filter to remove distracting ambient noises such as keyboard typing or fan noise to create a better experience. Koala Noise Suppression is the only high-quality, real-time, cross-platform, and production-ready noise cancellation software available to any developer.
No-input Error:
No-input error is a type of error that occurs when an automatic speech recognition (ASR) engine doesn’t detect the speech input, although it exists.
No-Match Error:
No-match error is a type of error that occurs when an automatic speech recognition (ASR) engine cannot match the speech input with the responses that it expects or knows.
O
OEMs:
An acronym for Original Equipment Manufacturers, OEMs manufacture products or components used by another company. Dell and HP are examples of OEMs for laptops and desktops.
Octopus Speech-to-Index:
Octopus is Picovoice’s Speech-to-Index Engine that indexes speech directly without relying on a text representation. Octopus's acoustic-only approach boosts accuracy by removing the out-of-vocabulary limitation and eliminating the problem of the competing hypothesis. It’s the best Speech-to-Text Alternative for Search. It outperforms speech-to-text-based solutions by wide margins when it comes to finding phrases in audio files, media asset management, legal e-discovery, dialogue search, or social media listening.
On-device Voice Processing:
On-device speech recognition, on-device voice recognition, on-device voice processing, on-device voice AI, or Edge Voice AI refers to performing inference with models directly within the platform such as a mobile app, web browser, or an MCU without sending voice data to the cloud. On-device voice processing eliminates cloud-related costs and offers improved experience and better performance.
Open-domain Large Vocabulary Speech Recognition:
Open-domain Large Vocabulary Speech Recognition refers to speech-to-text or automatic speech recognition. Open-domain Large Vocabulary Speech Recognition is required when a voice use case is not limited to a given domain or confined within a fixed set of commands. It’s useful when there is an inherent interest in capturing the transcription, such as meeting transcription, note-taking, and voice typing. Check out Picovoice’s strategy guide to select the best technology for your use case.
Open Weight:
Open weights of LLMs refer to model weights that are freely available with minimal or no restrictions. It’s a license framework created especially for neural network weights (NNWs). Open-weight licensing is different from conventional open-source software licensing. Open-weight models allow machine learning researchers to make changes such as quantization and fine-tuning using the weights. However, it doesn’t necessarily mean they are open source.
LLaMA, Mistral, and Mixtral are examples of open-weight models, while MPT and Dolly are examples of open-source large language models.
Orca Text-to-Speech:
Orca Text-to-Speech is Picovoice’s voice generator. It converts written text into spoken audio output without network latency or jeopardizing user privacy.
P
Phoneme:
A phoneme is a unit of sound. As words are written with letters, they are pronounced (spoken) with phonemes. For example, the word "speech" has six letters and four phonemes (s-p-E-ch). The International Phonetic Alphabet represents phonemes.
Phonetic Search:
Phonetic search refers to keyword or query search within audio files. It can be thought of as the “ctrl/cmd + F” function for text-based content. However, since voice data is not structured, it’s not easy to perform a similar task. One approach to structuring voice data is to transform voice into text by using an STT engine and then index the text data. Another is to index the audio directly. Octopus, Picovoice’s Phonetic Search Engine, uses the second approach, hence achieves high accuracy even with proper nouns. Learn more about phonetic search or phonetic search applications.
Porcupine Wake Word:
Porcupine Wake Word trains and detects custom wake words and always-listening commands, offering a truly hands-free experience by replacing the need for a push-to-talk (PTT) button for rapid, frictionless interactions. Don’t forget to check out the open-source wake word benchmark and evaluate the performance of wake word detection engines. If you don’t have test data, use open datasets for keyword spotting.
Push-to-talk:
Push-to-talk, also known as press-to-transmit, is a method of initiating voice transmission by switching from the reception mode. Push-to-talk buttons were first used for radio devices such as walkie-talkies, then adopted by other form factors such as mobile phones or smart speakers. Some multi-modal products may have soft push-to-talk buttons, instead of physical ones. In the last few years, in order to offer a truly hands-free experience to the users, push-to-talk buttons have been replaced by wake words.
Q
Quantization:
Quantization is a process where input values from a large set of data are mapped to output values in a smaller data set. The goal of quantization is to reduce the size of a model to reduce latency and enable the deployment of AI models on smaller devices.
Learn more about LLM quantization and X-bit quantization.
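As a minimal sketch, here is uniform (affine) quantization of 32-bit floats to 8-bit integers, one common approach; schemes such as picoLLM’s X-bit quantization differ in their details:

```python
import numpy as np

def quantize_uniform(x: np.ndarray, num_bits: int = 8):
    """Map floats onto a uniform integer grid (affine quantization)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

weights = np.array([-1.2, -0.3, 0.0, 0.7, 1.5], dtype=np.float32)
q, scale, zero_point = quantize_uniform(weights)
print(dequantize(q, scale, zero_point))  # approximately the original values
```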
R
Real-time Factor:
The real-time factor (RTF) is the most common metric used to measure the speed of automatic speech recognition (ASR) solutions. It is calculated by dividing the time taken to transcribe the audio by the duration of the audio. For example, an RTF value of 0.5 means that the time spent to transcribe the file is half of the length of the audio file. In other words, transcribing an hour-long file takes 30 minutes. You can test Leopard’s RTF by yourself using the Free Plan.
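Following the definition above:

```python
def real_time_factor(processing_time_sec: float, audio_duration_sec: float) -> float:
    """RTF below 1.0 means the engine transcribes faster than real time."""
    return processing_time_sec / audio_duration_sec

# Transcribing an hour-long file in 30 minutes -> RTF = 0.5
print(real_time_factor(30 * 60, 60 * 60))  # 0.5
```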
Reverberation:
Reverberation in acoustics is the persistence of sound, like an echo, after a sound is produced. Reverberation is created when a reflected sound builds up and then decays as it is absorbed by the surfaces in the space, such as furniture, or by the air.
Rhino Speech-to-Intent:
Rhino Speech-to-Intent is a context-aware SLU Engine. It’s the best Speech-to-Text Alternative (STTA) for voice assistants. It directly infers intent from spoken commands within a given context of interest, in real-time whereas most SLU engines infer the intent from text. By eliminating the need for text representation, Rhino responds faster and more accurately.
ROC (Receiver Operating Characteristic):
The ROC curve is used to evaluate the accuracy of binary classifiers, such as wake word detection. A receiver operating characteristic (ROC) curve plots true positive rates (TPR) against false positive rates (FPR) at various sensitivity values. The larger the area under the curve, the better the accuracy of the product. See how the ROC curve is used to benchmark wake word engines. A version of ROC, the Detection Error Trade-off (DET) curve, is used to evaluate the performance of speaker recognition engines.
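As a minimal sketch, ROC points can be computed by sweeping a detection threshold over classifier scores (scikit-learn’s roc_curve performs the same computation):

```python
import numpy as np

def roc_points(scores: np.ndarray, labels: np.ndarray, thresholds) -> list:
    """Return (FPR, TPR) pairs, one per detection threshold."""
    points = []
    for t in thresholds:
        predicted = scores >= t
        tp = np.sum(predicted & (labels == 1))
        fp = np.sum(predicted & (labels == 0))
        fn = np.sum(~predicted & (labels == 1))
        tn = np.sum(~predicted & (labels == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

scores = np.array([0.9, 0.8, 0.4, 0.3, 0.1])  # classifier confidence scores
labels = np.array([1, 1, 0, 1, 0])            # 1 = keyword present
print(roc_points(scores, labels, thresholds=[0.2, 0.5, 0.85]))
```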
S
SDK:
SDK stands for software development kit. It is a set of tools in one installable package, provided by the vendor of a hardware or software platform.
Picovoice supports a variety of SDKs to enable more developers to build voice AI-powered products: Android, Flutter, iOS, React Native, Web, Angular, Vue, .NET, C, Go, Java, Node.js, Python, Rust, and Unity. They cover mobile, desktop, and server applications (Linux and Linux-based systems such as Ubuntu, plus macOS and Windows) as well as web applications (Chrome, Safari, Firefox, Edge). Visit the docs page for Speech-to-Text, Speech-to-Index, Speech-to-Intent, Wake Word, and Voice Activity Detection to see the SDKs. Every SDK comes with a demo application. Don’t forget to check out Picovoice’s blog for tutorials such as Adding Subtitles with Python, Speech-to-Text using JavaScript, or Speech-to-Text using Node.js.
Search by Voice:
Search by voice, or search with voice, allows users to submit search queries using voice input instead of typing. Phonetic search can replace typed search directly or complement it. Check out Picovoice’s Google Chrome Extension demo for a hands-free Google Search experience.
Sensitivity:
Sensitivity is a value between 0.0 and 1.0. While a value of 0.0 suppresses all audio, a value of 1.0 suppresses none. A higher sensitivity value gives a lower miss rate at the expense of a higher false alarm rate. One should pick a sensitivity parameter that suits the application's requirements.
Shepherd No-code Platform:
Picovoice Shepherd is the first no-code platform for building voice interfaces on microcontrollers. Picovoice Shepherd accelerates prototyping, mitigates technical risks, and shortens time-to-market. Paired with Picovoice Console, users can deploy custom voice models onto microcontrollers instantly.
Single Instruction Multiple Data (SIMD):
A parallel computing technique that allows a single command to process several data points at once. When doing tasks that can be parallelized, like digital signal processing, multimedia processing, and scientific computations, this technique is employed to increase performance. Data-intensive operations can be processed more quickly because of SIMD, which allows a processor to execute the same action on many pieces of data in a single clock cycle.
Picovoice engines run on WebAssembly with SIMD support.
Siri:
Siri is Apple’s digital assistant technology and wake word (Hey Siri). It was released in 2011. Siri is capable of voice interactions and real-time information gathering by interacting with the cloud.
Slot:
A slot or an entity is a set of specific pieces of information from an utterance to help machines understand the intent. Slots are accessible from various intents in the same model. For example, a slot may consist of the locations in a house, such as the living room, kitchen, or bedroom. Slots could be accessed from different intents such as “turn on lights” and “turn off lights” in a model trained for a smart home application. Slot is one of the most commonly used NLU terms. To make the development process easier, Picovoice also offers built-in slots. Don’t forget to check out the Rhino syntax cheat sheet for details.
SOTA Algorithms:
SOTA is an acronym for ‘State of the Art’. SOTA algorithms are the current superior performing algorithms in the field.
Speaker Diarization:
Speaker Diarization answers the question "Who spoke when?" by segmenting audio recordings for each speaker participating in a conversation. Most speech-to-text engines embed speaker diarization as a feature. Hence, there are not many standalone speaker diarization API and SDK options available for developers. Falcon Speaker Diarization is not just one of the few available options, but also the most efficient one.
Speaker Recognition:
Speaker Recognition is the technology used to identify and verify speakers based on their distinguishable voice characteristics. Speaker Recognition is a complex technology; there are several factors one should consider before choosing a speaker recognition engine. Certain use cases require language-agnostic and text-independent speaker recognition engines, like Eagle Speaker Recognition.
Speech Corpus:
A speech corpus, or spoken corpus, is a large database of audio files and text transcriptions. A corpus can be based on any written or spoken data, including legal documents, interviews, and social media. Speech corpora are crucial to train and test voice AI models. Check out the best-known open-source speech corpora for speech-to-text, natural language understanding, and keyword spotting.
Speech Enhancement:
Speech Enhancement, also known as Noise Suppression, consists of methods and techniques aiming to improve speech quality in terms of intelligibility. You can use Koala Noise Suppression to enhance the speech quality and user experience.
Speech Intelligibility:
Speech Intelligibility shows the percentage of speech that a listener can understand. Various factors, such as the articulation of a speaker or background noises, affect Speech Intelligibility.
Speech Quality:
Speech Quality is a metric used to measure the performance of speech enhancement.
Speech Recognition:
Speech Recognition is a common name for technology and methodologies that convert unstructured voice data into structured text. While speech data cannot be recognized by computers directly, trained AI algorithms recognize and transform them into text for human-computer interactions or analysis. Despite the common myth, running speech recognition software does not require powerful computers, even a web browser, Raspberry Pi, or an MCU would work.
Speech-to-Text (STT):
Common name for technology and methodologies that convert unstructured voice data into structured text. It’s also known as Automatic Speech Recognition (ASR) and Open-domain Large Vocabulary Speech Recognition. Using hybrid models or end-to-end speech-to-text models affects the performance of the voice products.
Picovoice offers two speech-to-text engines. Leopard Speech-to-Text for recordings and Cheetah for real-time streaming. Both Leopard and Cheetah are on-device speech-to-text engines that process voice data locally, resulting in private, accurate, and fast experiences with zero latency.
Speech-to-Intent:
Speech-to-Intent is a term coined by Picovoice after developing Rhino. It’s the best Speech-to-Text alternative to develop use-case-specific voice assistants. Rhino infers intents directly from speech and applies modern SLU principles. The name refers to the direct conversion of speech to intent without the text in between.
Speech-to-Index:
Speech-to-Index indexes speech directly without relying on a text representation. It is the best Speech-to-Text alternative to search. Picovoice coined this term after developing Octopus. Text-based indexing and search algorithms do not work accurately for audio search applications. Thus, Octopus was initially developed as a response to this market demand.
Stemming:
Stemming is a method used in NLP and NLU. It reduces a word to its word stem. In simple terms, it chops off the ends of words and removes derivational affixes (suffixes and prefixes).
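For example, with NLTK’s Porter stemmer, which illustrates the crude chopping (compare 'happili' with the lemma 'happy'):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['running', 'flies', 'happily']:
    print(word, '->', stemmer.stem(word))
# running -> run, flies -> fli, happily -> happili
```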
T
Text-to-Speech (TTS):
Text-to-Speech (TTS), also known as voice generation, is a technology that converts text to artificially produced speech. Orca Text-to-Speech is Picovoice’s text-to-speech engine that converts written text into spoken audio output.
Token:
A token is a meaningful unit of text, typically a word, part of a word, or a sequence of characters that carries semantic value. The term has been popular with LLMs as it serves as the basic building block for understanding and generating language.
Trigger Word:
A trigger word is another term for a wake word, wake-up word, hotword, or triggering word, a special form of KWS. A trigger word is a special phrase recognized by an application that initiates, i.e., triggers, a process. It’s mostly used to activate a dormant application, hence it triggers the application to listen for further commands. There are not many vendors offering trigger word detection. Porcupine Wake Word is Picovoice’s trigger word engine and developers’ favorite!
True Negative:
True negative or true rejection indicates the rejection of a condition when it is not there.
True Positive:
True positive or true acceptance indicates the acceptance of a condition when it is there.
U
Utterance:
An utterance means a spoken word or statement. It is a continuous piece of speech beginning and ending with a clear pause. Utterance is one of the most commonly used NLU terms.
User-centered Design:
User-centered design is a process of developing a product or service by keeping users at the core of the development process. Picovoice encourages organizations to follow user-centered design principles while developing voice products by offering an easy-to-use console that enables an iterative development process. We listed five tips to apply design thinking principles while building voice user interfaces and got five tips from conversation design expert Erika Hall.
V
Voice Activation:
Voice activation allows users to activate applications simply by talking, i.e., using their voice, instead of a touchscreen or buttons. It can be done via a wake word (Porcupine) or voice activity detection (Cobra). Learn more about voice activation and how to enable it.
Voice Activity Detection:
Voice activity detection, or VAD for short, is the technology used to detect the presence of a human voice and distinguish it from other sounds or noises. Check the What’s Voice Activity Detection article to learn more or the voice activity detection benchmark to compare Cobra with webRTC VAD. If you’re ready to build, check this Python tutorial or start with your favorite SDK.
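For illustration, a minimal sketch with Cobra’s Python SDK; the AccessKey is a placeholder, and the snippet assumes the pvcobra package’s documented API shape:

```python
import pvcobra

cobra = pvcobra.create(access_key='${ACCESS_KEY}')  # placeholder AccessKey

def on_audio_frame(pcm):
    # `pcm` is a frame of 16-bit samples of length cobra.frame_length.
    voice_probability = cobra.process(pcm)
    if voice_probability > 0.5:  # illustrative decision threshold
        print('voice detected')
```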
Voice Assistant:
A voice assistant is a piece of software that responds to questions or commands to carry out tasks or provide services on behalf of a user. It recognizes and reacts to voice instructions using artificial intelligence (AI) and machine learning, giving users a more intuitive and natural method to interact with their devices.
Voice Biometrics:
Voice biometrics is the technology that identifies specific markers within audio data. It’s like an audio version of a fingerprint that is unique to a person’s identity. Voice Biometrics is also known as Voice Identification, Speaker Recognition, and Voice Verification. Try Eagle Speaker Recognition, Picovoice’s voice biometrics engine!
Voice Commands:
Voice Commands are used to perform specific actions while interacting with machines. Users may prefer to interact with machines by uttering commands instead of touching or typing when they:
- Multitask - as in industrial voice assistants
- Perform straightforward tasks - as in IVRs
- Use companion apps - as in eldercare
- Cannot type or touch - as in AR, VR, XR
- Care about hygiene - as in vending machines
Voice User Interface (VUI):
Voice User Interface (VUI) is an interface just like the Graphical User Interface (GUI). It allows users to interact with machines via voice. A VUI’s response can be triggering an action, retrieving information, or completing an end-to-end task. We listed the challenges of building VUIs on mobile, five tips to apply design thinking principles while building voice user interfaces, and five tips from conversation design expert Erika Hall.
W
Wake Word:
A wake word, wake-up word, or hotword is a special phrase used to trigger an action. Most of the time, wake words are taken for granted, as they are short and simple voice commands. However, good wake word detection should be use-case-specific, accurate, and run on the device.
Do not forget to check out Picovoice’s guide on selecting a wake word and train your wake word with Porcupine when you’re ready, even if you want to run on an MCU or in a web browser.
Wake Word Detection:
Wake word detection refers to the task of recognizing an utterance, i.e., a special phrase, to activate a device or an application. Wake word detection is an application of always-listening commands and KWS. Porcupine is Picovoice’s wake word engine. Do not forget to check out the open-source wake word benchmark before building.
Wake-up Word:
Wake-up word, or WuW in short, is another term for wake word, hotword, or trigger word and is a special application of KWS. Learn about Keyword Spotting and the differences between these terminologies in detail.
webRTC VAD (Voice Activity Detection):
webRTC VAD is a module within Google’s free and open-source webRTC initiative. webRTC provides real-time communication capabilities via APIs, and the voice activity detection module is used to classify whether an audio stream consists of voice data. webRTC VAD is mainly for telecommunication applications. Check out webRTC VAD’s performance compared to Cobra VAD.
WebSocket:
WebSocket is a bidirectional communication protocol that transmits data in both directions over a single connection simultaneously (full duplex).
Word Error Rate:
The word error rate (WER) is the most common metric to evaluate speech recognition engines. WER is calculated by dividing the sum of the errors by the total number of reference words. In other words, WER shows how close the real output (i.e., transcription by an ASR) is to the intended output (i.e., the original text). We prepared a beginners’ guide and things to know about WER and how to improve speech-to-text accuracy.
Don’t forget to compare Picovoice Speech-to-Text WER with alternatives.
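Following the definition above, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words; a minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions against a four-word reference -> WER = 0.5
print(word_error_rate('turn on the lights', 'turn off the light'))
```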