Glossary

A


Accuracy

The degree to which a model's predictions are correct. It's measured as the ratio of correct predictions to the total number of predictions. The higher the accuracy, the better the model performs.
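
As a minimal illustration of the ratio, accuracy can be computed as:

```python
def accuracy(predictions, labels):
    """Ratio of correct predictions to the total number of predictions."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(predictions)

# Three of the four predictions match the labels.
print(accuracy(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"]))  # 0.75
```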

Acoustic Echo Cancellation

The acoustic echo cancellation (AEC) technique is designed to filter out unwanted sounds such as echoes or reverberation. AEC may be required when voice comes from a far end, such as a loudspeaker, or from a reverberant space. Picovoice products are designed to function robustly in the presence of noise and reverberation. However, depending on the target environment, AEC solutions provided by audio front-end companies might be required for better performance.

Acoustic-only approach

The acoustic-only approach processes speech directly without relying on another form of representation, such as text. It removes the out-of-vocabulary limitation and eliminates the problem of competing hypotheses, such as homophones. [See: Octopus]

Alexa

Alexa is Amazon’s wake word and digital assistant technology, released in 2014. Alexa is capable of voice interactions and real-time information gathering by interacting with the cloud. While Alexa is the best-known digital assistant among end-users, it has also been controversial due to privacy concerns.

Alexa Skill

Alexa Skills are like applications that third-party developers build for Alexa. Amazon offers the Alexa Skills Kit (ASK) to enable developers to build skills, which are published in the Alexa Skills Store after a certification process. Alexa Skills are the Alexa equivalent of Google Actions.

Automatic Speech Recognition (ASR)

Common name for technology and methodologies that convert unstructured voice data into structured text. While speech data cannot be recognized by computers directly, trained AI algorithms recognize and transform them into text for human-computer interactions or analysis. [See: NLP (Natural Language Processing), NLU (Natural Language Understanding), Open-domain Large Vocabulary Speech Recognition, Speech-to-Intent, Speech-to-Text, Rhino, Cheetah, Leopard]

B


Benchmark

A benchmark is a tool to evaluate the relative performance of hardware or software products on certain tasks by running standard tests and experiments.

While developing our products, we noted that every company claimed its product was the best or most accurate voice recognition product on the market. To empower customers to make data-driven decisions, we developed benchmark frameworks for all of our products and shared them on our GitHub. We share not only the results of the tests we run but also the files and documentation required to reproduce them.

Branded Wake Word

Branded wake words, custom wake words, or branded custom wake words are wake words trained for specific purposes or brand uses. Enterprises train their own branded wake words instead of using other brands such as Alexa, Okay Google or Hey Siri. Read our tips for choosing a wake word. [See: Wake Word, Porcupine]

Bixby

Bixby is Samsung’s virtual assistant, similar to Alexa or Google Assistant, released in 2017 to replace S Voice.

Built-in Slots

Built-in slots are pre-defined slots that handle and recognize data for common requirements such as letters, numbers, and ordinal numbers. These slots are prefixed with 'pv.' to distinguish them from custom slots. Built-in slot types include pv.Alphabetic, pv.Alphanumeric, pv.Percent, pv.SingleDigitInteger, and pv.SingleDigitOrdinal; more examples and information can be found in the documentation. [See: Slots]

C


Cheetah

Cheetah is the first and only commercially available and supported real-time local Speech-to-Text (STT) engine on the market. Cheetah doesn’t require connectivity to transcribe voice data to text. It outperforms competing cloud-based or offline Automatic Speech Recognition (ASR) solutions by wide margins. [See: Automatic Speech Recognition, Open-domain Large Vocabulary Speech Recognition]

Console (Picovoice Console)

The Picovoice Console is a self-service, user-friendly interface for building voice-enabled products with Picovoice. It’s where the magic happens.

Cobra

Cobra is the voice activity detection engine that detects human voice and distinguishes it from other sounds and noise. It was initially developed as an internal tool; given its superior performance, we made it publicly available. [See: Voice Activity Detection]

Context

A context consists of a set of expressions (spoken commands), intents, and slots (intent arguments) within a domain of interest. From a developer perspective, a context for a "smart lighting system" is built by using intents (turn on, turn off), slots (room: living room, bedroom, kitchen) and expressions (turn on the “room” lights). From an end-user perspective, a context understands the intent (turn on, living room) of the voice commands (turn on the living room lights) for controlling lights in a home.
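
As an illustrative sketch only (the structure and names below are hypothetical, not the actual Picovoice Console context format), the smart lighting context could be modeled as:

```python
# Hypothetical representation of the "smart lighting system" context described
# above. The real Picovoice Console uses its own context format; this sketch
# only mirrors the concepts: intents, expressions, and slots.
smart_lighting_context = {
    "slots": {
        "room": ["living room", "bedroom", "kitchen"],
    },
    "intents": {
        "turnOn": ["turn on the $room lights"],
        "turnOff": ["turn off the $room lights"],
    },
}

# A spoken command like "turn on the living room lights" would match the
# "turnOn" intent with the slot value room="living room".
print(sorted(smart_lighting_context["intents"]))  # ['turnOff', 'turnOn']
```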

Cortana

Cortana is Microsoft’s voice assistant, similar to Alexa and Google Assistant. It was launched in 2014. Microsoft ended Cortana support for various platforms, including iOS, Android, and its own Surface Headphones, and removed it from the marketplaces in 2021.

Cross-platform

In computing, cross-platform, multi-platform or platform-independent refers to software that is developed to work across various computing platforms. All Picovoice products are cross-platform and support Raspberry Pi, BeagleBone, Arm Cortex-M, Arduino, NVIDIA Jetson, Android, iOS, Linux, macOS, Windows, and modern web browsers. [See: Design once, deploy anywhere]

D


Deep learning

Deep learning, deep structured learning, or hierarchical learning is a type of machine learning and artificial intelligence. Deep learning mimics the way the human brain works and how humans gain certain types of knowledge. Learning can be supervised, semi-supervised, or unsupervised.

Design thinking

Design thinking is a process to develop a product or service by empathizing with users and prioritizing their needs and pain points. It’s an iterative process that includes observation, reframing problems, ideation, and testing to make sure the solution is what end-users want.

Picovoice encourages organizations to follow design thinking principles while developing voice products by offering an easy-to-use console that enables an iterative development process. [See: User-centered design]

Design once, deploy anywhere

Design once, deploy anywhere or design once, deploy everywhere refers to software that is developed to work across various computing platforms. All Picovoice products are cross-platform and support Raspberry Pi, BeagleBone, Arm Cortex-M, Arduino, NVIDIA Jetson, Android, iOS, Linux, macOS, Windows, and modern web browsers. [See: Cross-platform]

E


Eavesdrop

It’s a verb for secretly listening to a conversation. It has been associated with smart speakers such as Amazon Echo, Google Home, and Apple HomePod after consumers learned that their conversations were recorded. [1] In 2020, a former Amazon executive revealed that he would switch off his smart speaker when he had private conversations. [2]

Echo (Amazon Echo)

A smart speaker released by Amazon in 2014. It has since become the brand name for Amazon smart speakers.

Edge Computing

Edge computing refers to an architecture where computing or storing data is done at or near the source, instead of relying on the cloud. Both cloud and edge computing offer different advantages. Edge computing becomes more advantageous when it comes to latency, bandwidth, security and privacy. [See: On-device processing]

Edge Voice AI

Edge voice AI refers to the technology of processing voice data on the device where trained AI models are deployed. Since no internet connectivity is required to process voice data, it works offline and ensures privacy. Better performance and user experience are achieved by eliminating latency and connectivity issues and minimizing power consumption; on battery-powered devices, internet access (especially over LTE and Wi-Fi) is a major power drain. Edge voice AI also offers cost-effectiveness at scale since voice commands are not charged per API call. [See: Local Speech Recognition]

End-pointing

The challenge of detecting when the user has finished speaking in spoken language understanding. It becomes even more challenging in the presence of noise. A special kind of noise is a second or additional speakers. To address this challenge, Rhino comes with an optional feature that allows voice command recognition even in the presence of overlapping speech.

Expression

Intents are composed of a collection of expressions. When a user's utterance matches any expression within an intent, the intent is detected. For example, "make coffee" or "make me a coffee" could be expressions, each of which signal the "Make Coffee" intent.

F


False Acceptance Rate (FAR)

A false accept, or false positive, indicates the presence of a condition when it’s not there. For example, researchers found that although “Okay/Hi/Hey Google” are the wake words for Google Home Mini, sentences like “I can spare” and “I don’t like the cold” activated the Google smart speaker. [3] That's considered a false acceptance. A system's false acceptance rate (FAR) is the number of false acceptances divided by the total number of attempts.

False Rejection Rate (FRR)

A false rejection, or false negative, indicates the absence of a condition when it is actually present. For example, when you say “Alexa” to activate a smart speaker and it misses, that's considered a false rejection. A system's false rejection rate (FRR) is the number of false rejections divided by the total number of attempts.
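
Both metrics reduce to simple ratios; a minimal sketch (the example counts are made up):

```python
def false_acceptance_rate(false_accepts, total_attempts):
    """FAR: number of false acceptances divided by total attempts."""
    return false_accepts / total_attempts

def false_rejection_rate(false_rejects, total_attempts):
    """FRR: number of false rejections divided by total attempts."""
    return false_rejects / total_attempts

# e.g. 2 false wakes across 100 non-wake utterances,
# and 5 missed wakes across 100 wake utterances.
print(false_acceptance_rate(2, 100))  # 0.02
print(false_rejection_rate(5, 100))   # 0.05
```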

Far-Field Speech Recognition

The main difference between near-field and far-field speech recognition is the distance between the source and the compute device. While smart speakers process voice data from a distance (far-field), mobile phones process it when the source is close (near-field). Far-field speech recognition performance depends on many factors, including the distance, ambient noise level, reverberation (echo), microphone quality, and the audio frontend used (if any). While Picovoice products are tested in various environments, it is recommended to try out our technology in the target environment. If the target environment is noisy and/or reverberant and the user is a few meters away from the microphone, a multi-microphone audio frontend can be beneficial.

G


Google Actions

Google Actions are like applications that third-party developers build for Google Assistant. Google offers the Actions Builder and Actions SDK to enable developers to build Google Actions. Google Actions are the Google Assistant equivalent of Alexa Skills.

Google Assistant

Google Assistant is Google’s virtual assistant, released in 2016 to replace Google Now. It’s Google's counterpart to Amazon's Alexa, Apple's Siri, Microsoft's Cortana, and Samsung's Bixby.

Google Nest

Google Nest is the brand name for Google’s smart home products, including Google Home smart speakers, streaming devices, and Nest thermostats. Google Assistant-powered smart speakers such as Google Home, Google Home Hub, and Google Home Mini were rebranded under Google Nest in 2019.

Grapheme-to-phoneme

Grapheme-to-phoneme (G2P) is the task of converting letters (a grapheme sequence) to their pronunciation (a phoneme sequence).
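
A dictionary lookup is the simplest form of G2P; real systems back off to a trained model for out-of-vocabulary words. A toy sketch with a made-up two-word lexicon:

```python
# Tiny illustrative pronunciation lexicon (ARPAbet-style phoneme symbols).
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "cat": ["K", "AE", "T"],
}

def grapheme_to_phoneme(word):
    """Look the word up; real G2P systems fall back to a trained model
    for words missing from the lexicon."""
    phonemes = LEXICON.get(word.lower())
    if phonemes is None:
        raise KeyError(f"'{word}' is out of vocabulary for this toy lexicon")
    return phonemes

print(grapheme_to_phoneme("speech"))  # ['S', 'P', 'IY', 'CH']
```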

H


Hyper-customization

Etymologically, hyper- means excessive; hyper-customization or hyper-personalization refers to the flexibility and ability to tailor products or services to individual user wants and needs. Picovoice enables organizations to develop voice products on their terms, with their brands and their use cases.

Homophone

A homophone is a word that sounds the same as another, i.e. has the same pronunciation, but differs in meaning or spelling. For example, to, too, and two are homophones. Homophones are a widely known challenge in voice recognition.

HomePod

A smart speaker released by Apple in 2018 and discontinued in 2021. The HomePod mini, released in 2020, replaced the original HomePod.

I


In-car Entertainment

In-car entertainment (ICE) or In-vehicle infotainment (IVI) is a term used for a combination of hardware and software to provide control and entertainment to drivers. These systems include touch-free voice control, touch screens, touch-sensitive panels, or steering wheel controls.

Intent

An intent in speech recognition refers to the intention of a user: what they mean. For example, when a user says “get me a large americano,” “I want a large americano,” or “can you please make me a large americano,” the intent is to order a cup of coffee.

Interactive Voice Response (IVR)

Interactive Voice Response (IVR) is an automated system that enables callers to interact with the host system through pre-recorded voice responses via a telephone keypad or speech recognition. For example, to reset their password, a user may receive a pre-recorded voice prompt to dial 4 (or another number), or a prompt asking them to say what they want to do. When a user says “reset my password” or dials 4, the call is routed to a menu or a specialist. IVR is used to minimize the number of human agents needed and to offer faster service to users.

K


Keyword Spotting

Keyword spotting, or hot word spotting, is a term for detecting a key phrase within a stream of audio. It is a simple yet powerful form of voice recognition. Voice activation or wake word detection (e.g. “Alexa” or “OK Google”) is a special form of keyword spotting.

A keyword spotter can also be used to create always-listening voice commands. The main benefit of using always-listening voice commands (versus follow-on commands) is user convenience, as it is not required to utter the wake phrase first. For example, a music player can immediately adjust the volume via always listening commands such as “volume up” or move within a playlist using “play next”.

L


Latency

In computing, latency refers to the delay for data to pass from one point to another. From a user experience point of view, it's friction: a user's action and the application's response to that action are not performed in a timely manner. For some applications, such as AR/VR, or autonomous devices such as cars, even milliseconds of latency hinder the experience or cause fatal results.

Leopard

Leopard is the local Speech-to-Text (STT) engine for audio files. Leopard doesn’t require internet connectivity to transcribe voice data to text. It outperforms competing cloud-based or offline Automatic Speech Recognition (ASR) solutions by wide margins. [See: Automatic Speech Recognition, Open-domain Large Vocabulary Speech Recognition]

Lexicon

Lexicon means a user's knowledge of words. The Economist found that most adult native speakers know 20,000-35,000 words. [4] The lexicon of Picovoice products is above 200,000 words in English. If a word that is present in an English dictionary is missed, reach out to the Picovoice team by creating a GitHub issue to help us improve our products.

Local Speech Recognition

Local speech recognition refers to the technology of processing voice data locally, on the device where trained AI models are deployed, without sending the data to the cloud. Since no internet connectivity is required to process voice data, it works offline and ensures privacy. Better performance and user experience are achieved by eliminating latency and connectivity issues and minimizing power consumption; on battery-powered devices, internet access (especially over LTE and Wi-Fi) is a major power drain. Local speech recognition also offers cost-effectiveness at scale since voice commands are not charged per API call. [See: Edge Voice AI]

M


Microphone array beamforming:

Beamforming processes signals from multiple omnidirectional sources (i.e. microphones) to focus on the most prominent sound (e.g. the user's voice) and disregard other sounds (e.g. noise).
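
Delay-and-sum is the simplest beamforming scheme: each channel is shifted by the delay at which the target source arrives so the source aligns across microphones, then the channels are averaged, reinforcing the aligned source while uncorrelated noise partially cancels. A toy sketch with integer-sample delays (real beamformers use fractional delays and adaptive weights):

```python
def delay_and_sum(channels, delays):
    """Average multiple microphone channels after shifting each by the
    (integer-sample) delay at which the target source arrives, so the
    source adds up coherently while uncorrelated noise partially cancels."""
    n = len(channels[0])
    out = []
    for i in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            j = i + d  # advance the channel by its arrival delay
            acc += ch[j] if 0 <= j < n else 0.0
        out.append(acc / len(channels))
    return out

mic0 = [0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0]  # reference channel
mic1 = [0.0] + mic0[:-1]                    # same source, arriving 1 sample later
print(delay_and_sum([mic0, mic1], [0, 1])[2:5])  # [1.0, 2.0, 3.0]
```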

Model:

A model in artificial intelligence refers to software that is trained on certain data to perform specific tasks such as recognizing speech, or patterns to achieve specific and pre-defined goals. Picovoice products are trained in real-world environments by using deep neural networks. Picovoice Console enables users to train their own voice models by using Picovoice products.

Multi-modality:

Multimodal interfaces offer users multiple interaction points. For example, you can activate a smart speaker by touching it or by using a wake word. Another multimodal example is voice typing: while dictating on your mobile phone, you can see what's written and select the text via touch or a voice command.

N


Natural Language Understanding (NLU):

Natural Language Understanding (NLU) is a subfield of Natural Language Processing (NLP). NLU deals with machine reading and comprehension in order to understand the intent, i.e. meaning, of what it reads.

Natural Language Processing (NLP)

Natural language processing is a field combining linguistics, computer science, and artificial intelligence that deals with the interactions between computers and humans via language inputs of text or audio.

Neural Networks:

Artificial Neural Networks (ANN), or neural networks for short, are networks of artificial neurons or nodes for solving artificial intelligence (AI) problems. Artificial neural networks are inspired by the biological neural networks that constitute the brain. Neural networks are trained by processing examples that contain a known "input" and "result". Neural networks "learn" by forming probability-weighted associations between inputs and results. After determining the error, i.e. the difference between the network's output (prediction) and the target output, the network adjusts the weighted associations accordingly.
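
The error-driven weight adjustment described above can be illustrated with a toy single neuron trained by stochastic gradient descent (a generic textbook sketch, not any Picovoice model):

```python
def train_neuron(samples, targets, lr=0.1, epochs=200):
    """Single linear neuron: predict, measure the error against the target,
    and nudge the weight and bias in the direction that reduces the error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            y = w * x + b        # prediction
            error = y - t        # difference from the target output
            w -= lr * error * x  # adjust the weight using the error
            b -= lr * error      # adjust the bias using the error
    return w, b

# Learn y = 2x + 1 from a few examples.
w, b = train_neuron([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```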

No-code development

A no-code software development platform, or no-code development for short, enables both developers and non-programmers to create applications through graphical user interfaces without writing code as in traditional computer programming. No-code platforms have become popular among both developers and non-developers.

Noise suppression:

Noise suppression is like a filter to remove distracting ambient noises such as keyboard typing or fan noise to create a better experience.

No-input Error:

No-input error is a type of error that occurs when an automatic speech recognition (ASR) system doesn’t detect the speech input although it exists.

No-Match Error:

No-match error is a type of error that occurs when an automatic speech recognition (ASR) system cannot match the speech input with the responses that it expects or knows.

O


Octopus

Octopus is Picovoice’s Speech-to-Index engine, which indexes speech directly without relying on a text representation. Octopus's acoustic-only approach boosts accuracy by removing the out-of-vocabulary limitation and eliminating the problem of competing hypotheses. An open-source framework for benchmarking different engines is available on the Picovoice GitHub. [See: Speech-to-Index Engine]

On-device processing:

On-device processing or on-device AI refers to performing inference with models directly on a device such as a mobile app, web browser, or an MCU. The machine learning model processes input data such as images or audio on-device rather than sending the data to the cloud and processing it there. [See: Edge Computing]

Open-domain Large Vocabulary Speech Recognition

Speech-to-Text (STT) is the technology required when a voice use case is not limited to a given domain or confined to a fixed set of commands. A familiar example is question answering, where the user can ask the voice assistant anything. Furthermore, open-domain large vocabulary speech recognition (STT) is useful when there is inherent interest in the transcription itself, such as in meetings, note-taking, and voice typing.

P


Phoneme:

A phoneme is the unit of sound. As words are written with letters, they are pronounced, or spoken, with phonemes. For example, the word "speech" has six letters and four phonemes (s-p-E-ch).

Porcupine

Porcupine, the wake word engine, trains and understands custom branded wake words and simple commands. Porcupine can be used for simple voice commands, such as “Open Sesame” to open a magical door, or to activate a dormant device to listen for further commands, as Alexa does. Porcupine offers a true hands-free experience by replacing the need for a push-to-talk (PTT) button and by enabling always-listening commands for rapid, frictionless interactions. Porcupine is the only wake word engine that works across multiple platforms, including web browsers, to offer a true hands-free experience at all touchpoints.

Prototyping:

Prototyping is a process where teams implement ideas into tangible forms at varying degrees of fidelity to capture concepts and test on users. Prototypes enable teams to refine ideas and products to release the right products to the right audience. Picovoice Console enables users to train models that are instantly available to be tested on the web. It helps with the voice prototyping process to build voice user interfaces that users want.

Push-to-talk

Push-to-talk, also known as press-to-transmit, is a method of initiating voice transmission by switching from the reception mode. Push-to-talk buttons were first used with radio devices such as walkie-talkies, then adopted by other form factors such as mobile phones and smart speakers. Some multi-modal solutions that work on mobile or web use soft push-to-talk buttons. In the last few years, in order to offer a truly hands-free experience to users, push-to-talk buttons have been replaced by wake words. However, due to the complexity of wake word detection technology, most voice vendors are not able to offer wake word engines that run across platforms. This slows the replacement of push-to-talk buttons, even for the use cases where users would benefit from wake words.

In order to replace push-to-talk buttons with wake words, a wake word model should run everywhere but the cloud. That is mainly because wake words have to respond quickly, and nobody wants everything they say transmitted to the cloud for an NLU engine to detect whether spoken utterances include the wake word. Moreover, organizations do not want to bear high cloud bills as a result of transmitting tremendous amounts of voice data. Building a wake word engine that runs on-device is not enough to replace push-to-talk buttons; the model sizes should be small, too. For example, adding a large AI model to a web browser adversely affects the load time and hence worsens the user experience. Porcupine is the first and still the only wake word engine that works across platforms to replace push-to-talk buttons. [See: Porcupine, Wake Word Engine]

R


Resource-efficient:

Resource-efficiency is a term for achieving maximum output with minimal resources. Resource-efficient describes products that consume minimal resources. Picovoice products consume minimal resources and prolong battery life.

Reverberation

Reverberation, in acoustics, is the persistence of sound, like an echo, after the sound is produced. Reverberation is created when a reflected sound builds up and then decays as it is absorbed by the surfaces in the space, such as furniture, or by the air.

Rhino

Rhino is a context-aware NLU Engine. It directly infers intent from spoken commands within a given context of interest, in real-time whereas most NLU engines infer the intent from text. By eliminating the need for text representation, Rhino responds faster and more accurately.

ROC:

A receiver operating characteristic (ROC) curve plots true positive rates (TPR) against false positive rates (FPR) at various sensitivity values. The larger the area under the curve, the better the accuracy of the product.
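
Each point on a ROC curve comes from fixing a decision threshold and counting detections; a minimal sketch with made-up detection scores:

```python
def roc_points(scores_positive, scores_negative, thresholds):
    """(FPR, TPR) pairs obtained by sweeping a decision threshold over
    detection scores for positive and negative examples."""
    points = []
    for t in thresholds:
        tpr = sum(s >= t for s in scores_positive) / len(scores_positive)
        fpr = sum(s >= t for s in scores_negative) / len(scores_negative)
        points.append((fpr, tpr))
    return points

# Scores from utterances that do (positive) and don't (negative) contain the keyword.
pts = roc_points([0.9, 0.8, 0.7, 0.4], [0.5, 0.3, 0.2, 0.1], [0.6, 0.35])
print(pts)  # [(0.0, 0.75), (0.25, 1.0)]
```

Lowering the threshold raises both TPR and FPR, which is exactly the miss-rate vs false-alarm trade-off controlled by a sensitivity parameter.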

S


Sensitivity:

Sensitivity is a value between 0.0 and 1.0: a value of 0.0 suppresses all detections, while a value of 1.0 suppresses none. A higher sensitivity value gives a lower miss rate at the expense of a higher false alarm rate. One should pick a sensitivity parameter that suits the application's requirements.

Shepherd:

Picovoice Shepherd is the first no-code platform for building voice interfaces on microcontrollers. Picovoice Shepherd accelerates prototyping, mitigates technical risks, and shortens time-to-market. Paired with Picovoice Console, users can deploy custom voice models onto microcontrollers instantly. [See: No-code development]

Siri:

Siri is Apple’s digital assistant technology and wake word (Hey Siri), released in 2011. Siri is capable of voice interactions and real-time information gathering by interacting with the cloud.

Slot:

A slot is a specific piece of information from an utterance that helps identify the intent. Slots are accessible from multiple intents in the same model. For example, a slot can contain different locations in a house, such as living room, kitchen, and bedroom, and could be accessed from different intents such as “turn on lights” and “turn off lights” in a model trained for a smart home application.

Speech recognition:

Common name for technology and methodologies that convert unstructured voice data into structured text. While speech data cannot be recognized by computers directly, trained AI algorithms recognize and transform them into text for human-computer interactions or analysis.

Speech-to-Text (STT):

Common name for technology and methodologies that convert unstructured voice data into structured text. [See: Automatic Speech Recognition, Open-domain Large Vocabulary Speech Recognition, Cheetah, Leopard]

Speech-to-Intent:

Speech-to-Intent is the technology that directly infers intent from spoken commands within a given context of interest. [See: Rhino]

Speech-to-Index:

Speech-to-Index is the technology that indexes speech directly without relying on a text representation. [See: Octopus]

T


Text-to-Speech (TTS):

Text-to-Speech (TTS) is a technology that converts text to artificially produced speech.

True negative:

True negative or true rejection indicates the rejection of a condition when it is not there.

True positive:

True positive or true acceptance indicates the acceptance of a condition when it is there.

U


Utterance:

An utterance means a spoken word or statement. It is a continuous piece of speech beginning and ending with a clear pause.

User-centred Design

User-centred design is a process to develop a product or service by keeping users at the core of the development process. Picovoice encourages organizations to follow user-centred design principles while developing voice products by offering an easy-to-use console that enables an iterative development process. [See: Design thinking]

V


Voice Activation

Voice activation or voice control allows the user to activate or control applications by simply using their voice, instead of using a touchscreen or buttons.

Voice Activity Detection

A voice activity detection engine detects human voice and distinguishes it from other sounds and noise. [See: Cobra]
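
A naive energy threshold illustrates the idea (engines like Cobra use trained models instead, since a fixed threshold fails in noise):

```python
def energy_vad(frame, threshold=0.01):
    """Flag a frame of audio samples as voice when its mean energy exceeds a
    fixed threshold. This is only a sketch: a fixed energy threshold cannot
    tell speech from loud noise, which is why real VAD engines use trained
    models."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold

print(energy_vad([0.0, 0.001, -0.002, 0.001]))  # False (near silence)
print(energy_vad([0.3, -0.4, 0.5, -0.2]))       # True  (active audio)
```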

Voice Biometrics:

Voice biometrics is the technology that identifies specific markers within audio data. It’s like an audio version of a fingerprint that is unique to the person’s identity. Voice Biometrics is used for Voice Identification and Voice Verification.

Voice Search:

Voice search may have two meanings: search within voice recordings, or voice-enabled search.

Voice-enabled search allows users to make search queries by using voice commands instead of typing. Voice search can be used to replace typed search or as a complement in multimodal interfaces. [See: Cheetah, Speech-to-Text Engine, Open-domain Large Vocabulary Speech Recognition, Rhino, Speech-to-Intent]

Voice search, i.e. search within voice recordings, requires unstructured voice data to be structured. One approach to structuring voice data is to transform voice into text by using an STT engine. [See: Speech-to-Text Engine, Automatic Speech Recognition (ASR), Open-domain Large Vocabulary Speech Recognition, Leopard]

Another, more accurate approach to structuring voice data is indexing it to make it searchable. [See: Octopus, Speech-to-Index Engine, Acoustic-only Approach]

Voice User Interface (VUI):

Voice User Interface (VUI) is an interface, just like a Graphical User Interface (GUI), that allows users to interact with machines via voice. The process of building a VUI shouldn’t be different from building a GUI: it should be an iterative process. The self-service Picovoice Console empowers user-centric iterative development.

W


Wake word:

Wake word, wake-up word, or hot word is a module that is used to trigger an action. Most of the time, wake words are taken for granted, as they are short and simple voice commands. However, they are always listening for a particular, very short utterance to activate dormant applications, and they must run on everything from a tiny MCU to a web browser. They should be trained on a diverse data set to work across genders, dialects, and accents. [See: Porcupine]

Word Error Rate

The word error rate (WER) is the most common metric for evaluating speech recognition engines. It is a measure of the average number of word errors, considering three error types: substitution (the reference word is replaced by another), insertion (a word is hypothesized that was not in the reference), and deletion (a word in the reference transcription is missed). The word error rate is defined as the sum of these errors divided by the number of reference words. Given this definition, the word error rate can exceed 100%. The WER is proportional to the correction cost.
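
WER can be computed with a word-level edit distance; a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + insertions + deletions) / number of reference words,
    computed via word-level edit (Levenshtein) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# Two substitutions ("on"->"off", "lights"->"light") over four reference words.
print(word_error_rate("turn on the lights", "turn off the light"))  # 0.5
```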

