Buyers and researchers use several factors to evaluate speech recognition models. Accuracy is generally the number one factor: Word Error Rate (WER) for speech-to-text and Command Acceptance Rate are the standard metrics for measuring it. However, the quality of a speech model also depends on other factors: response time, memory usage, power consumption, and noise resilience. Latency is a nonissue for recordings, since transcription happens after the fact. However, it is critical for real-time processing and communication, as high latency degrades both performance and user experience.

What is latency?

Latency describes the time it takes for a signal to be transmitted and processed. The speed of the hardware and software, the distance the signal must travel, and the amount of traffic on the network all contribute to latency.
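A simple way to quantify this end-to-end delay is to timestamp the moment the audio is available and the moment the transcript comes back. A minimal sketch in Python, where `transcribe` is a hypothetical stand-in for any speech-to-text call:

```python
import time

def transcribe(audio):
    # Hypothetical stand-in for a real speech-to-text call;
    # here we simulate 150 ms of processing time.
    time.sleep(0.150)
    return "set a timer for ten minutes"

def measure_latency(audio):
    # Latency = time between the audio being fully available
    # and the moment the transcript is returned.
    start = time.perf_counter()
    transcript = transcribe(audio)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return transcript, elapsed_ms

transcript, latency_ms = measure_latency(b"\x00" * 16000)
print(f"'{transcript}' in {latency_ms:.0f} ms")
```

With a real engine, the measured time would also include any network transmission, which is exactly the component discussed below.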

Why is latency important for speech recognition?

In speech recognition applications, latency is particularly noticeable as the delay between when a person stops speaking and when the speech model produces a transcript or inference. If the speech recognition system is located remotely from the user, as with cloud speech recognition APIs, network transmission times add delay. Traffic on the network also impacts latency, as other signals and data compete for bandwidth.

What are the consequences of latency in speech recognition?

Have you ever waited more than 10 seconds for Alexa to set a 10-second timer? The delay was due to latency. Alexa uses the conventional spoken language understanding pipeline: it records the voice data, sends it to Amazon’s cloud to convert it to text, and then passes the transcribed text to an NLU engine for intent detection. A similar problem can occur with real-time transcription. Human conversation has no such latency: when one person finishes a sentence, another continues with no delay. Thus, latency in speech recognition applications feels unnatural and diminishes the experience. For some applications, a few seconds of delay wipes out their value entirely. For real-time call center agent or salesperson coaching applications, delayed feedback is more than just a bad user experience, and for voice-controlled surgical robots, the consequences of latency are even more drastic.
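The cloud pipeline described above stacks several delays: uploading the audio, speech-to-text, intent detection, and the trip back to the device. A toy model makes the accumulation visible; the stage durations here are illustrative assumptions, not measurements:

```python
# Toy model of the conventional cloud spoken-language-understanding
# pipeline; each stage's duration (in ms) is an illustrative assumption.
stages_ms = {
    "upload audio to cloud": 120,
    "speech-to-text": 300,
    "intent detection (NLU)": 80,
    "response back to device": 60,
}

total_ms = sum(stages_ms.values())
for stage, ms in stages_ms.items():
    print(f"{stage:>28}: {ms:4d} ms")
print(f"{'total':>28}: {total_ms:4d} ms")
```

Because the stages run in sequence, the user feels the sum of all of them, and the two network stages are also the most variable.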

How to reduce latency in speech recognition applications?

Remember the root causes of latency?

  1. the speed of the hardware and software
  2. the distance the signal must travel
  3. the amount of traffic on the network

Choosing resource-efficient software is as critical as choosing powerful hardware, if not more so. Picovoice’s speech-to-text engines are about 20 MB, whereas alternatives with similar accuracy are gigabytes in size.

Moving voice data near the compute, or the compute near the data, reduces the signal’s travel distance. However, large speech models can only run on servers, i.e. in the cloud; otherwise, the hardware wouldn’t be powerful enough to process voice data fast. Yet voice data isn’t generated or stored only on servers. Many devices, such as mobile phones, desktop computers, and even tiny microcontrollers, generate voice data. Thus, we need to move computing power near the data. This is called edge computing or on-device processing.
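The effect of removing the network hop can be illustrated with the same kind of toy model; all figures below are assumed for illustration, not measured:

```python
# Illustrative comparison: on-device processing removes the network
# round trip entirely. All durations are assumptions, not measurements.
NETWORK_ROUND_TRIP_MS = 180   # upload audio + download result
CLOUD_INFERENCE_MS = 100      # fast inference on powerful servers
EDGE_INFERENCE_MS = 140       # slower inference on modest hardware

cloud_latency = NETWORK_ROUND_TRIP_MS + CLOUD_INFERENCE_MS
edge_latency = EDGE_INFERENCE_MS  # no network hop at all

print(f"cloud: {cloud_latency} ms, edge: {edge_latency} ms")
```

Even when the edge device is slower at raw inference, eliminating the round trip can yield lower total latency, and the edge figure is also free of network jitter.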

Picovoice is the first and only ubiquitous on-device voice recognition company. Picovoice technology brings computing power to the data, eliminating the travel distance.

If promptness, i.e. low latency, is crucial for your voice product, try Picovoice Cheetah Streaming Speech-to-Text for real-time transcription, Rhino Speech-to-Intent for intent detection, and Koala for real-time noise suppression. All offer predictable and reliable response times. Eagle Speaker Recognition is optimized for real-time use, capturing speaker changes immediately. Orca Text-to-Speech is the most efficient engine that generates voice locally on-device, eliminating network latency.

Start Building