What to know about Speech Enhancement

Speech Enhancement is a technology that improves the clarity and intelligibility of speech in noisy environments by suppressing the noise. For this reason, the terms Noise Suppression, Noise Cancellation, and Speech Enhancement are often used interchangeably.

What does Speech Enhancement do?

Speech Enhancement helps people communicate more effectively by suppressing background noise. It is useful wherever background noise interferes with spoken communication, such as in crowded public spaces, busy call centers, or online meetings. A Speech Enhancement engine is therefore an essential building block for developers of communications solutions.

There are three crucial things that one should know about Speech Enhancement.

1. Latency must be minimal for real-time speech enhancement.

Human hearing is highly sensitive to timing: one study reported that listeners can detect delays as short as half a millisecond. For conversations, ITU-T Recommendation G.114 advises keeping one-way delay below about 150 ms for most applications and treats 400 ms as the upper limit. Beyond that, we start talking over each other, lose attention, and become annoyed. End-to-end latency has three components: network, compute, and codec. Codec latency depends on the codec and its mode, but modern codecs have become quite efficient. Network latency usually has the most substantial impact on the experience.

It should run on the device for zero network latency: running on the edge eliminates network and connectivity-related latency. Network latency, congestion, outages, and throttling can degrade cloud-dependent applications and hinder the experience. Cloud processing, with its high and unpredictable network latency, is therefore not a good fit.
It should be computationally efficient for minimal compute latency: small, efficient models with modest resource requirements process audio with minimal latency and can run on many platforms.
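The latency breakdown above can be sketched as a simple budget calculation. This is only an illustration: the millisecond figures, the 150 ms limit, the 512-sample frame, and the 16 kHz sample rate are assumed values chosen for the example, not measurements of any particular engine or network.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    """End-to-end latency split into the three components named above."""
    network_ms: float
    compute_ms: float
    codec_ms: float

    def total_ms(self) -> float:
        return self.network_ms + self.compute_ms + self.codec_ms

    def within(self, limit_ms: float) -> bool:
        return self.total_ms() <= limit_ms


def frame_duration_ms(frame_length: int, sample_rate: int) -> float:
    """Audio covered by one frame; an engine must buffer at least this much."""
    return 1000.0 * frame_length / sample_rate


def real_time_factor(processing_ms: float, frame_ms: float) -> float:
    """Below 1.0 means a frame is processed faster than it is captured."""
    return processing_ms / frame_ms


# Hypothetical numbers for a cloud pipeline and an on-device pipeline.
cloud = LatencyBudget(network_ms=120.0, compute_ms=20.0, codec_ms=25.0)
on_device = LatencyBudget(network_ms=0.0, compute_ms=20.0, codec_ms=25.0)

LIMIT_MS = 150.0  # assumed conversational budget for this example
print(cloud.total_ms(), cloud.within(LIMIT_MS))
print(on_device.total_ms(), on_device.within(LIMIT_MS))

frame_ms = frame_duration_ms(frame_length=512, sample_rate=16000)
print(frame_ms, real_time_factor(processing_ms=4.0, frame_ms=frame_ms))
```

With these assumed figures, removing the network term is what brings the total under the budget, which is the argument for on-device processing. The real-time factor shows how much of each frame's duration the compute step consumes.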

2. It must be effective against both stationary and non-stationary noises.

Let's distinguish stationary and non-stationary noise first. Stationary noise is roughly constant and predictable, such as the hum of a fan or steady wind. Non-stationary noise, such as traffic with horns, sirens, or keyboard typing, consists of short, loud sounds with complicated and irregular patterns that are hard to tell apart from speech. A high-quality Speech Enhancement engine should be able to remove non-stationary noise as well.
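One rough way to see the difference is to measure how much the noise level changes from frame to frame. The sketch below does this on synthetic noise. The signals, the 256-sample frame, and the function names are assumptions made for the illustration, and real engines analyze noise per frequency band rather than by overall level alone.

```python
import math
import random
import statistics

FRAME_LENGTH = 256  # samples per frame (assumed for this example)


def frame_rms(samples, frame_length=FRAME_LENGTH):
    """Root-mean-square level of each complete frame."""
    levels = []
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        frame = samples[start:start + frame_length]
        levels.append(math.sqrt(sum(x * x for x in frame) / frame_length))
    return levels


def level_variation(samples):
    """Coefficient of variation of frame levels: near 0 for steady noise."""
    levels = frame_rms(samples)
    return statistics.pstdev(levels) / statistics.mean(levels)


def synth_noise(num_frames, burst_every=None, seed=0):
    """Gaussian noise at a low level, optionally with periodic loud bursts."""
    rng = random.Random(seed)
    samples = []
    for n in range(num_frames):
        is_burst = burst_every is not None and n % burst_every == 0
        sigma = 1.0 if is_burst else 0.1
        samples.extend(rng.gauss(0.0, sigma) for _ in range(FRAME_LENGTH))
    return samples


steady = synth_noise(num_frames=60)                  # fan-like hum
bursty = synth_noise(num_frames=60, burst_every=10)  # horn-like bursts

print("steady:", round(level_variation(steady), 3))
print("bursty:", round(level_variation(bursty), 3))
```

The steady signal yields a small variation value, while the bursty one yields a much larger value. A noise estimate learned from earlier frames stays valid for the first kind of noise and is quickly outdated for the second.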

3. There are many solutions for end-users but fewer for developers.

Application-specific solutions: Microsoft Teams, Zoom;
Hardware-specific solutions: NVIDIA RTX, AMD;
Platform-independent solutions: Krisp, Audacity;
Engines: Krisp for Developers, open-source Mozilla RNNoise.

Traditional digital signal processing models are small and efficient (low latency) but offer poor quality. Deep learning models generally offer higher quality, but they tend to be large (higher latency) and support fewer platforms. Building any technology is easier when it targets a single platform or a narrow requirement (for example, offline-only processing). However, developers work on different platforms, use various SDKs, and have diverse needs. The trade-off between quality and latency limits developers' options.
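To make the trade-off concrete, here is a minimal sketch of a noise gate, one of the simplest traditional signal processing techniques. It is not how any of the products listed above work. The parameter names and default values are hypothetical, and the gate assumes the first few frames contain noise only.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def noise_gate(frames, noise_frames=5, margin=2.0, attenuation=0.1):
    """Attenuate frames whose level is close to the estimated noise floor.

    The floor is the average level of the first `noise_frames` frames,
    which are assumed to contain no speech.
    """
    floor = sum(rms(f) for f in frames[:noise_frames]) / noise_frames
    threshold = floor * margin
    gated = []
    for frame in frames:
        gain = 1.0 if rms(frame) > threshold else attenuation
        gated.append([x * gain for x in frame])
    return gated


# Six quiet "noise" frames followed by one loud frame.
quiet = [0.01, -0.01] * 4
loud = [0.5, -0.5] * 4
output = noise_gate([quiet] * 6 + [loud])
```

The gate costs only a few arithmetic operations per sample, so it adds almost no compute latency. It also passes every loud frame unchanged, whether that frame holds speech or a car horn, which illustrates why such simple methods struggle with non-stationary noise.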

Picovoice Koala Noise Suppression provides high-quality noise suppression in real time with minimal latency and runs across platforms. Sounds too good to be true? Test it yourself. Picovoice's Free Plan does not require credit card information or any commitment.
