Speech Enhancement is a technology that improves the clarity and intelligibility of speech signals in noisy environments by suppressing the noise. For this reason, the terms Noise Suppression, Noise Cancellation, and Speech Enhancement are often used interchangeably.
What does Speech Enhancement do?
Speech Enhancement helps people communicate more effectively and efficiently by muting background noise. It is useful wherever background noise can interfere with speech communication, such as in crowded public spaces, busy call centers, or online meetings. Thus, a Speech Enhancement engine is necessary for developers building communications solutions.
There are three crucial things to know about Speech Enhancement.
1. Latency must be minimal for real-time speech enhancement.
A recent study found that the human ear can detect a delay in sound as short as half a millisecond. ITU-T G.114 states that latency shouldn't exceed 200 ms. Otherwise, we start talking over each other, lose focus, and become annoyed. End-to-end latency consists of three components: network, compute, and codec. Codec latency depends on the codec and its mode, yet modern codecs have become quite efficient. Network latency has the most substantial impact on the experience.
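The three-part breakdown above can be sketched as simple arithmetic. This is a minimal illustration, not a measurement: the `LatencyBudget` class and every millisecond value are hypothetical, chosen only to show how the network component can dominate the total.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    """Latency components in milliseconds (hypothetical values only)."""
    network_ms: float
    compute_ms: float
    codec_ms: float

    def total_ms(self) -> float:
        # End-to-end latency is modeled as the plain sum of the three parts.
        return self.network_ms + self.compute_ms + self.codec_ms

    def within(self, limit_ms: float) -> bool:
        return self.total_ms() <= limit_ms


# Assumed numbers: processing behind a network hop versus on the device.
cloud = LatencyBudget(network_ms=180.0, compute_ms=15.0, codec_ms=25.0)
on_device = LatencyBudget(network_ms=0.0, compute_ms=15.0, codec_ms=25.0)

LIMIT_MS = 200.0  # the limit quoted in the text
for name, budget in (("cloud", cloud), ("on-device", on_device)):
    print(f"{name}: {budget.total_ms():.0f} ms, within limit: {budget.within(LIMIT_MS)}")
```

With these assumed values, only the network term differs between the two setups, which is why it decides whether the total stays under the limit.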
- It should run on the device for zero network latency: Running on the edge eliminates network and connectivity-related latency. Network latency, congestion, outages, and throttling can affect the performance of cloud-dependent applications and hinder the experience. Thus, cloud computing, with its high and unpredictable network latency, is not a good fit.
- It should be computationally efficient for minimal compute latency: Small and efficient models with minimal resource requirements process data with minimal latency and can run on many platforms.
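One common way to reason about compute latency is to compare how long a frame takes to process with how much audio the frame contains. The sketch below assumes a 16 kHz sample rate and 256-sample frames, and uses a placeholder `passthrough` function instead of a real enhancement engine; all of these names and values are illustrative assumptions.

```python
import time


def frame_duration_ms(frame_length: int, sample_rate: int) -> float:
    """Time span of audio covered by one frame, in milliseconds."""
    return 1000.0 * frame_length / sample_rate


def real_time_factor(process_frame, frame, sample_rate: int, repeats: int = 100) -> float:
    """Average processing time per frame divided by the frame's duration.

    Values below 1.0 mean the processor keeps up with real time.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        process_frame(frame)
    elapsed_ms = 1000.0 * (time.perf_counter() - start) / repeats
    return elapsed_ms / frame_duration_ms(len(frame), sample_rate)


def passthrough(frame):
    # Placeholder for a real enhancement step; it only copies the samples.
    return list(frame)


SAMPLE_RATE = 16000   # assumed sample rate in Hz
FRAME_LENGTH = 256    # assumed frame length in samples
frame = [0.0] * FRAME_LENGTH

print(f"frame duration: {frame_duration_ms(FRAME_LENGTH, SAMPLE_RATE):.1f} ms")
print(f"real-time factor: {real_time_factor(passthrough, frame, SAMPLE_RATE):.4f}")
```

A frame-based processor has to wait for a full frame before it can start, so the frame duration acts as a floor on the added delay, and the real-time factor shows how much headroom remains on a given device.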
2. It must be effective against both stationary and non-stationary noises.
Let's distinguish stationary and non-stationary noise first. Stationary noise is constant and predictable, such as a steady wind. Non-stationary noise, such as traffic with horns, sirens, or keyboard typing, consists of short, loud sounds with complicated and irregular patterns that are hard to separate from speech. A high-quality Speech Enhancement engine should be able to remove non-stationary noise as well.
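The difference between the two noise types can be illustrated by looking at how much the energy changes from frame to frame. The sketch below is a crude heuristic on synthetic signals, not a formal stationarity test; the function names, frame length, and noise levels are all assumptions made for illustration.

```python
import random
import statistics


def frame_energies(samples, frame_length):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_length]) / frame_length
        for i in range(0, len(samples) - frame_length + 1, frame_length)
    ]


def energy_variation(samples, frame_length=256):
    """Coefficient of variation of frame energy: std / mean.

    A low value suggests a steady level; a high value suggests bursts.
    """
    energies = frame_energies(samples, frame_length)
    return statistics.pstdev(energies) / statistics.mean(energies)


rng = random.Random(0)  # fixed seed so the run is repeatable
num_frames, frame_length = 100, 256

# Synthetic steady noise: constant-variance Gaussian samples.
steady = [rng.gauss(0.0, 0.1) for _ in range(num_frames * frame_length)]

# Synthetic bursty noise: quiet background with a loud burst every tenth frame.
bursty = []
for frame_index in range(num_frames):
    sigma = 1.0 if frame_index % 10 == 0 else 0.01
    bursty.extend(rng.gauss(0.0, sigma) for _ in range(frame_length))

print(f"steady noise variation: {energy_variation(steady):.2f}")
print(f"bursty noise variation: {energy_variation(bursty):.2f}")
```

The steady signal yields a small variation because every frame looks alike, while the bursty signal yields a much larger one. That unpredictability is what makes non-stationary noise harder to model and remove.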
3. There are many solutions for end-users but fewer for developers.
- Application-specific solutions: Microsoft Teams, Zoom;
- Hardware-specific solutions: NVIDIA RTX, AMD;
- Platform-independent solutions: Krisp, Audacity;
- Engines: Krisp for Developers, open-source Mozilla RNNoise.
Traditional digital signal processing models are small and efficient (low latency) but offer poor quality. Deep learning models generally offer higher quality but come with large models (higher latency) and limited platform support. Building any technology is easier when there is a specific platform or requirement (for example, offline only). However, developers work on different platforms, use various SDKs, and have diverse needs. The trade-off between quality and latency limits developers' options.
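To make the trade-off concrete, here is a minimal sketch of one very simple classical technique, an energy-based noise gate. It is chosen only as an illustration and is not necessarily the kind of DSP model the paragraph above has in mind; the frame length, threshold, and toy signal are all assumptions.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def noise_gate(samples, frame_length=160, threshold=0.02):
    """Silence every frame whose RMS level falls below the threshold.

    Cheap and low-latency, but noise that overlaps speech passes through.
    """
    output = []
    for start in range(0, len(samples), frame_length):
        frame = samples[start:start + frame_length]
        if rms(frame) < threshold:
            output.extend([0.0] * len(frame))  # treated as noise only
        else:
            output.extend(frame)               # treated as speech, kept as-is
    return output


# Toy signal: one quiet "noise" frame, then one loud "speech plus noise" frame.
quiet = [0.01 * math.sin(0.3 * n) for n in range(160)]
loud = [0.5 * math.sin(0.05 * n) + 0.01 * math.sin(0.3 * n) for n in range(160)]
gated = noise_gate(quiet + loud)

print("quiet frame silenced:", all(x == 0.0 for x in gated[:160]))
print("loud frame unchanged:", gated[160:] == loud)
```

The gate costs only a few arithmetic operations per sample, which illustrates the low-latency side of the trade-off. The loud frame leaves the gate with its noise component intact, which illustrates the quality limitation.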
Picovoice Koala Noise Suppression provides high-quality noise suppression in real time with minimal latency and runs across platforms. Sounds too good to be true? Test it yourself. Picovoice's Free Plan does not require credit card information or any commitment.