Speech Enhancement is a technology that improves the clarity and intelligibility of speech signals in noisy environments by suppressing the noise. Thus, the terms Noise Suppression, Noise Cancellation, and Speech Enhancement are used interchangeably.
What does Speech Enhancement do?
Speech Enhancement helps people communicate more effectively and efficiently by muting the background noise. It is useful when background noise can interfere with speech communication, such as in crowded public spaces, busy call centers, or online meetings. Thus, a Speech Enhancement engine is necessary for developers building communications solutions.
There are three crucial things that one should know about Speech Enhancement:
1. Latency must be minimal for real-time speech enhancement.
A recent study found that the human ear can detect a delay as short as half a millisecond. ITU-T G.114 indicates that one-way latency shouldn’t exceed about 200 ms. Otherwise, we start talking over each other, lose attention, and become annoyed. End-to-end latency consists of three factors: network, compute, and codec. Codec latency depends on the codec and its mode, yet modern codecs have become quite efficient. Network latency has the most substantial impact on the experience.
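As a rough illustration of the three-part breakdown above, the sketch below sums network, compute, and codec delays and compares the total with a target. Every number is a made-up example, and the names (`LatencyBudget`, `total_ms`, `within_budget`, `ASSUMED_BUDGET_MS`) are hypothetical helpers introduced only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """One-way delay contributions in milliseconds (illustrative values only)."""
    network_ms: float
    compute_ms: float
    codec_ms: float

    def total_ms(self) -> float:
        # End-to-end latency is modeled here as the plain sum of the three factors.
        return self.network_ms + self.compute_ms + self.codec_ms

    def within_budget(self, budget_ms: float) -> bool:
        return self.total_ms() <= budget_ms


# Hypothetical figures: enhancement behind a cloud hop versus on the device.
cloud = LatencyBudget(network_ms=150.0, compute_ms=30.0, codec_ms=25.0)
on_device = LatencyBudget(network_ms=0.0, compute_ms=30.0, codec_ms=25.0)

ASSUMED_BUDGET_MS = 200.0  # assumption for this sketch; substitute your own target

for name, path in (("cloud", cloud), ("on-device", on_device)):
    print(name, path.total_ms(), path.within_budget(ASSUMED_BUDGET_MS))
```

With these assumed figures, the network term alone decides whether the path fits the budget, which is the point the breakdown makes.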
- It should run on the device for zero network latency: Running on the edge eliminates network and connectivity-related issues. Latency, congestion, outages, and throttling can affect the performance of cloud-dependent applications and hinder the experience. Thus, cloud computing, with its high and unpredictable network latency, is not a good fit.
- It should be computationally efficient for minimal compute latency: Small and efficient models with minimal resource requirements process data with minimal latency and can run on many platforms.
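One common way to check whether an engine keeps up with live audio is to compare the time spent processing a frame with the amount of audio that frame covers. The sketch below does this for a placeholder function. The frame size, the sample rate, and `passthrough` (a stand-in for a real enhancement engine) are assumptions chosen for illustration.

```python
import time


def frame_duration_ms(frame_length: int, sample_rate: int) -> float:
    """Duration of audio covered by one frame, in milliseconds."""
    return 1000.0 * frame_length / sample_rate


def real_time_factor(processing_ms: float, frame_ms: float) -> float:
    """Ratio of compute time to audio time; below 1.0 keeps up with real time."""
    return processing_ms / frame_ms


def passthrough(frame):
    # Hypothetical stand-in for a real enhancement engine: copies the frame.
    return list(frame)


def measure_ms(process, frame, repeats: int = 100) -> float:
    """Average wall-clock time per call of `process`, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        process(frame)
    return 1000.0 * (time.perf_counter() - start) / repeats


FRAME_LENGTH = 512      # assumed frame size in samples
SAMPLE_RATE = 16_000    # assumed sample rate in Hz

frame = [0] * FRAME_LENGTH
frame_ms = frame_duration_ms(FRAME_LENGTH, SAMPLE_RATE)
rtf = real_time_factor(measure_ms(passthrough, frame), frame_ms)
print(f"frame covers {frame_ms:.1f} ms of audio, real-time factor {rtf:.4f}")
```

Swapping `passthrough` for an actual engine call gives a quick estimate of compute latency on a given device. Frame-based engines also add at least one frame of buffering delay on top of the processing time.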
2. It must be effective against both stationary and non-stationary noises.
Let's distinguish stationary and non-stationary noise first. Stationary noise is constant and predictable, such as steady wind. Non-stationary noise, such as traffic with horns and sirens or keyboard typing, consists of short, loud sounds with complicated and irregular patterns that are hard to differentiate from speech. A high-quality Speech Enhancement engine should be able to remove non-stationary noise as well.
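One simple way to see the difference is to look at how much the noise level changes from frame to frame. The sketch below builds two synthetic signals, a steady one and a bursty one, and reports the coefficient of variation of their frame levels. The signals, the frame size, and the helper names (`frame_rms`, `level_variation`) are assumptions made for this illustration, and this statistic is only one crude indicator of stationarity.

```python
import math
import random
import statistics


def frame_rms(samples, frame_length):
    """Root-mean-square level of each full, non-overlapping frame."""
    levels = []
    for start in range(0, len(samples) - frame_length + 1, frame_length):
        frame = samples[start:start + frame_length]
        levels.append(math.sqrt(sum(s * s for s in frame) / frame_length))
    return levels


def level_variation(samples, frame_length=256):
    """Coefficient of variation of frame levels: near 0 means a steady level."""
    levels = frame_rms(samples, frame_length)
    return statistics.pstdev(levels) / statistics.fmean(levels)


rng = random.Random(0)

# Synthetic stationary noise: Gaussian samples with constant variance.
steady = [rng.gauss(0.0, 0.1) for _ in range(256 * 40)]

# Synthetic non-stationary noise: quiet background with a loud burst in
# every tenth frame (a crude stand-in for a horn or a key click).
bursty = []
for frame_index in range(40):
    scale = 0.8 if frame_index % 10 == 0 else 0.02
    bursty.extend(rng.gauss(0.0, scale) for _ in range(256))

print(f"steady: {level_variation(steady):.2f}")
print(f"bursty: {level_variation(bursty):.2f}")
```

The steady signal yields a value close to zero, while the bursty one yields a much larger value. That unpredictability is what makes non-stationary noise harder to estimate and remove.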
3. There are many solutions for end-users but fewer for developers.
- Application-specific solutions: Microsoft Teams, Zoom;
- Hardware-specific solutions: NVIDIA RTX, AMD;
- Platform-independent solutions: Krisp, Audacity;
- Engines: Krisp for Developers, open-source Mozilla RNNoise.
Traditional digital signal processing models are small and efficient (low latency) but have poor quality. Deep learning models generally offer higher quality but are large (higher latency) and have limited platform support. Building any technology is easier when there is a specific platform or requirement (e.g., offline only). However, developers work on different platforms, use various SDKs, and have diverse needs. The trade-off between latency and quality limits developers’ options.
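To make the trade-off concrete, here is a deliberately simple energy-based noise gate, one of the cheapest classical techniques. It is an example picked for this sketch, not the method used by any product named above. It estimates a noise floor from the first few frames, which are assumed to contain no speech, and mutes every frame that is not clearly louder. The function names, the `margin` value, and the toy frames are all hypothetical.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def noise_gate(frames, noise_frames=3, margin=2.0):
    """Mute frames whose level is below `margin` times the estimated noise floor.

    The floor is the mean level of the first `noise_frames` frames, which are
    assumed to contain noise only.
    """
    floor = sum(rms(f) for f in frames[:noise_frames]) / noise_frames
    threshold = margin * floor
    return [f if rms(f) > threshold else [0.0] * len(f) for f in frames]


# Toy constant-amplitude frames: hum, hum, hum, speech, hum, horn.
hum, speech, horn = [0.01] * 4, [0.3] * 4, [0.9] * 4
gated = noise_gate([hum, hum, hum, speech, hum, horn])

for frame in gated:
    print(frame)
```

The gate costs only a few operations per sample, so it adds almost no compute latency. It silences the steady hum, but the horn frame passes through untouched because it is louder than the threshold, and hum underneath the speech frame is not removed at all. Quality limits of this kind are what push designs toward larger learned models.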
Picovoice Koala Noise Suppression provides high-quality noise suppression in real time with minimal latency and runs across platforms. Sounds too good to be true? Test it yourself. Picovoice’s Free Plan does not require credit card information or any commitment.