Noise Suppression Benchmark
This benchmark compares Picovoice Koala with Mozilla RNNoise, a popular noise suppression engine. Both Koala and RNNoise are lightweight, platform-independent SDKs that process streams of audio.
Methodology
Noisy Speech Corpus
We use the synthetic test set from the first installment of the Microsoft DNS Challenge at Interspeech 2020, which consists of 150 noisy test files and their clean reference files. The original data is mixed at a range of signal-to-noise ratio (SNR) levels. To investigate performance at specific SNRs, we also separate the speech from the noise and mix them back together at a custom SNR.
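As a rough illustration of the re-mixing step, the sketch below scales a noise signal so that the mixture reaches a requested SNR. The function names (`mix_at_snr`, `measure_snr_db`) and the use of plain Python lists of samples are assumptions made for this example; the benchmark code itself may implement the step differently.

```python
import math


def _power(samples):
    """Mean power of a list of float samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Return speech + gain * noise, with gain chosen to hit `snr_db`.

    Hypothetical helper: assumes both inputs are equal-length lists of
    float samples at the same sample rate.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = _power(speech)
    noise_power = _power(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measure_snr_db(speech, noise):
    """SNR in dB between a speech signal and a noise signal."""
    return 10 * math.log10(_power(speech) / _power(noise))
```

Subtracting the clean speech from the mixture recovers the scaled noise, so `measure_snr_db(speech, residual)` can be used to check that the mixture has the requested SNR.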
Metrics
Short-Term Objective Intelligibility
The performance of a noise suppression engine can be measured in multiple ways, including Mean Opinion Score (MOS) from listening experiments and objective approximations of MOS such as POLQA or PESQ. To make the benchmark as easy to reproduce as possible, we select the Short-Term Objective Intelligibility (STOI) metric, which rates intelligibility on a scale from 0 to 1, where 1 is best. By definition, the clean reference audio always has a perfect score of 1.
For a concise visualization, we report the difference between the STOI score of the clean reference, which is always 1, and the STOI score of the denoised audio. This gives the STOI distance to the clean speech on a scale from 0 to 1, where smaller values are better.
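The distance described here can be sketched in a few lines. The example assumes the per-file STOI scores have already been computed by some STOI implementation (for instance the `pystoi` package, which is an assumption on our part and not named by the benchmark); the function names are hypothetical.

```python
from statistics import mean


def stoi_distance(stoi_denoised, stoi_clean=1.0):
    """Distance of a denoised file's STOI score from the clean reference.

    The clean reference scores 1 by definition, so the default distance
    is simply 1 - stoi_denoised.
    """
    if not 0.0 <= stoi_denoised <= 1.0:
        raise ValueError("STOI scores are expected to lie in [0, 1]")
    return stoi_clean - stoi_denoised


def average_stoi_distance(scores):
    """Average STOI distance over a collection of per-file STOI scores."""
    return mean(stoi_distance(score) for score in scores)


# Made-up scores for three denoised files, for illustration only.
example_scores = [0.90, 0.80, 0.85]
example_average = average_stoi_distance(example_scores)
```

Averaging per-file distances in this way yields one number per engine, which is the kind of summary a bar chart of results would show.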
Computational Cost and Real-Time Factor
The real-time factor is the pure processing runtime of the noise suppression algorithm divided by the length of the processed audio. The smaller this value, the fewer resources are required to run the algorithm. To enhance a stream of audio in real time, this factor must be well below 1, which avoids buffering while leaving enough resources for other applications.
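A minimal sketch of how such a measurement could be taken is shown below. The `process_frame` callback stands in for an engine's per-frame processing call, and the frame length and sample rate are supplied by the caller; all of these names are hypothetical and do not reflect either SDK's actual API.

```python
import time


def real_time_factor(process_frame, frames, frame_length, sample_rate,
                     clock=time.perf_counter):
    """Processing time divided by audio duration for a list of frames.

    `process_frame` is a hypothetical stand-in for an engine's
    frame-by-frame processing function. `clock` can be overridden,
    which makes the calculation easy to check with a fake timer.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    start = clock()
    for frame in frames:
        process_frame(frame)
    elapsed_seconds = clock() - start
    audio_seconds = len(frames) * frame_length / sample_rate
    return elapsed_seconds / audio_seconds
```

Only the processing loop is timed, so file loading and decoding are excluded from the measurement. A result of 0.01, for example, would mean that one second of audio takes about 10 ms to process.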
Results
Intelligibility (STOI) Distance to Clean Speech
The figure below shows the average performance of each engine on the original pre-mixed dataset. The smaller the value, the closer the output is to clean speech in terms of intelligibility.
A more detailed view can be obtained by re-mixing the dataset at a specific noise level:
Real-Time Factor
We measure the run times of both algorithms on an Ubuntu 20.04 machine with an Intel CPU (Intel(R) Core(TM) i5-9400F CPU @ 2.90GHz), 64 GB of RAM, and NVMe storage, using a single thread.
For both engines, the real-time factor is independent of the processed data.
Usage
The code used to create this benchmark is available on GitHub under the permissive Apache 2.0 license. Detailed instructions for benchmarking individual engines are provided in the following documents: