Noise suppression and speech enhancement are often used interchangeably. Noise suppression engines suppress noise, as the name suggests, resulting in enhanced speech. Researchers use speech quality and speech intelligibility metrics to measure speech enhancement. This article covers the best-known and most widely used speech quality metrics: Mean Opinion Score (MOS), Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Analysis (POLQA), 3-fold Quality Evaluation of Speech in Telecommunications (3QUEST), and Non-intrusive Objective Speech Quality Assessment (NISQA).
What’s the Mean Opinion Score (MOS)?
MOS is a commonly used but subjective method for evaluating speech quality. The approach is straightforward: gather a diverse group of people (test subjects), ask them to listen to recordings, and have them rate the quality on a scale of 1 to 5. Then average the scores, hence the name mean opinion score. ITU offers recommendations for designing, running, and reporting the experiments to minimize subjectivity. However, it is still a subjective assessment of sound quality (or impairment). Thus, it’s critical to understand the experiment conditions before interpreting the results, as they’re easy to manipulate.
MOS is time-consuming and expensive. Minimizing bias and achieving more reliable results requires more listeners (test subjects) from diverse backgrounds, which takes time and effort. Thus, automated tests that serve as computational proxies for MOS are also popular.
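The averaging step described above can be sketched in a few lines of Python. This is a minimal illustration, not a procedure taken from the ITU recommendations: the function name `mean_opinion_score`, the normal-approximation confidence interval (z = 1.96), and the ratings themselves are assumptions made for the example.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate a confidence interval")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Assumption: normal approximation with z = 1.96; small panels may warrant
    # a Student's t interval instead.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Made-up ratings from eight hypothetical listeners, for illustration only.
ratings = [4, 3, 5, 4, 4, 3, 4, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f} (n = {len(ratings)})")
```

Reporting the interval alongside the mean makes the cost trade-off visible: with the same spread of opinions, the interval only narrows as the number of listeners grows.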
What’s Perceptual Evaluation of Speech Quality (PESQ)?
PESQ is a commonly used metric whose source code is available on the ITU website. It aims to measure speech quality after network and codec distortions, not noise suppression quality directly. More importantly, it dates from the early 2000s and does not reflect the state of the art. Therefore, ITU replaced it with POLQA.
What’s Perceptual Objective Listening Quality Analysis (POLQA)?
ITU released POLQA as the successor to PESQ. Similar to PESQ, POLQA measures quality after network and codec distortions. Both PESQ and POLQA produce a computational approximation of MOS.
What’s 3-fold Quality Evaluation of Speech in Telecommunications (3QUEST)?
3QUEST is another metric recommended by ITU. It is designed to measure the background noise in a transmitted signal. It also builds on the subjective MOS scale, rating not just the speech but also the noise and the overall quality of the sound. S-MOS refers to Speech-MOS, N-MOS to Noise-MOS, and G-MOS to General-MOS or Global-MOS.
What’s Non-intrusive Objective Speech Quality Assessment (NISQA)?
NISQA provides predictions on overall speech quality and speech quality dimensions: Noisiness, Coloration, Discontinuity, and Loudness. Unlike the intrusive models mentioned above, NISQA is a non-intrusive method, meaning it does not require clean reference data for the calculation.
There is no quality metric widely accepted by the industry or researchers to evaluate the performance of speech enhancement engines because each quality metric has challenges with specific distortion types. For example, PESQ correlates with distortions introduced by telecommunication networks, making it a reasonable choice to measure the speech quality of speech enhancement engines. Yet it predicts the effects of background distortion less reliably than those of signal distortion. Thus, it’s crucial to understand the shortcomings of each measure to make a well-informed decision. Besides speech quality measures, researchers use speech intelligibility metrics such as Speech Intelligibility Index (SII), Speech Transmission Index (STI), and Short-Time Objective Intelligibility (STOI). We have another article reviewing the speech intelligibility metrics.
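One common way to check where an objective metric struggles is to correlate its scores with subjective MOS separately for each distortion type. The sketch below assumes you already have both scores for a set of clips. The helper names (`pearson`, `correlation_by_distortion`), the distortion labels, and every number in `samples` are made up for illustration and are not measurements of any real metric or engine.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def correlation_by_distortion(samples):
    """Group (distortion, objective_score, mos) triples and correlate per group."""
    groups = {}
    for kind, objective, mos in samples:
        objective_scores, mos_scores = groups.setdefault(kind, ([], []))
        objective_scores.append(objective)
        mos_scores.append(mos)
    return {kind: pearson(obj, mos) for kind, (obj, mos) in groups.items()}


# Hypothetical data: (distortion type, objective metric score, subjective MOS).
samples = [
    ("codec", 3.9, 4.1),
    ("codec", 3.1, 3.2),
    ("codec", 2.2, 2.4),
    ("background_noise", 3.5, 2.1),
    ("background_noise", 3.0, 3.4),
    ("background_noise", 2.6, 2.9),
]

for kind, r in correlation_by_distortion(samples).items():
    print(f"{kind}: r = {r:.2f}")
```

A metric that tracks MOS closely for one distortion type but weakly or negatively for another is a signal to complement it with a different measure or with listening tests for that condition.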
Alternatively, you can score the quality by listening to the audio files, similar to MOS. Below, there’s a short audio clip comparing three scenarios: no noise suppression, Mozilla RNNoise, and Koala Noise Suppression.
Also, you can integrate Koala Noise Suppression with a few lines of code and evaluate its performance by using MOS with a few users.
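For a small in-house listening test like this, it helps to hide which engine produced each file so that listeners rate what they hear rather than what they expect. The following is a minimal sketch of that bookkeeping under stated assumptions: the condition names mirror the three scenarios above, while the function names, clip IDs, and the `sample_001` labeling scheme are hypothetical. It does not call any noise suppression library; it assumes the processed audio files already exist.

```python
import random

# The three scenarios compared in this article.
CONDITIONS = ["no_suppression", "rnnoise", "koala"]


def build_blind_playlist(clip_ids, conditions, seed=None):
    """Shuffle every (clip, condition) pair and hide it behind a neutral label.

    Returns the ordered list of labels to play and an answer key that maps
    each label back to its (clip, condition) pair.
    """
    rng = random.Random(seed)
    items = [(clip, cond) for clip in clip_ids for cond in conditions]
    rng.shuffle(items)
    playlist = []
    answer_key = {}
    for index, item in enumerate(items, start=1):
        label = f"sample_{index:03d}"
        playlist.append(label)
        answer_key[label] = item
    return playlist, answer_key


def mos_per_condition(ratings, answer_key):
    """Average 1-5 ratings per condition once the test is over.

    `ratings` maps each blind label to the list of scores listeners gave it.
    """
    by_condition = {}
    for label, scores in ratings.items():
        _, condition = answer_key[label]
        by_condition.setdefault(condition, []).extend(scores)
    return {cond: sum(s) / len(s) for cond, s in by_condition.items()}


# Hypothetical clip IDs; a fixed seed keeps the order reproducible.
playlist, answer_key = build_blind_playlist(["clip_a", "clip_b"], CONDITIONS, seed=7)
print(playlist)
```

Keep the answer key away from the listeners until all ratings are collected, then pass the collected scores to `mos_per_condition` to compare the three scenarios.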