The terms noise suppression and speech enhancement are often used interchangeably. Noise suppression engines suppress noise, as the name suggests, resulting in enhanced speech. Researchers use Speech Quality and Speech Intelligibility metrics to measure speech enhancement. This article covers the best-known and most widely used speech quality metrics:
Mean Opinion Score (MOS),
Perceptual Evaluation of Speech Quality (PESQ),
Perceptual Objective Listening Quality Analysis (POLQA),
3-fold Quality Evaluation of Speech in Telecommunications (3QUEST), and
Non-intrusive Objective Speech Quality Assessment (NISQA).
What’s the Mean Opinion Score (MOS)?
MOS is a commonly used but subjective method for evaluating speech quality. The approach is straightforward: gather a diverse group of people (test subjects), ask them to listen to recordings, and have them rate the quality on a scale of 1 to 5. Then average the scores, hence the name mean opinion score. The ITU offers recommendations for designing, running, and reporting the experiments to minimize subjectivity. However, MOS is still a subjective assessment of sound quality (or impairment). Thus, it’s critical to understand the experimental conditions before interpreting the results, as they’re easy to manipulate.
MOS is time-consuming and expensive. Minimizing bias and achieving more reliable results requires tests with more listeners (test subjects) from diverse backgrounds, which takes time and effort. Thus, automated tests that serve as computational proxies for MOS are popular as well.
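The averaging step described above can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the function name `mean_opinion_score`, the sample ratings, and the choice of a normal-approximation 95% confidence interval are all assumptions made for the example.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% confidence half-width) for one test condition.

    `ratings` holds one score per listener on the 1-5 scale.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation (1.96); a small panel would call for a
    # t-distribution quantile instead.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners for a single clip.
ratings = [4, 5, 3, 4, 4, 2, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval next to the mean shows how much the score depends on the size of the listener panel: with only a few listeners, the interval is wide.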
What’s Perceptual Evaluation of Speech Quality (PESQ)?
PESQ is a commonly used metric, and its source code is available on the ITU website. It aims to measure speech quality after network and codec distortions, not noise suppression quality directly. More importantly, it dates from the early 2000s and does not reflect the state of the art. Therefore, ITU replaced it with POLQA.
What’s Perceptual Objective Listening Quality Analysis (POLQA)?
POLQA is the successor of PESQ. Similar to PESQ, POLQA is a measure of the quality after network and codec distortions. POLQA uses a computational approximation of MOS.
What’s 3-fold Quality Evaluation of Speech in Telecommunications (3QUEST)?
3QUEST is another metric recommended by ITU. It’s designed to assess background noise in a transmitted signal. It also builds on the subjective measure MOS, not just for speech but also for noise and the overall quality of the sound. S-MOS refers to the speech score, N-MOS to the noise score, and G-MOS to the overall score.
What’s Non-intrusive Objective Speech Quality Assessment (NISQA)?
NISQA predicts overall speech quality and four speech quality dimensions: Noisiness, Coloration, Discontinuity, and Loudness. Unlike the intrusive models mentioned above, NISQA is a non-intrusive method, meaning it does not require a clean reference signal for the calculation.
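The difference between intrusive and non-intrusive methods shows up in what each one takes as input. The sketch below uses two deliberately simple stand-ins, a signal-to-noise ratio and a clipping ratio, chosen only to show the two interfaces. They are not PESQ, POLQA, or NISQA, which are far more elaborate perceptual models, and both function names are hypothetical.

```python
import math


def intrusive_snr_db(reference, degraded):
    """Toy intrusive score: needs the clean reference AND the degraded signal."""
    if len(reference) != len(degraded):
        raise ValueError("signals must be time-aligned and of equal length")
    signal_power = sum(x * x for x in reference)
    noise_power = sum((x - y) ** 2 for x, y in zip(reference, degraded))
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(signal_power / noise_power)


def non_intrusive_clipping_ratio(degraded, full_scale=1.0):
    """Toy non-intrusive score: looks at the degraded signal only."""
    clipped = sum(1 for x in degraded if abs(x) >= full_scale)
    return clipped / len(degraded)


# Synthetic example: a 440 Hz tone at 16 kHz plus an alternating offset.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
noisy = [x + 0.1 * (-1) ** n for n, x in enumerate(clean)]

print(intrusive_snr_db(clean, noisy))       # requires `clean`
print(non_intrusive_clipping_ratio(noisy))  # works without `clean`
```

In practice, this is why non-intrusive methods suit recordings captured in the field, where no clean reference exists.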
There is no quality metric widely accepted by the industry or researchers for evaluating the performance of speech enhancement engines because each quality metric has challenges with specific distortion types. For example, PESQ correlates with distortions introduced by telecommunication networks, making it a better choice to measure the speech quality of speech enhancement engines. Yet it predicts the effects of background distortion less well than those of signal distortion. Thus, it’s crucial to understand the shortcomings of each measure to make a well-informed decision. Besides speech quality measures, researchers use speech intelligibility metrics such as
Speech Intelligibility Index (SII),
Speech Transmission Index (STI), and
Short-Time Objective Intelligibility (STOI). We have another article reviewing the speech intelligibility metrics.
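One common way to check how well an objective metric stands in for listener opinion on a given distortion type is to correlate its scores with MOS across the same set of clips. The sketch below computes the Pearson correlation by hand. The helper name `pearson` is hypothetical, and the numbers are made up purely for illustration; they are not measurements of any metric or engine.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers for illustration only: one pair of scores per clip.
subjective_mos = [1.8, 2.5, 3.1, 3.9, 4.4]
objective_score = [1.5, 2.9, 3.0, 3.6, 4.2]

r = pearson(subjective_mos, objective_score)
print(f"Pearson r = {r:.3f}")
```

Running the same check separately for each distortion type (for example, background noise versus codec artifacts) can reveal where a metric tracks listeners well and where it does not.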
Alternatively, you can score the quality yourself by listening to the audio files, similar to MOS. Below, there’s a short audio clip comparing three scenarios: no noise suppression, Mozilla RNNoise, and Koala Noise Suppression.
Also, you can integrate Koala Noise Suppression with a few lines of code and evaluate its performance by using MOS with a few users.