Noise suppression and speech enhancement are often used interchangeably. Noise suppression engines suppress noise, as the name suggests, which results in enhanced speech. Researchers use speech quality and speech intelligibility metrics to measure speech enhancement. This article covers the best-known and most widely used speech quality metrics: Mean Opinion Score (MOS), Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Analysis (POLQA), 3-fold Quality Evaluation of Speech in Telecommunications (3QUEST), and Non-intrusive Objective Speech Quality Assessment (NISQA).

What’s the Mean Opinion Score (MOS)?

MOS is a commonly used but subjective method for evaluating speech quality. The approach is straightforward: gather a diverse group of people (test subjects), ask them to listen to recordings, and have them rate the quality on a scale of 1 to 5. Then average the scores, hence the name mean opinion score. ITU offers recommendations for designing, running, and reporting the experiments to minimize subjectivity. However, MOS is still a subjective assessment of sound quality (or impairment). Thus, it’s critical to understand the experiment conditions before interpreting the results, as they’re easy to manipulate.

MOS (Mean Opinion Score) survey methodology and questions to assess speech enhancement quality.

Unipolar discrete five-grade scale alternatives to calculate MOS by ITU General Methods for the Subjective Assessment of Sound Quality

MOS is time-consuming and expensive. Minimizing bias and achieving more reliable results requires tests with more listeners (test subjects) from diverse backgrounds, which takes time and effort. Thus, automated tests that serve as computational proxies for MOS are also popular.
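The averaging step, and the reason larger listener panels give more reliable results, can be sketched in a few lines of Python. This is a minimal illustration, not an ITU-prescribed procedure: the function name `mean_opinion_score`, the sample ratings, and the normal-approximation confidence interval are all assumptions made for the example.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    mos = statistics.mean(ratings)
    # Normal approximation (z = 1.96). Assumes independent listeners;
    # the interval shrinks with the square root of the panel size.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners for a single clip
ratings = [4, 5, 3, 4, 4, 2, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval next to the score makes it easier to tell whether a difference between two engines is larger than the listener-to-listener variation.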

What’s Perceptual Evaluation of Speech Quality (PESQ)?

PESQ is a commonly used metric whose source code is available on the ITU website. It aims to measure speech quality after network and codec distortions, and it does not directly measure noise suppression quality. More importantly, it dates from the early 2000s and does not reflect the state of the art. Therefore, ITU replaced it with POLQA.

What’s Perceptual Objective Listening Quality Analysis (POLQA)?

ITU released POLQA as the successor of PESQ. Similar to PESQ, POLQA measures quality after network and codec distortions. Both PESQ and POLQA are computational approximations of MOS.

What’s 3-fold Quality Evaluation of Speech in Telecommunications (3QUEST)?

3QUEST is another metric recommended by ITU. It measures the background noise in a transmitted signal. It also builds on subjective MOS ratings, not just for the speech but also for the noise and the overall quality of the sound. S-MOS refers to Speech-MOS, N-MOS to Noise-MOS, and G-MOS to General-MOS or Global-MOS.

Speech-MOS (S-MOS), Noise-MOS (N-MOS), and General or Global-MOS (G-MOS) methodology and questionnaire for the speech quality assessment

S-MOS, N-MOS, and G-MOS Evaluation Survey by ITU P.835 Subjective Test Methodology for Evaluating Speech Communication Systems that Include Noise Suppression Algorithm
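To make the three scores concrete, here is a minimal sketch of how per-listener ratings could be aggregated into S-MOS, N-MOS, and G-MOS. It only illustrates the averaging of subjective ratings; it is not the 3QUEST algorithm, which predicts these scores computationally. The function name `p835_scores` and the ratings are made up for illustration.

```python
from statistics import mean

# Each tuple is one listener's (speech, noise, overall) rating on a 1-5 scale.
# The values are hypothetical.
votes = [
    (4, 3, 4),
    (5, 2, 3),
    (4, 4, 4),
    (3, 3, 3),
]


def p835_scores(votes):
    """Average each rating dimension separately across listeners."""
    speech, noise, overall = zip(*votes)
    return {
        "S-MOS": mean(speech),
        "N-MOS": mean(noise),
        "G-MOS": mean(overall),
    }


scores = p835_scores(votes)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Keeping the dimensions separate shows trade-offs that a single score hides, for example an engine that removes more noise (higher N-MOS) at the cost of more speech distortion (lower S-MOS).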

What’s Non-intrusive Objective Speech Quality Assessment (NISQA)?

NISQA provides predictions on overall speech quality and speech quality dimensions: Noisiness, Coloration, Discontinuity, and Loudness. Unlike the intrusive models mentioned above, NISQA is a non-intrusive method, meaning it does not require clean reference data for the calculation.

There is no quality metric widely accepted by the industry or researchers to evaluate the performance of speech enhancement engines because each quality metric has challenges with specific distortion types. For example, PESQ scores correlate well with listener ratings for distortions introduced by telecommunication networks, which is why it is often chosen to measure the speech quality of speech enhancement engines. Yet it predicts the effects of background distortion less accurately than the effects of signal distortion. Thus, it’s crucial to understand the shortcomings of each measure to make a well-informed decision. Besides speech quality measures, researchers use speech intelligibility metrics such as Speech Intelligibility Index (SII), Speech Transmission Index (STI), and Short-Time Objective Intelligibility (STOI). We have another article reviewing the speech intelligibility metrics.
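One common way to check how well an objective metric tracks listeners for a given distortion type is to compute the correlation between its scores and subjective MOS on the same clips. The sketch below uses the Pearson correlation coefficient; the helper `pearson` and both score lists are hypothetical and are not results from any real evaluation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-clip scores: listener MOS vs. an objective metric
subjective_mos = [1.8, 2.5, 3.1, 3.9, 4.4]
metric_scores = [1.5, 2.9, 2.8, 3.6, 4.2]

r = pearson(subjective_mos, metric_scores)
print(f"Pearson r = {r:.3f}")
```

Running this separately for clips grouped by distortion type (for example codec artifacts versus background noise) would show where a metric agrees with listeners and where it does not.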

Alternatively, you can score the quality by listening to the audio files, similar to MOS. Below, there’s a short audio clip comparing three scenarios: no noise suppression, Mozilla RNNoise, and Koala Noise Suppression.

Also, you can integrate Koala Noise Suppression with a few lines of code and evaluate its performance by using MOS with a few users.
