Using Speech Recognition for Language Learning seems like a reasonable thing to do, right? Instead of having someone listen to you and tell you whether you pronounced something correctly, you can use AI to tell you how well you did. Even better, it can tell you what to do to improve.

Things to Avoid

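Don't use a Speech-to-Text engine! Let me expand on that. We have seen people try to use a Speech-to-Text API to evaluate how well someone pronounced a phrase or word. They use the transcription output to check whether the learner uttered the test phrase correctly. This method is extremely noisy because Speech-to-Text is not just an Acoustic Model. It also has a Language Model, which steers the output toward likely word sequences. A mispronounced word can still come back transcribed correctly because the context makes it probable, and a well-pronounced but unexpected word can be transcribed as something else. In other words, it does not measure how well you pronounced something in isolation.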

Speech-to-Text is a powerful form of Speech Recognition, but it is not the only technology in the field, and it is not the right fit for language learning.

Easy Starts

The easy solution is to measure how close the pronunciation is to how a native speaker says it. Imagine that you want to see how well someone pronounces "Umbrella". Here is how to do it:

  • Get a native speaker to say the word
  • Get the learner to say the word
  • Measure the distance between the two utterances

Measuring the distance can be a bit tricky. We need to transform speech into a representation that depends as little as possible on the speaker and still captures pronunciation. Mel-Frequency Cepstral Coefficients (MFCCs) are a good candidate. We can then measure the difference between the two sequences of MFCCs using a method like Dynamic Time Warping (DTW), which aligns utterances spoken at different speeds before comparing them.
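To make the DTW step concrete, here is a minimal sketch in plain Python. It assumes the MFCC frames have already been extracted by some audio library and arrive as lists of numbers, one list per frame. The function names, the Euclidean frame distance, and the length normalization are illustrative choices, not the only way to do it.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(ref, test):
    """Length-normalized DTW cost between two sequences of feature frames."""
    n, m = len(ref), len(test)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of ref[:i] with test[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(ref[i - 1], test[j - 1])
            # Either sequence may repeat a frame, or both advance together.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Normalize so long utterances are not penalized just for being long.
    return cost[n][m] / (n + m)


# Toy one-dimensional "frames" standing in for real MFCC vectors.
reference = [[0.0], [1.0], [2.0]]
slow_learner = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]  # same shape, half speed
off_learner = [[5.0], [5.0], [5.0]]

same_shape = dtw_distance(reference, slow_learner)  # 0.0: only the timing differs
different = dtw_distance(reference, off_learner)    # greater than 0
```

The slower utterance scores zero because DTW is allowed to stretch time. That is the property we want: speaking slowly should not count as mispronouncing.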

In practice, you need more than one reference speaker to account for the natural variation in how native speakers pronounce the same word.
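One simple way to use several reference speakers is to score the learner against each of them and keep the closest match. The sketch below assumes you already have some utterance distance (for example, a DTW cost over MFCC frames) and passes it in as a function. `nearest_reference_distance` and `toy_distance` are hypothetical names for illustration, and the toy distance is only a stand-in so the example runs on its own.

```python
def nearest_reference_distance(test, references, distance):
    """Return (best_distance, best_index) over all reference utterances."""
    if not references:
        raise ValueError("at least one reference utterance is required")
    scored = [(distance(ref, test), idx) for idx, ref in enumerate(references)]
    # min() compares the distances first, so the closest reference wins.
    return min(scored)


def toy_distance(a, b):
    """Stand-in distance for the demo: mean absolute difference of equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Two made-up reference speakers and one learner attempt.
references = [[1.0, 2.0, 3.0], [1.0, 2.5, 3.5]]
learner = [1.0, 2.4, 3.4]

best, which = nearest_reference_distance(learner, references, toy_distance)
# `which` is 1 here: the learner is closest to the second reference speaker.
```

Taking the minimum means a learner is not penalized for sounding like one valid native variant and unlike another. Averaging over references is an alternative, but it pulls the score toward a pronunciation nobody actually uses.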

Going Deep

Maybe a word-level comparison is too coarse. What if we could compare individual sounds, i.e. phonemes? Then, when the learner says a test word, the AI can predict the phonemes that were spoken and compare them to what you expect from a native speaker (i.e. whatever is in the pronunciation dictionary). All you need is a model that recognizes phonemes and a pronunciation dictionary.
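The comparison step can be sketched as an edit-distance alignment between the expected phonemes and the predicted ones. This assumes a phoneme model has already produced the predicted sequence. The ARPAbet-style phonemes for "Umbrella" below are written by hand for illustration (stress markers omitted), and `align_phonemes` is a hypothetical helper, not part of any library.

```python
def align_phonemes(expected, predicted):
    """Return (edit_distance, errors) between two phoneme sequences.

    Each error is a pair (expected_phoneme, predicted_phoneme); None marks
    a phoneme the learner dropped or one they inserted.
    """
    n, m = len(expected), len(predicted)
    # dist[i][j] is the edit distance between expected[:i] and predicted[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if expected[i - 1] == predicted[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)
    # Walk back through the table to recover which phonemes went wrong.
    errors = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if expected[i - 1] == predicted[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                if sub:
                    errors.append((expected[i - 1], predicted[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            errors.append((expected[i - 1], None))  # learner dropped a phoneme
            i -= 1
        else:
            errors.append((None, predicted[j - 1]))  # learner added a phoneme
            j -= 1
    errors.reverse()
    return dist[n][m], errors


# Illustrative dictionary entry and a learner who says "L" instead of "R".
expected = ["AH", "M", "B", "R", "EH", "L", "AH"]
predicted = ["AH", "M", "B", "L", "EH", "L", "AH"]

distance, errors = align_phonemes(expected, predicted)
# distance is 1 and errors is [("R", "L")]
```

Unlike a word-level distance, the result says which sound went wrong, and that is what makes targeted feedback possible.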

Giving Feedback

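It would be helpful to get corrective directions from the AI when the learner doesn't get a good score. If we collect the misclassifications over time (the phonemes the model heard in place of the expected ones), we can put them in buckets (i.e. cluster them). A subject matter expert can then attach advice to each bucket, and the learner receives the advice for the bucket their error falls into.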
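As a rough sketch of the bucketing idea, the simplest possible "cluster" is the exact pair of expected and heard phonemes. A real system might group errors more cleverly. The error log, the advice table, and the function names below are all made up for illustration, and the advice strings are placeholders for text an expert would write.

```python
from collections import Counter

# Hypothetical advice table, keyed by (expected, heard) phoneme pair.
ADVICE_BY_BUCKET = {
    ("R", "L"): "placeholder: expert advice for R heard as L",
    ("TH", "S"): "placeholder: expert advice for TH heard as S",
}


def bucket_errors(error_log):
    """Count how often each (expected, heard) confusion occurs."""
    return Counter(error_log)


def advice_for(error, advice_by_bucket, default="No advice recorded for this error yet."):
    """Look up the expert's advice for the bucket an error falls into."""
    return advice_by_bucket.get(error, default)


# Made-up log of confusions collected from many learner attempts.
error_log = [("R", "L"), ("TH", "S"), ("R", "L"), ("V", "W"), ("R", "L")]

buckets = bucket_errors(error_log)
most_common_error, count = buckets.most_common(1)[0]
tip = advice_for(most_common_error, ADVICE_BY_BUCKET)
```

Sorting buckets by frequency also tells the expert where to spend time first: the most common confusions are the ones whose advice will reach the most learners.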