Using Speech Recognition for Language Learning seems like a reasonable thing to do, right? Instead of having someone listen to you and tell you whether you pronounced something correctly, you can use AI to tell you how well you did. Even better, it can tell you what to do to improve.
Things to Avoid
Don't use a Speech-to-Text engine! Let me expand on that. We have seen people try to use a Speech-to-Text API to evaluate how well someone pronounced a phrase or word. They use the transcription output to check whether the learner uttered the test phrase correctly. This method is extremely noisy because a Speech-to-Text engine is not just an Acoustic Model. It also has a Language Model, which nudges the transcription toward likely words and phrases, so a mispronounced word can still come out transcribed correctly. In other words, it does not measure how well you pronounced something in isolation.
Easy Starts
The easy solution is to measure how close the pronunciation is to how a native speaker says it. Imagine that you want to see how well someone pronounces "Umbrella". Here is how to do it:
- Get a native speaker to say the word
- Get the learner to say the word
- Measure the distance between the two utterances
Measuring the distance can be a bit tricky. We need to transform speech into a domain that is independent of the speaker and representative of pronunciation. Mel-Frequency Cepstral Coefficients (MFCCs) are a good candidate. Then measure the difference between the two sequences of MFCCs using a method like Dynamic Time Warping (DTW).
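As a rough illustration of the distance step, here is a minimal DTW sketch in plain Python. It assumes the MFCC frames have already been extracted (for example with a library such as librosa) and are passed in as lists of per-frame coefficient lists. The function names, the Euclidean frame distance, and the length normalization are choices made for this example, not a prescribed recipe.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(reference, learner):
    """Length-normalized DTW cost between two sequences of feature frames.

    Each argument is a list of frames; each frame is a list of floats
    (e.g. MFCCs). Lower values mean the two utterances are more alike.
    """
    n, m = len(reference), len(learner)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")

    inf = float("inf")
    # cost[i][j] = cheapest way to align the first i reference frames
    # with the first j learner frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(reference[i - 1], learner[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # learner frame j is stretched
                cost[i][j - 1],      # reference frame i is stretched
                cost[i - 1][j - 1],  # both sequences advance
            )

    # Normalizing by n + m is one common convention (an assumption here)
    # so that long words are not penalized just for being long.
    return cost[n][m] / (n + m)


# Toy one-dimensional "features": the learner speaks at half speed.
native = [[0.0], [1.0], [2.0]]
slow_learner = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
print(dtw_distance(native, slow_learner))
```

Because DTW is allowed to stretch either sequence, a learner who says the word more slowly but with the same sounds is not penalized for the difference in speed.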
In practice, you need more than one reference speaker to account for differences in how native speakers pronounce the same word.
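One simple way to use several reference speakers, sketched below, is to score the learner against each of them and keep the closest match. The function name, the dictionary layout, and the toy distance are all hypothetical; in a real setup the `distance` argument would be whatever utterance distance you settled on.

```python
def closest_reference(learner, references, distance):
    """Return (speaker_name, distance) for the best-matching reference.

    `references` maps a speaker name to that speaker's features.
    `distance` is any function taking (reference, learner) features
    and returning a number where lower means more similar.
    """
    if not references:
        raise ValueError("at least one reference speaker is required")
    scores = {name: distance(ref, learner) for name, ref in references.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Toy stand-in for a real utterance distance (illustration only).
def toy_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


references = {"speaker_a": [1, 2, 3], "speaker_b": [1, 4, 3]}
print(closest_reference([1, 5, 3], references, toy_distance))
```

Taking the minimum treats any native pronunciation as acceptable. Averaging over references would be an alternative if you want to score against a "typical" pronunciation instead.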
Going Deep
Maybe a word-level comparison is too coarse. What if we could compare individual sounds, i.e. phonemes? Then, when the learner says a test word, the AI can predict the phonemes and compare them to what you would expect from a native speaker (i.e. whatever is in the dictionary). All you need is a model that understands phonemes and a pronunciation dictionary.
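To make the comparison step concrete, here is a small sketch that aligns the predicted phonemes with the dictionary phonemes using edit distance and reports where they differ. It assumes a phoneme model has already produced the predicted sequence. The function name is made up for this example, and the ARPAbet-style transcription of "umbrella" (stress marks omitted) is used for illustration only.

```python
def align_phonemes(expected, predicted):
    """Align two phoneme sequences with Levenshtein edit distance.

    Returns (distance, pairs), where pairs is a list of
    (expected_phoneme, predicted_phoneme) tuples. None on the right
    means the learner dropped a sound; None on the left means the
    learner added one.
    """
    n, m = len(expected), len(predicted)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if expected[i - 1] == predicted[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub,  # match or substitution
                dist[i - 1][j] + 1,        # learner dropped a phoneme
                dist[i][j - 1] + 1,        # learner added a phoneme
            )

    # Walk back through the table to recover the alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if expected[i - 1] == predicted[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                pairs.append((expected[i - 1], predicted[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((expected[i - 1], None))
            i -= 1
        else:
            pairs.append((None, predicted[j - 1]))
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


# Illustrative dictionary entry and a hypothetical model prediction.
expected = ["AH", "M", "B", "R", "EH", "L", "AH"]
predicted = ["AH", "M", "B", "L", "EH", "L", "AH"]
distance, pairs = align_phonemes(expected, predicted)
errors = [(e, p) for e, p in pairs if e != p]
print(distance, errors)
```

The list of mismatched pairs is what makes this finer-grained than a word-level score: it says which sound went wrong, not just that something did.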
Giving Feedback
It would be helpful to get corrective directions from the AI when the learner doesn't get a good score. If we collect misclassifications over time, we can put them into buckets (i.e. cluster them). A subject matter expert can then attach advice to each bucket.
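A very simple version of this idea is sketched below. It buckets errors by the exact (expected, predicted) phoneme pair, which is a simplification of real clustering, and looks up advice in a table that an expert would fill in. The function names, the error log, and the advice text are all placeholders invented for the example.

```python
from collections import Counter


def bucket_errors(error_log):
    """Count how often each (expected, predicted) confusion occurs."""
    return Counter(error_log)


def feedback_for(buckets, advice, top_n=3):
    """Return (pair, count, tip) for the most frequent buckets with advice.

    `advice` maps a confusion pair to text written by a subject
    matter expert. Buckets without advice are skipped.
    """
    messages = []
    for pair, count in buckets.most_common(top_n):
        tip = advice.get(pair)
        if tip is not None:
            messages.append((pair, count, tip))
    return messages


# Hypothetical log of misclassifications collected over time.
error_log = [("R", "L"), ("R", "L"), ("TH", "S"), ("R", "L")]

# Placeholder advice; a real table would come from an expert.
advice = {("R", "L"): "Expert-written tip for the R/L confusion goes here."}

for pair, count, tip in feedback_for(bucket_errors(error_log), advice):
    print(pair, count, tip)
```

Buckets that show up often but have no advice yet are a useful signal in themselves: they tell the expert where to write next.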