Measuring Word Error Rate for Beginners

WER is a commonly used term in voice AI. “What’s your WER?” and “Our WER is X%” come up early in conversations between speech-to-text vendors and buyers. In this article, we’ll explain the basics of WER.

What is WER?

WER stands for Word Error Rate. It measures the accuracy of a speech-to-text (STT) solution. As the name suggests, it shows how many word-level errors an STT engine makes compared to an error-free reference transcript. Thus, the lower the WER, the higher the accuracy.

Why is WER important?

Accuracy is one of the most crucial factors when choosing an STT solution. It indicates how well a solution will work for you. Every vendor claims “the best” or “the highest accuracy.” We haven’t seen anybody claiming “mediocre” accuracy. WER offers a quantitative way to test those claims, so executives can make data-driven decisions rather than claim-based ones.

How to calculate WER?

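WER has a simple formula: the sum of errors divided by the total number of words in the reference text (the correct transcript). The only nuance is that errors include Substitutions (S), Insertions (I) and Deletions (D), so WER = (S + D + I) / N, where N is the number of words in the reference. WER cannot be a negative number, but it can be above 100%, because the number of insertions is not limited by the length of the reference.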
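
To make the formula concrete, here is a minimal Python sketch. The function name wer_from_counts and its parameter names are ours, chosen for illustration; it assumes the errors have already been counted by hand or by another tool.

```python
def wer_from_counts(substitutions, deletions, insertions, reference_words):
    """Return WER as a fraction: (S + D + I) / N.

    reference_words (N) is the number of words in the correct text,
    not in the STT output.
    """
    if reference_words <= 0:
        raise ValueError("the reference text must contain at least one word")
    if min(substitutions, deletions, insertions) < 0:
        raise ValueError("error counts cannot be negative")
    return (substitutions + deletions + insertions) / reference_words


# One deletion in a six-word reference.
print(f"{wer_from_counts(0, 1, 0, 6):.1%}")

# Three inserted words against a two-word reference: WER goes above 100%.
print(f"{wer_from_counts(0, 0, 3, 2):.1%}")
```

The second call shows why WER has no upper limit: the divisor is fixed by the reference, while an STT engine can insert any number of extra words.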

Let’s have a simple example:

“I am now going to bed.”

The total number of words = 6

STT Model 1: I am now going to bed.
WER = 0% (Sum of Errors: 0)

STT Model 2: I am now to bed.
WER = 16.7% (Sum of Errors: 1, Deletion = 1: going)

STT Model 3: I am now going to dad.
WER = 16.7% (Sum of Errors: 1, Substitution = 1: dad)

STT Model 4: I am now going to the bed.
WER = 16.7% (Sum of Errors: 1, Insertion = 1: the)

The output of Model 1 is the same as the intended text. Therefore its WER is 0%, and its accuracy is 100%.
Model 2, Model 3 and Model 4 have a WER of 16.7%, despite different error types.
Model 2 is an example of a Deletion. Model 3 is an example of Substitution, and Model 4 is an example of Insertion.
Model 2 makes the transcribed text unclear by dropping the word going. Model 3 changes the meaning by substituting the word bed with dad. The result of Model 4 is not idiomatic, yet the extra word doesn't affect the meaning.
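
In practice, nobody counts errors by hand. A common approach is to align the two texts with a word-level edit distance (Levenshtein distance) and read S, D and I off the cheapest alignment. The sketch below is our own illustration under simple assumptions, not the scoring code of any particular vendor: the names tokenize, count_errors and wer are hypothetical, and the text clean-up (lowercasing, stripping basic punctuation) is an assumption you may want to change.

```python
def tokenize(text):
    # Assumption: compare lowercase words with basic punctuation removed,
    # so that "bed." and "bed" count as the same word.
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    return [w for w in words if w]


def count_errors(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for the cheapest alignment."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    # cell[i][j] = (total, S, D, I) for aligning ref[:i] with hyp[:j].
    cell = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cell[i][0] = (i, 0, i, 0)  # only deletions are possible
    for j in range(1, len(hyp) + 1):
        cell[0][j] = (j, 0, 0, j)  # only insertions are possible
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            total, s, d, ins = cell[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (total, s, d, ins)          # words match, no cost
            else:
                diagonal = (total + 1, s + 1, d, ins)  # substitution
            total, s, d, ins = cell[i - 1][j]
            deletion = (total + 1, s, d + 1, ins)
            total, s, d, ins = cell[i][j - 1]
            insertion = (total + 1, s, d, ins + 1)
            cell[i][j] = min(diagonal, deletion, insertion, key=lambda c: c[0])
    return cell[-1][-1][1:]


def wer(reference, hypothesis):
    n = len(tokenize(reference))
    if n == 0:
        raise ValueError("the reference text must contain at least one word")
    return sum(count_errors(reference, hypothesis)) / n


reference = "I am now going to bed."
for output in ["I am now going to bed.", "I am now to bed.",
               "I am now going to dad.", "I am now going to the bed."]:
    s, d, i = count_errors(reference, output)
    print(f"{output!r}: S={s} D={d} I={i} WER={wer(reference, output):.1%}")
```

When several alignments have the same cost, the split between S, D and I may differ from one tool to another, while the total number of errors, and therefore the WER, stays the same.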

Another example:

“My name is Mary and I am a writer.”

The total number of words = 9

STT Model 1: My name is Mary end I am writer.
WER = 22.2% (Sum of Errors: 2, S = 1: end & D = 1: a)

STT Model 2: My name is Mary and I am a rider.
WER = 11.1% (Sum of Errors: 1, S = 1: rider)

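Model 1 has a higher WER than Model 2. However, Model 2 changes the meaning of the sentence, while the output of Model 1 is still easy to understand. WER counts errors; it does not weigh how much each error matters.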

WER is the most widely used method to measure speech-to-text accuracy. However, we cannot take any WER at face value, because the interpretation of WER has nuances. Check out the other critical things you should know about WER.
