WER is a common term in voice AI. “What’s your WER?” and “Our WER is X%” come up in early conversations between speech-to-text vendors and buyers. In this article, we’ll explain the basics of WER.

## What is WER?

WER stands for Word Error Rate. It measures the accuracy of a speech-to-text (STT) solution. As the name suggests, it shows how many errors an STT solution makes compared to a reference transcript, that is, the correct text with no errors. Thus, the lower the WER, the higher the accuracy.

## Why is WER important?

Accuracy is one of the most crucial factors when choosing an STT solution. It indicates how well a solution will work for you. Every vendor claims “the best” or “the highest” accuracy. We haven’t seen anybody claiming “mediocre” accuracy. WER offers a quantitative way to test those claims, so executives can make data-driven decisions rather than claim-based ones.

## How to calculate WER?

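WER has a simple formula: the sum of errors divided by the total number of words in the reference text, or WER = (S + D + I) / N. The errors are Substitutions (S), Deletions (D) and Insertions (I), and N is the number of words in the reference text. WER cannot be a negative number, but it can be above 100% when the number of errors is larger than the number of reference words, for example when the transcript contains many inserted words. Let’s have a simple example: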

“I am now going to bed.”

The total number of words = 6

STT Model 1: I am now going to bed.
WER = 0%     (Sum of Errors: 0)

STT Model 2: I am now to bed.
WER = 16.7% (Sum of Errors: 1, Deletion = 1: going)

STT Model 3: I am now going to dad.
WER = 16.7% (Sum of Errors: 1, Substitution = 1: dad)

STT Model 4: I am now going to the bed.
WER = 16.7% (Sum of Errors: 1, Insertion = 1: the)

• Model 1 is the same as the intended text. Therefore its WER is 0, and its accuracy is 100%.
• Model 2, Model 3 and Model 4 have a WER of 16.7%, despite different error types.
• Model 2 is an example of a Deletion. Model 3 is an example of Substitution, and Model 4 is an example of Insertion.
• Model 2 makes the transcribed text unclear by deleting the verb. Model 3 changes the meaning by substituting the word bed with dad. The result of Model 4 is not grammatically correct, yet it doesn't affect the meaning.
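The calculation can be sketched in a few lines of Python. This is a minimal illustration, not a production scorer: the function name `wer` is our own, and it assumes that words are split on whitespace, lowercased, and stripped of basic punctuation before comparison. The error count is the word-level edit distance, which finds the smallest number of substitutions, deletions and insertions needed to turn the reference into the transcript.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    # Assumption: lowercase and strip basic punctuation so "bed." matches "bed".
    ref = [w.strip(".,!?").lower() for w in reference.split()]
    hyp = [w.strip(".,!?").lower() for w in hypothesis.split()]
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = fewest errors aligning the reference words seen so far
    # with the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference word is missing
                curr[j - 1] + 1,         # insertion: extra hypothesis word
                prev[j - 1] + (r != h),  # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "I am now going to bed."
print(f"{wer(reference, 'I am now going to bed.'):.1%}")      # 0.0%
print(f"{wer(reference, 'I am now going to dad.'):.1%}")      # 16.7%
print(f"{wer(reference, 'I am now going to the bed.'):.1%}")  # 16.7%

# Hypothetical case showing a WER above 100%: 1 substitution + 2 insertions
# against a one-word reference.
print(f"{wer('Hello', 'Hi there friend'):.1%}")               # 300.0%
```

Real evaluations usually apply more careful text normalization (numbers, contractions, casing) before scoring, and those choices can change the resulting WER.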

Another example:

“My name is Mary and I am a writer.”

The total number of words = 9

STT Model 1: My name is Mary end I am writer.
WER = 22.2% (Sum of Errors: 2, S = 1: end & D = 1: a)

STT Model 2: My name is Mary and I am a rider.
WER = 11.1% (Sum of Errors: 1, S = 1: rider)

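Model 1 has a higher WER than Model 2, yet its transcript is still easy to understand. Model 2 has the lower WER but changes the meaning of the sentence: Mary becomes a rider instead of a writer. WER counts errors; it does not weigh how much each error matters.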
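To see which error types sit behind a score, the same alignment can also report S, D and I separately. The sketch below is one possible way to do it, under the same assumptions as any simple scorer (whitespace tokenization, lowercasing, basic punctuation stripped); `error_counts` is a hypothetical helper name. When several alignments have the same total number of errors, the split between error types may differ from what another tool reports, although the total stays the same.

```python
def error_counts(reference: str, hypothesis: str) -> dict:
    """Count substitutions (S), deletions (D) and insertions (I) for one
    minimum-error alignment; N is the number of reference words."""
    ref = [w.strip(".,!?").lower() for w in reference.split()]
    hyp = [w.strip(".,!?").lower() for w in hypothesis.split()]

    # Each cell holds (total_errors, substitutions, deletions, insertions).
    # With zero reference words, every hypothesis word is an insertion.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        # With zero hypothesis words, every reference word is a deletion.
        curr = [(i, 0, i, 0)]
        for j, h in enumerate(hyp, start=1):
            t, s, d, n = prev[j - 1]
            diagonal = (t, s, d, n) if r == h else (t + 1, s + 1, d, n)
            t, s, d, n = prev[j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = curr[j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Keep the option with the fewest total errors.
            curr.append(min(diagonal, deletion, insertion, key=lambda c: c[0]))
        prev = curr

    _, subs, dels, ins = prev[-1]
    return {"S": subs, "D": dels, "I": ins, "N": len(ref)}


print(error_counts("My name is Mary and I am a writer.",
                   "My name is Mary and I am a rider."))
# {'S': 1, 'D': 0, 'I': 0, 'N': 9}

print(error_counts("I am now going to bed.",
                   "I am now going to the bed."))
# {'S': 0, 'D': 0, 'I': 1, 'N': 6}
```

The counts describe how many errors occurred, not how serious they are, so reading the transcripts alongside the numbers remains necessary.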

WER is the most widely accepted method to measure speech-to-text accuracy. However, we cannot take any WER at face value, because its interpretation has nuances. Check out the other critical things you should know about WER.