Emotion Recognition is the field of automatically detecting human emotions.
Speech Emotion Recognition is a subfield of it that focuses on spoken signals. For example, given an audio file, decide if the speaker is angry, neutral, or happy.
Connection to Sentiment Analysis
Speech Emotion Recognition is related to Sentiment Analysis, but the latter is a subtopic of NLP that infers sentiment from text rather than from the speech signal. Used together, Speech Emotion Recognition and Sentiment Analysis can capture both the semantic and the vocal side of emotion.
Speech Emotion Recognition can help call centres, customer service, and voice assistants. It can provide real-time feedback (e.g. acting as a sales coach for a salesperson) or large-scale analytics (e.g. finding the percentage of frustrated customers).
There is a large body of research on Speech Emotion Recognition, but no readily available commercial offering at the time of this writing. Below we list the challenges we foresee in building an enterprise-grade Speech Emotion Recognition system.
There are several free and paid datasets available for Speech Emotion Recognition, but they have shortcomings.
- These datasets are extremely limited in their coverage of languages, speakers, genders, ages, dialects, etc. A model that is trained, validated, and tested on such a limited dataset tends to look overconfident on paper (overfitting) and then fail hard in the field.
- There is no standard set of labels for human emotions! These datasets all use a slightly different set of emotions.
- Emotions depend on context. Being angry in an online game differs from being angry with a customer service agent, so a global definition of what an emotion looks like seems somewhat futile. Check the image of the Feelings Wheel below for a visual explanation.
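Because each dataset uses its own emotion labels, pooling them usually starts with mapping every source label onto a shared set. The sketch below is one way to do that. The dataset names, label spellings, and the mapping itself are hypothetical and only illustrate the idea; any real mapping is a judgment call and loses information.

```python
from collections import Counter

# Hypothetical label vocabularies for two imaginary datasets.
# Mapping "exc" (excited) to "happy" is a lossy, illustrative choice.
LABEL_MAP = {
    "dataset_a": {"anger": "angry", "happiness": "happy", "neutral": "neutral"},
    "dataset_b": {"ang": "angry", "hap": "happy", "exc": "happy", "neu": "neutral"},
}


def harmonize(dataset, label):
    """Return the shared label, or None if there is no agreed mapping."""
    return LABEL_MAP.get(dataset, {}).get(label.lower())


def merge(examples):
    """Map (dataset, clip_id, label) tuples onto the shared label set.

    Returns the mapped examples and a count of the labels that were dropped,
    so unmapped emotions stay visible instead of silently disappearing.
    """
    merged = []
    dropped = Counter()
    for dataset, clip_id, label in examples:
        shared = harmonize(dataset, label)
        if shared is None:
            dropped[(dataset, label)] += 1
        else:
            merged.append((clip_id, shared))
    return merged, dropped
```

Keeping a count of dropped labels makes it easy to see how much data a given shared label set costs you.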
Curating a Dataset
The direct approach is to collect and label the training examples yourself. It is best if you have a way to keep collecting data beyond the first version of the dataset. Why? You will almost certainly need more data after your first training run. You might also want to invest in building a learning loop, so you can keep gathering and labelling data from the models you run in production. This method is expensive and time-consuming.
If you don't have the means to collect a dataset, you need to get creative! You can use crowdsourcing, which is much cheaper than building an in-house labelling team, but you still pay per label. As a last resort, you can use an already-trained model to create labels for you. For example, given a computer vision system that understands emotions from facial expressions, you can snip the audio of what the user is saying at that moment and attach the predicted label to it.
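The facial-expression idea can be sketched as follows. This is a minimal illustration under stated assumptions: the vision model is assumed to exist already and to emit `(start_sec, end_sec, label, confidence)` tuples, and the function name, tuple layout, and the 0.8 confidence threshold are all hypothetical.

```python
def snip_labelled_audio(samples, sample_rate, face_predictions, min_confidence=0.8):
    """Cut audio segments that line up with confident facial-emotion predictions.

    samples: mono audio as a sequence of samples.
    face_predictions: iterable of (start_sec, end_sec, label, confidence),
        assumed to come from a separate, already-trained vision model.
    Returns a list of (audio_segment, label) pairs.
    """
    examples = []
    for start, end, label, confidence in face_predictions:
        # Skip low-confidence or malformed predictions to limit label noise.
        if confidence < min_confidence or end <= start:
            continue
        lo = max(0, int(start * sample_rate))
        hi = min(len(samples), int(end * sample_rate))
        if hi > lo:
            examples.append((samples[lo:hi], label))
    return examples
```

Labels produced this way inherit the mistakes of the vision model, so it is worth spot-checking a sample by hand before training on them.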
Training a Model
Assuming you have enough data, you can train an end-to-end classifier that maps a speech signal to one of the labels. In practice, there is rarely enough data, and the model is likely to overfit the training set and fail to generalize. Things we can try:
- Data augmentation: Add noise, reverberation, speech perturbation, and any other valid audio manipulation technique.
- Multi-task learning: Auxiliary tasks help the network learn useful hidden states. For example, you can also ask the model to transcribe what the user is saying.
- Transfer learning: Using a pretrained feature embedder such as Wav2vec to preprocess the input reduces the number of free parameters that need to be trained.
- Feature engineering: Instead of feeding in raw audio or low-level features such as a spectrogram, use lower-dimensional features such as pitch.
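Two of these ideas are small enough to sketch with the Python standard library alone: adding noise as a form of data augmentation, and extracting pitch as a low-dimensional feature. The function names, the white Gaussian noise model, the autocorrelation-based pitch estimate, and the 50 to 400 Hz search range are all assumptions made for illustration. A production system would more likely rely on an audio library and a more robust pitch tracker.

```python
import math
import random


def add_noise(samples, snr_db, rng=None):
    """Return a copy of `samples` with white Gaussian noise at roughly `snr_db` dB SNR."""
    rng = rng or random.Random()
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def estimate_pitch(samples, sample_rate, fmin=50.0, fmax=400.0):
    """Estimate the fundamental frequency in Hz from the autocorrelation peak.

    Only lags corresponding to frequencies between fmin and fmax are searched
    (an assumed range for speech). Returns None if no positive peak is found.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


# Usage: a synthetic 200 Hz tone stands in for a voiced speech frame.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
augmented = add_noise(tone, snr_db=20, rng=random.Random(0))
pitch_hz = estimate_pitch(augmented, sample_rate)
```

For the synthetic tone, the estimate should land close to 200 Hz even after the noise is added. Real speech is messier, so a single-frame autocorrelation estimate like this one is only a starting point.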