Automatic Speech Emotion Recognition Using Deep Learning


Automatic speech emotion recognition (ASER) is the task of recognizing the emotional aspects of speech irrespective of its semantic content. While humans perform this task efficiently as a natural part of spoken communication, doing it automatically with programmable devices remains an active subject of research.

Studies of automatic emotion recognition systems aim to create efficient, real-time methods of detecting the emotions of mobile phone users, call center operators and customers, car drivers, pilots, and many other users of human-machine communication. Adding emotions to machines has been recognized as a critical factor in making machines appear and act in a human-like manner. Real-time processing of speech requires a continuously streaming input signal, rapid processing, and a steady output of data within a constrained time, so that each output lags the moment its speech samples were generated by only milliseconds.


As human beings, we find speech to be among the most natural ways to express ourselves. We depend on it so much that we recognize its importance when we resort to other forms of communication, such as emails and text messages, where we often use emojis to express the emotions associated with the message. Because emotions play a vital role in communication, detecting and analyzing them is of great importance in today’s digital world of remote communication. Emotion detection is a challenging task because emotions are subjective.

There is no consensus on how to measure or categorize emotions. We define an ASER system as a collection of methodologies that process and classify speech signals to detect the emotions embedded in them. Such a system can be used in a wide variety of application areas, such as interactive voice-based assistants or caller-agent conversation analysis. In this study, we attempt to detect the underlying emotions in recorded speech by analyzing the acoustic features of the recordings.


AITA Approach is subject to the length of time needed to calculate the feature parameters. While the system training procedure can be time-consuming, it is a one-off task usually performed offline to generate a set of class models. These models can be stored and applied at any time to classify incoming sequences of speech samples. The classification process involves calculating the feature parameters and then inferring emotional class labels from the models. Since inference is usually very fast (on the order of milliseconds), the classification process can run in real time if the feature calculation can be performed in a similarly short time.
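To make the timing argument concrete, here is a minimal sketch in Python of the classification step: feature calculation followed by model-based inference, timed against a per-frame budget. Everything in it is an illustrative assumption rather than our production pipeline: the two toy features, the nearest-centroid "class models", the function names, and the 25 ms budget are all hypothetical stand-ins.

```python
import time

def extract_features(frame):
    # Toy features (assumed for illustration): mean absolute amplitude
    # and the zero-crossing rate of the frame.
    energy = sum(abs(x) for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, crossings / len(frame)]

def classify(features, class_models):
    # Nearest-centroid inference: return the label whose stored
    # centroid is closest (squared Euclidean distance) to the features.
    def squared_distance(label):
        return sum((f - c) ** 2 for f, c in zip(features, class_models[label]))
    return min(class_models, key=squared_distance)

def process_frame(frame, class_models, budget_s):
    # Time feature calculation plus inference and report whether the
    # total stayed within the real-time budget.
    start = time.perf_counter()
    features = extract_features(frame)
    label = classify(features, class_models)
    elapsed = time.perf_counter() - start
    return label, elapsed, elapsed <= budget_s

# Hypothetical class models, standing in for models trained offline.
class_models = {"calm": [0.1, 0.05], "angry": [0.8, 0.3]}
frame = [0.9, -0.8, 0.85, -0.9] * 100
label, elapsed, within_budget = process_frame(frame, class_models, budget_s=0.025)
print(label, f"{elapsed * 1000:.3f} ms", within_budget)
```

In a sketch like this, the inference step is a handful of arithmetic operations, so the feature calculation dominates the elapsed time. That is the step to optimize first when the budget is missed.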

Recent advances in deep learning (DL) technologies for speech and image processing have provided particularly attractive solutions for ASER, since both feature extraction and inference can be performed in real time. Fine-tuned convolutional neural networks (CNNs) have been shown to deliver both high ASER accuracy and inference times short enough for real-time implementation.


The original dataset consists of around 50,000 audio files in .wav format, labeled with categories such as angry, happy, sad, frustration, and disgust. We initially split the data into three sets: training, testing, and validation. From each audio file, we extracted many common features, such as energy, pitch, and formants, along with spectral features such as linear prediction coefficients, Mel-frequency cepstral coefficients, and modulation spectral features. In this work, we selected the modulation spectral features to capture the emotional content.
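As a rough illustration of what a modulation spectral feature measures, the sketch below computes a short-time energy envelope and then the magnitude of that envelope at a few low modulation frequencies. This is a deliberately simplified, single-band version written for this post, not the exact feature extractor we used; the function names, the frame sizes, and the synthetic test tone are all assumptions.

```python
import math

def short_time_energy(signal, frame_len, hop):
    # Mean squared amplitude of each frame: a coarse temporal envelope.
    return [sum(x * x for x in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, hop)]

def modulation_spectrum(envelope, envelope_rate_hz, mod_freqs_hz):
    # Magnitude of the mean-removed envelope at each requested
    # modulation frequency, computed with a direct Fourier sum.
    mean = sum(envelope) / len(envelope)
    centered = [e - mean for e in envelope]
    magnitudes = []
    for f in mod_freqs_hz:
        re = sum(v * math.cos(2 * math.pi * f * n / envelope_rate_hz)
                 for n, v in enumerate(centered))
        im = sum(v * math.sin(2 * math.pi * f * n / envelope_rate_hz)
                 for n, v in enumerate(centered))
        magnitudes.append(math.hypot(re, im) / len(centered))
    return magnitudes

# Synthetic input (assumed): a 1 kHz tone whose amplitude is modulated at 4 Hz.
sample_rate = 8000
signal = [(1 + 0.5 * math.cos(2 * math.pi * 4 * n / sample_rate))
          * math.sin(2 * math.pi * 1000 * n / sample_rate)
          for n in range(2 * sample_rate)]

# 10 ms frames with no overlap give an envelope sampled at 100 Hz.
envelope = short_time_energy(signal, frame_len=80, hop=80)
mod_freqs = [2, 4, 8, 16]
spec = modulation_spectrum(envelope, envelope_rate_hz=100, mod_freqs_hz=mod_freqs)
print(dict(zip(mod_freqs, spec)))
```

For the synthetic tone, the largest value falls at the 4 Hz modulation frequency, which is the rate at which its loudness rises and falls. A fuller implementation would typically compute this per acoustic frequency band rather than on the broadband energy.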

We then rearranged the entire dataset into training and validation sets only. A total of 42,000 audio files were allocated to the training set and 8,000 to the validation set to improve validation accuracy.
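A split of this size can be sketched in a few lines of Python. The file names and the fixed seed below are hypothetical, and the plain random shuffle is an assumption: it does not balance emotion categories or keep speakers separate across the two sets, which a real pipeline may need to do.

```python
import random

def split_train_val(file_paths, n_val, seed=0):
    # Shuffle a copy with a fixed seed so the split is reproducible,
    # then hold out the first n_val files for validation.
    paths = list(file_paths)
    random.Random(seed).shuffle(paths)
    return paths[n_val:], paths[:n_val]

# Hypothetical file names standing in for the roughly 50,000 .wav files.
files = [f"clip_{i:05d}.wav" for i in range(50000)]
train_files, val_files = split_train_val(files, n_val=8000)
print(len(train_files), len(val_files))
```

With 50,000 files and 8,000 held out, this gives the 42,000 / 8,000 allocation described above, or an 84% / 16% split.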




We have demonstrated how to classify different kinds of emotions by training our model on a collection of audio databases such as RAVDESS, TORONTO, SAVEE, and CREMA. We built our model from scratch, which distinguishes it from methods that rely heavily on transfer learning. In the future, we plan to extend the model to train on and predict a few more emotions, in addition to those our existing approach already handles with decent results. This application of deep learning, combined with RPA, has strong business potential: it can automatically drive intelligent workflows in real time on customer service calls based on the emotion of the customer on the call. To learn more, please contact us at
