HMM vs Deep Learning for Speech Emotion Recognition (SER)

Question

For building a Speech Emotion Recognition (SER) system, which approach would be better: a Hidden Markov Model (HMM) or a deep learning (RNN-LSTM) approach? I have to build an SER system and I am torn between the two. If there are better models than these two, please let me know.

Siraj's video is about "Speech Recognition" and you are asking about "Speech Emotion Recognition". Knowing what someone said and knowing the meaning of what they said are very different things. Please clarify your post. — Brian O'Donnell, Mar 25 '18 at 14:47
@BrianO'Donnell so I removed that part. My question is completely related to Speech Emotion Recognition. Sorry about that. — Saad, Mar 25 '18 at 20:12

score 3 · Accepted Answer · answered Mar 25 '18 at 20:44


HMM and RNN-LSTM based solutions are not considered highly accurate for SER. I believe the top-ranking approach to date is one based on Deep Retinal Convolution Neural Networks (DRCNNs). See "Speech emotion recognition using Deep Retinal Convolution Neural Networks" by Yafeng Niu, Dongsheng Zou, Yadong Niu, Zhongshi He, and Hua Tan, published in July 2017. The authors report an average accuracy of over 99% on the IEMOCAP, EMO-DB, and SAVEE databases.


Brian O'Donnell


Can you help me break this approach down into simpler steps? What I understood from the paper is that I first have to convert the voice recordings into spectrograms, using the Data Augmentation Algorithm Based on Retinal Imaging Principle (DAARIP), and then feed these into a DCNN. – Saad Mar 26 '18 at 07:18
Do you know how to train AlexNet in general? – Brian O'Donnell Mar 28 '18 at 01:41
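
The first step discussed in the comments above, turning an audio clip into a spectrogram that a CNN can consume as an image, could look roughly like the sketch below. This is a plain log-magnitude spectrogram computed with NumPy only. It is not the DAARIP augmentation from the paper, and the frame length, hop size, and the synthetic test tone are all assumptions chosen for illustration.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Return a (num_frames, frame_len // 2 + 1) log-magnitude spectrogram in dB."""
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        # Zero-pad very short clips so that at least one frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    num_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return 20.0 * np.log10(magnitude + eps)

# A synthetic one-second 440 Hz tone at 16 kHz stands in for a speech clip.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

spec = log_spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 400  # bin width = sample_rate / frame_len
print(spec.shape, peak_hz)
```

With real recordings you would load the waveform from a file instead of generating a tone, and the resulting 2-D array (possibly resized) would be the input image for the network.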

score 0 · Answer 2 · answered Dec 07 '21 at 13:30

In practice, the choice depends on several factors, such as:

What matters most to you: computational complexity (at training or inference time), accuracy, or another measure derived from the confusion matrix?
How accurate are the annotations? (Neural networks require labeled data.)
Are you working on a low-resource language such as Persian or Arabic, or on English, for which a huge amount of labeled data exists?
Do you need to understand exactly what the model is learning?
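
To make the first question concrete, here is a small sketch of two measures that can be read off a confusion matrix: overall accuracy and unweighted average recall (the mean of the per-class recalls, which is often reported in SER because emotion classes tend to be imbalanced). The labels and predictions are invented for illustration, and the code assumes every class occurs at least once in the true labels.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    # Fraction of all samples that lie on the diagonal.
    return np.trace(cm) / cm.sum()

def unweighted_average_recall(cm):
    # Recall per class (diagonal / row sum), averaged with equal class weight.
    per_class_recall = np.diag(cm) / cm.sum(axis=1)
    return per_class_recall.mean()

# Hypothetical labels: 0 = neutral, 1 = angry, 2 = sad.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 2]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
acc = accuracy(cm)                   # 8 of 10 correct -> 0.8
uar = unweighted_average_recall(cm)  # (1.0 + 0.5 + 0.5) / 3
print(cm)
print(acc, uar)
```

In this toy case the accuracy looks better than the unweighted average recall because the majority class is always predicted correctly, which is exactly the situation where the choice of measure matters.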

If interpretability is not a concern, you can go with a poorly understood structure such as a deep neural network, which can have hundreds of layers and millions of parameters, and may require far more resources (computation and labeled data) to train than a hidden Markov model (HMM).
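
As a rough back-of-the-envelope illustration of the difference in model size, the sketch below counts the stored parameters of a small Gaussian HMM and of a single LSTM layer. The sizes (5 states, 39-dimensional features, 128 hidden units) are assumptions picked for the example, the HMM is assumed to use diagonal covariances, and the LSTM count assumes one bias vector per gate (some frameworks store two, which adds slightly more).

```python
def gaussian_hmm_param_count(num_states, feature_dim):
    """Stored parameters of an HMM with diagonal-covariance Gaussian emissions."""
    start = num_states                      # initial state probabilities
    transitions = num_states * num_states   # state transition matrix
    means = num_states * feature_dim        # one mean vector per state
    variances = num_states * feature_dim    # one variance vector per state
    return start + transitions + means + variances

def lstm_layer_param_count(hidden_size, feature_dim):
    """Parameters of one LSTM layer with a single bias vector per gate."""
    # Each of the four gates has input weights, recurrent weights, and a bias.
    per_gate = hidden_size * (feature_dim + hidden_size) + hidden_size
    return 4 * per_gate

hmm_params = gaussian_hmm_param_count(num_states=5, feature_dim=39)
lstm_params = lstm_layer_param_count(hidden_size=128, feature_dim=39)
print(hmm_params, lstm_params)  # 420 86016
```

Even this single, modest LSTM layer has about two hundred times as many parameters as the HMM, and practical deep models stack several such layers.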

HMMs were formerly considered state of the art, but deep learning is now generally far more accurate.

To sum up: HMMs are simpler to understand and use. Deep learning can take longer to train, but the results tend to be more promising.
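
To show how little machinery the HMM route needs, here is a minimal sketch of the classic setup: one HMM per emotion, each utterance scored with the forward algorithm, and the emotion whose model gives the highest likelihood wins. Everything specific here is an assumption for illustration: the two emotions, the hand-picked (not trained) probabilities, and the idea that each frame has been quantized to symbol 0 ("low pitch") or 1 ("high pitch").

```python
import numpy as np

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm for a discrete-observation HMM."""
    alpha = start * emit[:, obs[0]]
    scale = alpha.sum()
    log_prob = np.log(scale)
    alpha = alpha / scale
    for symbol in obs[1:]:
        # Propagate one step through the transitions, then weight by emission.
        alpha = (alpha @ trans) * emit[:, symbol]
        scale = alpha.sum()
        log_prob += np.log(scale)
        alpha = alpha / scale  # rescale to avoid numerical underflow
    return log_prob

start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])

# Hypothetical per-emotion emission tables (rows: states, columns: symbols).
models = {
    "calm":    np.array([[0.9, 0.1], [0.7, 0.3]]),
    "excited": np.array([[0.2, 0.8], [0.4, 0.6]]),
}

def classify(obs):
    scores = {name: log_likelihood(obs, start, trans, emit)
              for name, emit in models.items()}
    return max(scores, key=scores.get)

print(classify([1, 1, 0, 1, 1, 1]))  # excited
print(classify([0, 0, 0, 0, 1, 0]))  # calm
```

A real system would learn the parameters from labeled recordings (for example with Baum-Welch) and would usually use continuous emissions over acoustic features rather than two hand-made symbols.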
