I want to classify speech data into four emotions (angry, sad, happy, neutral).
The problem is that when I run my RNN code, all of the speech data is classified into a single class.
(For example, every sample is classified as "angry" all the time.)
I don't know what causes this problem or what I have to change in the training.
Here is my TensorFlow code for building the RNN and computing the cost and accuracy:
import tensorflow as tf
from tensorflow.contrib import rnn

def RNN(x, weights, biases, lstm_size):
    # Stack lstm_size LSTM cells; each cell uses a sigmoid activation.
    lstm_cell = []
    for i in range(lstm_size):
        lstm_cell.append(rnn.BasicLSTMCell(hidden_dim, forget_bias=1.0,
                                           state_is_tuple=True,
                                           activation=tf.nn.sigmoid))
    stacked_lstm = tf.contrib.rnn.MultiRNNCell(lstm_cell, state_is_tuple=True)
    outputs, states = tf.nn.dynamic_rnn(stacked_lstm, x, dtype=tf.float32)
    # Classify from the output at the last time step only.
    foutput = tf.contrib.layers.fully_connected(outputs[:, -1], output_dim,
                                                activation_fn=None)
    return foutput

logits = RNN(X, weights, biases, lstm_size)
prediction = tf.nn.sigmoid(logits)

cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=Y))
learning_rate = tf.train.exponential_decay(learning_rate=initial_learning_rate,
                                           global_step=training_steps,
                                           decay_steps=training_steps / 10,
                                           decay_rate=0.96, staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(cost)

# Accuracy: compare the argmax of the prediction with the argmax of the one-hot label.
pred = tf.argmax(prediction, axis=1)
label = tf.argmax(Y, axis=1)
correct_pred = tf.equal(pred, label)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
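For reference, this is roughly how I run training (simplified; X_train, Y_train, and the batching here are stand-ins for my actual data pipeline, and batch_size is one of my hyperparameters):

# Simplified sketch of my training loop; X_train/Y_train are placeholder
# names for my real feature array and one-hot label array.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(training_steps):
        # Feed one batch of (features, one-hot labels) per step.
        batch_x = X_train[step * batch_size:(step + 1) * batch_size]
        batch_y = Y_train[step * batch_size:(step + 1) * batch_size]
        sess.run(train_op, feed_dict={X: batch_x, Y: batch_y})
        if step % 100 == 0:
            loss, acc = sess.run([cost, accuracy],
                                 feed_dict={X: batch_x, Y: batch_y})
            print("step", step, "loss", loss, "accuracy", acc)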
The input to the RNN is speech features (pitch and MFCC), and the output is a one-hot label (for example, angry = [1, 0, 0, 0]).
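To be concrete, I build the one-hot labels like this (a minimal sketch; emotion_ids is a hypothetical array of integer class indices, 0=angry, 1=sad, 2=happy, 3=neutral):

import numpy as np

# Hypothetical integer labels for four samples.
emotion_ids = np.array([0, 2, 1, 3])
# One row of the 4x4 identity matrix per sample gives the one-hot code,
# e.g. angry (id 0) -> [1., 0., 0., 0.].
Y_onehot = np.eye(4)[emotion_ids]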
Also, I wonder whether it is correct to calculate classification accuracy this way.
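To make that part of the question concrete, here is a tiny toy example of what I expect the argmax-based accuracy lines to compute (made-up numbers, not my real outputs):

import numpy as np

# Made-up sigmoid outputs for two samples.
prediction = np.array([[0.9, 0.1, 0.2, 0.1],   # argmax 0 -> "angry"
                       [0.2, 0.8, 0.3, 0.1]])  # argmax 1 -> "sad"
labels = np.array([[1, 0, 0, 0],               # angry
                   [0, 0, 1, 0]])              # happy
pred = prediction.argmax(axis=1)     # [0, 1]
label = labels.argmax(axis=1)        # [0, 2]
accuracy = (pred == label).mean()    # one of two correct -> 0.5
print(accuracy)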