I want to classify speech data into four emotions (angry, sad, happy, neutral).
The problem is that when I run my RNN code, all of the speech data is classified into a single class.
(For example, every sample is classified as "angry" all the time.)
I don't know what causes this problem or what I have to change in the training.
Here is my TensorFlow code for building the RNN and computing the cost and accuracy:
import tensorflow as tf
from tensorflow.contrib import rnn

def RNN(x, weights, biases, lstm_size):
    # Stack lstm_size LSTM cells; each cell uses a sigmoid activation.
    lstm_cell = []
    for i in range(lstm_size):
        lstm_cell.append(rnn.BasicLSTMCell(hidden_dim, forget_bias=1.0,
                                           state_is_tuple=True,
                                           activation=tf.nn.sigmoid))
    stacked_lstm = tf.contrib.rnn.MultiRNNCell(lstm_cell, state_is_tuple=True)
    outputs, states = tf.nn.dynamic_rnn(stacked_lstm, x, dtype=tf.float32)
    # Classify from the output at the last time step only.
    foutput = tf.contrib.layers.fully_connected(outputs[:, -1], output_dim,
                                                activation_fn=None)
    return foutput

logits = RNN(X, weights, biases, lstm_size)
prediction = tf.nn.sigmoid(logits)

cost = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=Y))
learning_rate = tf.train.exponential_decay(learning_rate=initial_learning_rate,
                                           global_step=training_steps,
                                           decay_steps=training_steps / 10,
                                           decay_rate=0.96, staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(cost)

# Accuracy: compare the argmax of the prediction with the argmax of the one-hot label.
pred = tf.argmax(prediction, axis=1)
label = tf.argmax(Y, axis=1)
correct_pred = tf.equal(pred, label)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
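For reference, this is roughly how I run training (simplified; X_train, Y_train, and the batching here are stand-ins for my actual data pipeline, and batch_size is one of my hyperparameters):

# Simplified sketch of my training loop; X_train/Y_train are placeholder
# names for my real feature array and one-hot label array.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(training_steps):
        # Feed one batch of (features, one-hot labels) per step.
        batch_x = X_train[step * batch_size:(step + 1) * batch_size]
        batch_y = Y_train[step * batch_size:(step + 1) * batch_size]
        sess.run(train_op, feed_dict={X: batch_x, Y: batch_y})
        if step % 100 == 0:
            loss, acc = sess.run([cost, accuracy],
                                 feed_dict={X: batch_x, Y: batch_y})
            print("step", step, "loss", loss, "accuracy", acc)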
The input to the RNN is speech features (pitch and MFCC), and the output is a one-hot label (for example, angry = [1, 0, 0, 0]).
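To be concrete, I build the one-hot labels like this (a minimal sketch; emotion_ids is a hypothetical array of integer class indices, 0=angry, 1=sad, 2=happy, 3=neutral):

import numpy as np

# Hypothetical integer labels for four samples.
emotion_ids = np.array([0, 2, 1, 3])
# One row of the 4x4 identity matrix per sample gives the one-hot code,
# e.g. angry (id 0) -> [1., 0., 0., 0.].
Y_onehot = np.eye(4)[emotion_ids]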
Also, I wonder whether it is correct to calculate classification accuracy this way.
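To make that part of the question concrete, here is a tiny toy example of what I expect the argmax-based accuracy lines to compute (made-up numbers, not my real outputs):

import numpy as np

# Made-up sigmoid outputs for two samples.
prediction = np.array([[0.9, 0.1, 0.2, 0.1],   # argmax 0 -> "angry"
                       [0.2, 0.8, 0.3, 0.1]])  # argmax 1 -> "sad"
labels = np.array([[1, 0, 0, 0],               # angry
                   [0, 0, 1, 0]])              # happy
pred = prediction.argmax(axis=1)     # [0, 1]
label = labels.argmax(axis=1)        # [0, 2]
accuracy = (pred == label).mean()    # one of two correct -> 0.5
print(accuracy)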