
I am working on a CNN classification problem:
I am using a CNN to classify audio emotions into 6 classes (anger, disgust, fear, happy, sad, neutral) on the EMODB dataset, with Fourier-transform features of shape [256*36] as input. My network has 3-4 convolutional layers, each followed by max-pooling, plus one fully connected layer. However, the learning curve shows a large gap between training and validation loss, indicating severe overfitting, and the best validation accuracy I can get is always between 75% and 80%.
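
Roughly, my network looks like this PyTorch-style sketch (for illustration only; the actual model is in Caffe, and the filter counts and kernel sizes below are placeholders):

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Sketch: 3 conv blocks (conv -> ReLU -> max-pool) + 1 fully connected layer.
    Input is a 1 x 256 x 36 feature map; 6 emotion classes.
    Filter counts and kernel sizes are placeholders, not the original Caffe config."""
    def __init__(self, n_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 256x36 -> 128x18
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 128x18 -> 64x9
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 64x9 -> 32x4
        )
        self.classifier = nn.Linear(64 * 32 * 4, n_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)
```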

[learning curve figure] Here is one of the learning curves I got. The black and blue curves are training accuracy and loss, respectively; the other two are validation accuracy and loss. The validation results just don't improve anymore, even when the training loss goes to 0.

I have tried augmenting my dataset, adding 50% dropout to the FC layer, adding L2 regularization to the FC layers, and using a learning-rate decay policy (the 'inv' policy in Caffe), but the gap still remains.
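
Concretely, those measures correspond to something like this in PyTorch terms (my actual setup is in Caffe; the gamma, power, and weight-decay values below are placeholders). Caffe's 'inv' policy computes lr = base_lr * (1 + gamma * iter)^(-power):

```python
import torch

# Assumes `model` is the network sketched above; all hyperparameter values are placeholders.
base_lr, gamma, power = 0.01, 0.0001, 0.75

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=base_lr,
    momentum=0.9,
    weight_decay=1e-4,   # L2 on all weights; restricting it to the FC layer needs parameter groups
)

# Caffe's 'inv' policy: lr = base_lr * (1 + gamma * iter)^(-power), stepped per iteration
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 + gamma * it) ** (-power)
)

# The 50% dropout sits in the model itself, e.g. nn.Dropout(p=0.5) before the FC layer.
```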

Is it because my dataset is too small?
I have around 500 audio recordings in total, which I extended to around 6,000 samples. But even when I increased the data to 15,000 samples, the gap was still large. Is 15,000 still a small dataset for a CNN?

Could it be that the data augmentation process introduced errors?
My original dataset consists of around 500 recordings of different lengths, ranging from 1 s to 6 s, so I just randomly extracted 1.2 s samples from each one. The longer the recording, the more samples I get, and I now have more than 15,000 samples for training. I was thinking that for long recordings, a 1.2 s sample loses too much information and may not represent the corresponding emotion well. But this is the best method I could come up with, because I cannot use an RNN or HMM to deal with the data for some reason.
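
Roughly, the cropping works like this numpy sketch (the 16 kHz sample rate and the crops-per-second heuristic are placeholders, not my exact values):

```python
import numpy as np

def random_crops(audio, sr=16000, crop_sec=1.2, crops_per_sec=3, rng=None):
    """Randomly extract fixed-length crops from one recording.
    Longer recordings yield more crops. sr, crop_sec, and crops_per_sec are placeholders."""
    if rng is None:
        rng = np.random.default_rng()
    crop_len = int(crop_sec * sr)
    if len(audio) <= crop_len:
        # Recording shorter than the crop: zero-pad to the target length
        return [np.pad(audio, (0, crop_len - len(audio)))]
    n_crops = max(1, int(crops_per_sec * (len(audio) - crop_len) / sr))
    starts = rng.integers(0, len(audio) - crop_len, size=n_crops)
    return [audio[s:s + crop_len] for s in starts]
```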

Could my feature computation have gone wrong (even though I have checked it several times)? I have also tried MFSC features [120*40], but both feature sets show a similar overfitting problem.
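
For reference, the two feature types can be computed along these lines with librosa (a sketch, not my exact pipeline; the FFT size, hop length, and mel band count are placeholders chosen to roughly match the shapes above):

```python
import numpy as np
import librosa

def spectrogram_features(clip, sr=16000):
    # Magnitude STFT; n_fft=510 gives 256 frequency bins. Hop length is a placeholder.
    stft = librosa.stft(y=clip, n_fft=510, hop_length=512)
    return np.abs(stft)                      # shape: (256, n_frames)

def mfsc_features(clip, sr=16000):
    # Log mel (MFSC) features with 40 mel bands; other parameters are placeholders.
    mel = librosa.feature.melspectrogram(y=clip, sr=sr, n_fft=512,
                                         hop_length=160, n_mels=40)
    return librosa.power_to_db(mel)          # shape: (40, n_frames)
```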

Or is it because my network is not good enough? I thought more complicated networks would introduce more overfitting, but the simpler ones didn't show good performance either.

Even though I have listed many possible reasons for the overfitting, I cannot figure out which of them really affects the performance. Is there any way to find out which part went wrong? Or any advice on reducing the overfitting?

Thank you!

  • You might try using learning curves to diagnose the problem more clearly. https://class.coursera.org/ml-003/lecture/64 – John Yetter Jun 29 '16 at 20:04
  • You described how you generate the training data. What about testing? Is it from separate recordings, also cut into small chunks? What is the relation between the recordings? Do they come from different speakers? Different times? – lejlot Jun 29 '16 at 23:06
  • @JohnYetter Yes, I used learning curves. My x-axis is the number of epochs. The validation accuracy stays almost unchanged at around 0.7 to 0.8, or even goes down after that, and so does the validation loss. I've added a figure to the post. –  Jun 30 '16 at 08:32
  • @lejlot For training and validation, the data are generated in the same way, but from separate recordings. For testing, I used the middle 1.2 s, and the precision is close to the validation precision. I've also tried cropping several 1.2 s samples from one recording and outputting the majority predicted label as the final prediction. The precision improves a bit, but it is still far from good, and this way my test recordings are limited. All the recordings are picked randomly from the same dataset, which contains recordings of different speakers, each one repeating several sentences with different emotions. –  Jun 30 '16 at 08:37
  • Any way to get a learning curve plot comparing loss versus the size of the training set, as opposed to training epochs (see the sketch after these comments)? That is often instructive. You mention that you used regularization; how much did that help? Finally, could you get a person to take your 1.2-second clips and categorize a reasonable sample? 1.2 seconds is not a lot of information, and since it is randomly pulled from a longer clip, it might not contain a clear indication of the emotion. If a human gets it right 80% of the time, you might be up against a limit of the format of the clips. – John Yetter Jun 30 '16 at 15:51
  • Also, have you done error analysis to see which results you are getting wrong answers for and what the incorrect guess is? If you collect the clips that you are getting wrong answers for, can you listen to them and get the correct answer? – John Yetter Jun 30 '16 at 15:56
  • @JohnYetter Before regularization, the validation loss first decreased along with the training loss, then went up and ended very high. Regularization helped it remain almost unchanged rather than going up, made it more stable with smaller variance, and raised the highest accuracy from 65% to 70-75%, but no more than that. I hadn't thought of varying the size of the training set; that is more informative than just looking at the curves for one set. I will try that learning curve. –  Jul 01 '16 at 09:59
  • @JohnYetter Could the 1.2 s issue lead to this kind of performance? The lengths of most recordings (I would say 70%) are spread almost uniformly over 1.5 s to 3 s, so maybe a longer crop would help, like 2 s? Anyway, I will also try checking the wrong answers as well as the clips. Thank you! –  Jul 01 '16 at 10:01
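
For reference, a loss-versus-training-set-size curve like the one suggested in the comments can be generated along these lines (train_and_evaluate is a hypothetical helper that retrains the model on a random subset of the given size and returns the final training and validation losses):

```python
import matplotlib.pyplot as plt

# Placeholder: train_and_evaluate(n) retrains on n random training samples
# and returns (train_loss, val_loss); it is not defined here.
sizes = [1000, 2000, 4000, 8000, 15000]
train_losses, val_losses = [], []
for n in sizes:
    tr, va = train_and_evaluate(n)
    train_losses.append(tr)
    val_losses.append(va)

plt.plot(sizes, train_losses, label="train loss")
plt.plot(sizes, val_losses, label="validation loss")
plt.xlabel("training set size")
plt.ylabel("loss")
plt.legend()
plt.show()
```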

1 Answer


You can try adding some 'dropout' layers in your CNN and see whether it reduces the overfitting. -- Venkat

  • Do you have any evidence supporting your claim? – sg7 Mar 26 '18 at 02:29
  • I have worked and am still working on CNNs for my research, and yes, I have seen dropout layers reduce overfitting in several of my designs. Plus, dropout is a kind of regularization technique. https://datascience.stackexchange.com/questions/22494/convolutional-neural-network-overfitting-dropout-not-helping. Several answers at this link discuss it too. – Venkat Mar 28 '18 at 01:52
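
For illustration, dropout inside the convolutional blocks (not only before the FC layer) could look like the following PyTorch sketch; the filter counts and dropout rates are placeholders, not a tested recommendation:

```python
import torch.nn as nn

# Sketch only: spatial dropout after each conv block in addition to FC dropout.
# Filter counts, dropout rates, and the 1-channel input are placeholders.
conv_blocks = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Dropout2d(p=0.25),   # dropout on whole feature maps after the first block
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Dropout2d(p=0.25),
)
```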