I am working on a CNN classification problem:
I am using a CNN to classify audio emotion into 6 classes (anger, disgust, fear, happy, sad, neutral) on the EMODB dataset. The input features are Fourier transforms of shape [256*36]. My network has 3-4 convolutional layers, each followed by max-pooling, plus one fully connected layer. But the learning curve shows a large gap between training and validation loss, indicating severe overfitting, and the best validation accuracy I can get is always between 75% and 80%.
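For reference, here is a rough Keras equivalent of the architecture (I actually train in Caffe; filter counts and kernel sizes here are placeholders, not my exact config):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(256, 36, 1)),          # one [256*36] spectrogram patch
    layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),               # max-pooling after each conv
    layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),      # the single fully connected layer
    layers.Dense(6, activation='softmax'),     # 6 emotion classes
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```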
[learning curve] Here is one of the learning curves I got. The black and blue curves are training accuracy and loss, respectively; the other two are validation accuracy and loss. The validation results just don't improve anymore, even when the training loss goes to 0.
I have tried augmenting my dataset, adding 50% dropout to the FC layer, adding L2 regularization to the FC layers, and using a learning rate decay policy (the 'inv' policy in Caffe). But the gap still remains.
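Concretely, the regularization looks roughly like this (again a Keras sketch of my Caffe setup; all values are placeholders):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# L2 weight decay on the fully connected layer, plus 50% dropout after it
fc = layers.Dense(256, activation='relu',
                  kernel_regularizer=regularizers.l2(1e-4))
drop = layers.Dropout(0.5)

# Learning rate decay comparable to Caffe's 'inv' policy,
# lr = base_lr * (1 + gamma * iter)^(-power); this schedule is the power=1 case.
schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=1e-3, decay_steps=1, decay_rate=1e-4)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```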
Is it because my dataset is too small?
I have around 500 audio clips in total, which I extended to around 6,000 samples. But even when I increased the data to 15,000 samples, the gap remained large. Is 15,000 still a small dataset for a CNN?
Could it be that the data augmentation process introduced errors?
My original dataset consists of around 500 audio clips of different lengths, ranging from 1 s to 6 s, so I just randomly extracted samples of 1.2 s from each clip; the longer the clip, the more samples I get (a sketch of this cropping is below). This gives me more than 15,000 samples for training now. I was worried that for long clips a 1.2 s sample loses too much information and may not represent the corresponding emotion. But this is the best method I could come up with, because I cannot use an RNN or HMM to deal with the data for some reason.
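This is roughly the cropping procedure (numpy sketch; the sample rate and the number of crops per second are assumptions, not my exact settings):

```python
import numpy as np

def random_crops(audio, sr=16000, crop_s=1.2, crops_per_second=3, rng=None):
    """Extract random fixed-length windows; longer clips yield more crops."""
    if rng is None:
        rng = np.random.default_rng()
    crop_len = int(crop_s * sr)
    if len(audio) <= crop_len:
        # Clip shorter than the window: zero-pad to the fixed length
        return [np.pad(audio, (0, crop_len - len(audio)))]
    # Number of crops grows with clip duration
    n_crops = max(1, int(crops_per_second * len(audio) / sr))
    starts = rng.integers(0, len(audio) - crop_len, size=n_crops)
    return [audio[s:s + crop_len] for s in starts]
```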
Could my feature computation have gone wrong? (Even though I have checked it several times.) I have also tried MFSC features [120*40], but the two feature sets show a similar overfitting problem.
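In case it helps, this is roughly how I compute both feature sets (librosa sketch; n_fft, hop_length, and n_mels are placeholders and don't necessarily reproduce my exact [256*36] and [120*40] shapes):

```python
import numpy as np
import librosa

def stft_features(audio, sr=16000):
    # Log-magnitude spectrogram from the short-time Fourier transform
    spec = np.abs(librosa.stft(audio, n_fft=512, hop_length=512))
    return np.log(spec + 1e-8)

def mfsc_features(audio, sr=16000):
    # Log mel filterbank energies (MFSC)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=40,
                                         n_fft=512, hop_length=512)
    return librosa.power_to_db(mel)
```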
Or is it that my network is not good enough? I thought more complicated networks would introduce more overfitting, but the simpler ones didn't perform well either.
Even though I can list many possible reasons for the overfitting, I cannot figure out which ones actually influence the performance. Is there any way to tell which part went wrong? Or any advice on reducing the overfitting?
Thank you!