I am trying to classify speech audio signals by emotion. For this purpose I extract MFCC features from the audio signal and feed them into a simple neural network (a FeedForwardNetwork trained with BackpropTrainer from PyBrain). Unfortunately the results are very bad: out of the 5 classes, the network almost always comes up with the same class as a result.
I have 5 emotion classes and around 7000 labeled audio files, which I split so that 80% of each class is used to train the network and 20% to test it.
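In simplified form, the split looks like this (a minimal sketch, not my exact code; files as a list of (path, label) pairs is a made-up name):

import random
from collections import defaultdict

def splitPerClass(files, trainFraction=0.8, seed=42):
    '''Split (path, label) pairs so each class contributes 80% train / 20% test.'''
    byClass = defaultdict(list)
    for path, label in files:
        byClass[label].append(path)
    train, test = [], []
    rng = random.Random(seed)
    for label, paths in byClass.items():
        rng.shuffle(paths)
        cut = int(len(paths) * trainFraction)
        train += [(p, label) for p in paths[:cut]]
        test += [(p, label) for p in paths[cut:]]
    return train, test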
The idea is to use small windows and extract the MFCC features from each of them to generate a lot of training examples. For the evaluation, all windows from one file are classified and a majority vote decides the predicted label.
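Roughly like this (again a simplified sketch; classify stands for a forward pass through the trained network plus argmax):

from collections import Counter

def extractWindows(signal, windowSize, hopSize):
    '''Cut the signal into fixed-size windows; each one yields a training example.'''
    return [signal[i:i + windowSize]
            for i in range(0, len(signal) - windowSize + 1, hopSize)]

def predictFile(windows, classify):
    '''Classify every window of a file, then take a majority vote over the labels.'''
    votes = Counter(classify(w) for w in windows)
    return votes.most_common(1)[0][0]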
Training examples per class:
{0: 81310, 1: 60809, 2: 58262, 3: 105907, 4: 73182}
Example of a scaled MFCC feature vector:
[ -6.03465056e-01   8.28665733e-01  -7.25728303e-01   2.88611116e-05
   1.18677218e-02  -1.65316583e-01   5.67322809e-01  -4.92335095e-01
   3.29816126e-01  -2.52946780e-01  -2.26147779e-01   5.27210979e-01
  -7.36851560e-01]
Layers_________: 13 20 5 (also tried 13 50 5 and 13 100 5)
Learning Rate__: 0.01 (also tried 0.1 and 0.3)
Training epochs: 10 (the error rate does not improve at all during training)
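In code, the setup is essentially this (a condensed sketch, not my exact code; trainingExamples is a placeholder for my (feature vector, class index) pairs, and the SoftmaxLayer output is one reasonable choice for classification):

from pybrain.datasets import ClassificationDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure import SoftmaxLayer
from pybrain.supervised.trainers import BackpropTrainer

# 13 MFCC inputs, 20 hidden units, 5 emotion classes
net = buildNetwork(13, 20, 5, outclass=SoftmaxLayer)

ds = ClassificationDataSet(13, nb_classes=5)
for features, label in trainingExamples:
    ds.addSample(features, [label])
ds._convertToOneOfMany()                 # one output unit per class

trainer = BackpropTrainer(net, dataset=ds, learningrate=0.01)
for epoch in range(10):
    print(trainer.train())               # returns the training error for the epoch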
Confusion matrix on the test set (rows: true class, columns: predicted class):
[[ 0. 4. 0. 239. 99.]
[ 0. 41. 0. 157. 23.]
[ 0. 18. 0. 173. 18.]
[ 0. 12. 0. 299. 59.]
[ 0. 0. 0. 85. 132.]]
Success rate overall [%]: 34.7314201619
Success rate Class 0 [%]: 0.0
Success rate Class 1 [%]: 18.5520361991
Success rate Class 2 [%]: 0.0
Success rate Class 3 [%]: 80.8108108108
Success rate Class 4 [%]: 60.8294930876
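For reference, the success rates above follow directly from that matrix (per class: diagonal entry over row sum; overall: trace over total):

import numpy

confusion = numpy.array([[  0.,   4.,   0., 239.,  99.],
                         [  0.,  41.,   0., 157.,  23.],
                         [  0.,  18.,   0., 173.,  18.],
                         [  0.,  12.,   0., 299.,  59.],
                         [  0.,   0.,   0.,  85., 132.]])

# overall success rate: correct predictions (diagonal) over all test files
overall = numpy.trace(confusion) / confusion.sum() * 100          # 34.73...

# per-class success rate: diagonal entry over the row sum (row = true class)
perClass = numpy.diag(confusion) / confusion.sum(axis=1) * 100    # [0, 18.55, 0, 80.81, 60.83]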
OK, as you can see, the distribution of the results over the classes is very bad. Classes 0 and 2 are never predicted. I assume this hints at a problem with either my network or, more probably, my data.
I could post a lot of code here, but I think it makes more sense to show all the steps I take to get to the MFCC features in the following image. Please be aware that I use the whole signal without windowing, just for illustration. Does this look OK? The MFCC values are very large; shouldn't they be much smaller? (I scale them down to [-2, 2] before feeding them into the network, with a min-max scaler fitted over all the data; I also tried [0, 1].)
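The scaling step itself is just this (assuming scikit-learn's MinMaxScaler; allMfccVectors is a placeholder for the n_windows x 13 feature matrix):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-2, 2))     # also tried (0, 1)
scaled = scaler.fit_transform(allMfccVectors)    # fitted over all data, column-wise per coefficient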
This is the code I use for the Mel filter bank, which I apply directly before a discrete cosine transform to extract the MFCC features (I got it from here: stackoverflow):
import math
import numpy

# mfccFeatures, minHz and maxHz are module-level parameters set elsewhere

def freqToMel(freq):
    '''
    Calculate the Mel frequency for a given frequency
    '''
    return 1127.01048 * math.log(1 + freq / 700.0)

def melToFreq(mel):
    '''
    Calculate the frequency for a given Mel frequency
    '''
    return 700 * (math.exp(mel / 1127.01048) - 1)

def melFilterBank(blockSize):
    numBands = int(mfccFeatures)
    maxMel = int(freqToMel(maxHz))
    minMel = int(freqToMel(minHz))

    # Create a matrix for triangular filters, one row per filter
    filterMatrix = numpy.zeros((numBands, blockSize))

    melRange = numpy.array(xrange(numBands + 2))
    melCenterFilters = melRange * (maxMel - minMel) / (numBands + 1) + minMel

    # each array index represents the center of one triangular filter;
    # convert the Mel centers back to Hz, then to spectrum bin indices
    # (the hard-coded 22050 ties this to a fixed sample rate)
    aux = numpy.log(1 + 1000.0 / 700.0) / 1000.0
    aux = (numpy.exp(melCenterFilters * aux) - 1) / 22050
    aux = 0.5 + 700 * blockSize * aux
    aux = numpy.floor(aux)  # round down
    centerIndex = numpy.array(aux, int)  # get int values

    for i in xrange(numBands):
        start, centre, end = centerIndex[i:i + 3]
        k1 = numpy.float32(centre - start)
        k2 = numpy.float32(end - centre)
        up = (numpy.array(xrange(start, centre)) - start) / k1
        down = (end - numpy.array(xrange(centre, end))) / k2

        filterMatrix[i][start:centre] = up
        filterMatrix[i][centre:end] = down

    return filterMatrix.transpose()
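For completeness, I apply the filter bank roughly like this before the DCT (simplified sketch; how blockSize maps to FFT bins depends on the sample rate, which the filter code above hard-codes as 22050):

import numpy
from scipy.fftpack import dct

def mfcc(window):
    '''Sketch: power spectrum -> Mel filter bank -> log -> DCT.'''
    spectrum = numpy.abs(numpy.fft.rfft(window)) ** 2     # power spectrum of the window
    filterMatrix = melFilterBank(len(spectrum))           # shape: (blockSize, numBands)
    melEnergies = numpy.dot(spectrum, filterMatrix)       # energy per Mel band
    return dct(numpy.log(melEnergies + 1e-10), norm='ortho')  # + 1e-10 guards against log(0)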
What can I do to get a better prediction result?