
I am trying to classify speech audio signals by emotion. For this purpose I extract MFCC features from the audio signal and feed them into a simple neural network (a FeedForwardNetwork trained with BackpropTrainer from PyBrain). Unfortunately the results are very bad: out of the 5 classes, the network almost always predicts the same one.

I have 5 classes of emotions and around 7000 labeled audio files, which I split so that 80% of each class is used to train the network and 20% to test it.
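A minimal sketch of that per-class 80/20 split (the `files_by_class` dict is hypothetical, just a stand-in for my labeled file lists):

import random

def split_per_class(files_by_class, train_fraction=0.8):
    # files_by_class: dict mapping emotion label -> list of audio file paths (placeholder)
    train, test = {}, {}
    for label, files in files_by_class.items():
        files = list(files)
        random.shuffle(files)                   # avoid any ordering bias
        cut = int(len(files) * train_fraction)  # 80% of this class
        train[label] = files[:cut]
        test[label] = files[cut:]
    return train, test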

The idea is to use small windows and extract MFCC features from each of them, which generates a lot of training examples. For evaluation, all windows from one file are classified and a majority vote decides the predicted label.
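The majority vote is nothing fancy, roughly this (assuming `predictions` holds the class indices the network returned for all windows of one file):

from collections import Counter

def majority_vote(predictions):
    # predictions: predicted class index for each window of one file
    counts = Counter(predictions)
    label, _ = counts.most_common(1)[0]  # most frequent class wins
    return label

# e.g. majority_vote([3, 3, 4, 3, 0]) -> 3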

Training examples per class: 
{0: 81310, 1: 60809, 2: 58262, 3: 105907, 4: 73182}

Example of scaled MFCC features:
[ -6.03465056e-01   8.28665733e-01  -7.25728303e-01   2.88611116e-05
1.18677218e-02  -1.65316583e-01   5.67322809e-01  -4.92335095e-01   
3.29816126e-01  -2.52946780e-01  -2.26147779e-01   5.27210979e-01   
-7.36851560e-01]

Layers________________________:  13 20 5 (also tried 13 50 5 and 13 100 5)
Learning Rate_________________:  0.01 (also tried 0.1 and 0.3)
Training epochs_______________:  10  (error rate does not improve at all during training)
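For reference, the network setup in PyBrain looks roughly like this (a sketch, not my full code; `trndata` stands for a ClassificationDataSet built from the scaled MFCC windows):

from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure.modules import SoftmaxLayer
from pybrain.supervised.trainers import BackpropTrainer

# trndata: PyBrain ClassificationDataSet with the scaled MFCC windows (built elsewhere)
# 13 MFCC inputs, 20 hidden units, 5 emotion classes
net = buildNetwork(13, 20, 5, outclass=SoftmaxLayer)
trainer = BackpropTrainer(net, dataset=trndata, learningrate=0.01)
trainer.trainEpochs(10)  # the error rate barely changes over these epochs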

Confusion matrix on the test set (rows = true class, columns = predicted class):
[[   0.    4.    0.  239.   99.]
 [   0.   41.    0.  157.   23.]
 [   0.   18.    0.  173.   18.]
 [   0.   12.    0.  299.   59.]
 [   0.    0.    0.   85.  132.]]

Success rate overall [%]:  34.7314201619
Success rate Class 0 [%]:  0.0
Success rate Class 1 [%]:  18.5520361991
Success rate Class 2 [%]:  0.0
Success rate Class 3 [%]:  80.8108108108
Success rate Class 4 [%]:  60.8294930876
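The success rates are simply the diagonal of the confusion matrix divided by the row sums (i.e. recall per true class), computed roughly like this:

import numpy

confusion = numpy.array([[  0.,   4.,   0., 239.,  99.],
                         [  0.,  41.,   0., 157.,  23.],
                         [  0.,  18.,   0., 173.,  18.],
                         [  0.,  12.,   0., 299.,  59.],
                         [  0.,   0.,   0.,  85., 132.]])

overall = numpy.trace(confusion) / confusion.sum() * 100          # ~34.7
per_class = numpy.diag(confusion) / confusion.sum(axis=1) * 100   # one rate per true class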

As you can see, the distribution of predictions over the classes is very bad. Classes 0 and 2 are never predicted. I assume this hints at a problem with either my network or, more probably, my data.

I could post a lot of code here, but I think it makes more sense to show all the steps I take to get to the MFCC features in the following image. Please be aware that I use the whole signal without windowing, just for illustration. Does this look OK? The MFCC values are very large; shouldn't they be much smaller? (I scale them down before feeding them into the network with a MinMaxScaler over all the data to [-2, 2]; I also tried [0, 1].)
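The scaling step looks roughly like this (a sketch using sklearn's MinMaxScaler; `train_mfccs` and `test_mfccs` are placeholder names for (n_windows, 13) arrays of raw MFCCs):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-2, 2))  # also tried (0, 1)
scaler.fit(train_mfccs)                       # learn the per-feature min/max
train_scaled = scaler.transform(train_mfccs)
test_scaled = scaler.transform(test_mfccs)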

[Image: steps from signal to MFCC]

This is the code I use for the Mel filter bank, which I apply directly before a discrete cosine transform to extract the MFCC features (I got it from another Stack Overflow answer):

import math
import numpy

def freqToMel(freq):
  '''
  Convert a frequency in Hz to the Mel scale
  '''
  return 1127.01048 * math.log(1 + freq / 700.0)

def melToFreq(mel):
  '''
  Convert a Mel-scale value back to a frequency in Hz
  '''
  return 700.0 * (math.exp(mel / 1127.01048) - 1)

def melFilterBank(blockSize):
  # mfccFeatures, maxHz and minHz are globals defined elsewhere in my script
  numBands = int(mfccFeatures)
  maxMel = int(freqToMel(maxHz))
  minMel = int(freqToMel(minHz))

  # Create a matrix for triangular filters, one row per filter
  filterMatrix = numpy.zeros((numBands, blockSize))

  melRange = numpy.array(xrange(numBands + 2))

  melCenterFilters = melRange * (maxMel - minMel) / (numBands + 1) + minMel

  # each array index represents the center of a triangular filter
  aux = numpy.log(1 + 1000.0 / 700.0) / 1000.0
  aux = (numpy.exp(melCenterFilters * aux) - 1) / 22050
  aux = 0.5 + 700 * blockSize * aux
  aux = numpy.floor(aux)  # Round down
  centerIndex = numpy.array(aux, int)  # Get int values

  for i in xrange(numBands):
    start, centre, end = centerIndex[i:i + 3]
    k1 = numpy.float32(centre - start)
    k2 = numpy.float32(end - centre)
    up = (numpy.array(xrange(start, centre)) - start) / k1
    down = (end - numpy.array(xrange(centre, end))) / k2

    filterMatrix[i][start:centre] = up
    filterMatrix[i][centre:end] = down

  return filterMatrix.transpose()
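For illustration, this is roughly how the filter bank output then turns into MFCCs (a sketch; `powerSpectrum` stands for the power spectrum of one window, and scipy's dct is used here for the DCT step as an assumption, not necessarily what my actual pipeline calls):

from scipy.fftpack import dct

def mfccFromSpectrum(powerSpectrum, blockSize, numCoeffs=13):
  # powerSpectrum: power spectrum of one window, length blockSize (placeholder input)
  filterBank = melFilterBank(blockSize)               # shape (blockSize, numBands)
  melEnergies = numpy.dot(powerSpectrum, filterBank)  # energy per Mel band
  logEnergies = numpy.log(melEnergies + 1e-10)        # log compression, offset avoids log(0)
  return dct(logEnergies, type=2, norm='ortho')[:numCoeffs]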

What can I do to get a better prediction result?

  • you'll probably have better luck over on dsp.stackexchange.com – jaket Aug 31 '15 at 05:38
  • can I just change this somehow? I think I might also try stats.stackexchange.com in regard to the neural network... – cowhi Aug 31 '15 at 10:14
  • Actually, just tell me how you are testing them; then we can find the error you are making or any problem with your code. What classifier you are using also matters. – user7289160 Dec 13 '16 at 05:56
  • I am sorry, this is already so much in the past that I don't even have the code anymore... It was just a small project for a class I was taking. The predictions wouldn't work, so I concluded that the MFCC features alone are not enough for the classification in this case. But thx for trying to help! :) – cowhi Dec 14 '16 at 10:46

1 Answer


Here I made up an example of sex identification from speech. I used the Hyke dataset1 for this example. It's just a quickly made example; if one wanted to do serious sex identification, one could probably do much better. But in general the error rate decreases:

Build up data...
Train network...
Number of training patterns:  94956
Number of test patterns:      31651
Input and output dimensions:  13 2
Train network...
epoch:    0   train error: 62.24%   test error: 61.84%
epoch:    1   train error: 34.11%   test error: 34.25%
epoch:    2   train error: 31.11%   test error: 31.20%
epoch:    3   train error: 30.34%   test error: 30.22%
epoch:    4   train error: 30.76%   test error: 30.75%
epoch:    5   train error: 30.65%   test error: 30.72%
epoch:    6   train error: 30.81%   test error: 30.79%
epoch:    7   train error: 29.38%   test error: 29.45%
epoch:    8   train error: 31.92%   test error: 31.92%
epoch:    9   train error: 29.14%   test error: 29.23%

I used the MFCC implementation from scikits.talkbox. Maybe the code below helps you. (Sex identification is surely a much easier task than emotion detection... Maybe you need more and different features.)

import glob

from scipy.io.wavfile import read as wavread
from scikits.talkbox.features import mfcc

from pybrain.datasets            import ClassificationDataSet
from pybrain.utilities           import percentError
from pybrain.tools.shortcuts     import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules   import SoftmaxLayer

def report_error(trainer, trndata, tstdata):
    trnresult = percentError(trainer.testOnClassData(), trndata['class'])
    tstresult = percentError(trainer.testOnClassData(dataset=tstdata), tstdata['class'])
    print "epoch: %4d" % trainer.totalepochs, "  train error: %5.2f%%" % trnresult, "  test error: %5.2f%%" % tstresult  

def main(audio_path, coeffs=13):
    dataset = ClassificationDataSet(coeffs, 1, nb_classes=2, class_labels=['male', 'female'])
    male_files = glob.glob("%s/male_audio/*/*_1.wav" % audio_path)
    female_files = glob.glob("%s/female_audio/*/*_1.wav" % audio_path)

    print "Build up data..."
    for sex, files in enumerate([male_files, female_files]):
        for f in files:
            sr, signal = wavread(f)
            ceps, mspec, spec = mfcc(signal, nwin=2048, nfft=2048, fs=sr, nceps=coeffs)
            for i in range(ceps.shape[0]):
                dataset.appendLinked(ceps[i], [sex])

    tstdata, trndata = dataset.splitWithProportion(0.25)
    trndata._convertToOneOfMany()
    tstdata._convertToOneOfMany()

    print "Number of training patterns: ", len(trndata)
    print "Number of test patterns:     ", len(tstdata)
    print "Input and output dimensions: ", trndata.indim, trndata.outdim

    print "Train network..."
    fnn = buildNetwork(coeffs, int(coeffs*1.5), 2, outclass=SoftmaxLayer, fast=True)
    trainer = BackpropTrainer(fnn, dataset=trndata, learningrate=0.005)

    report_error(trainer, trndata, tstdata)
    for i in range(100):
        trainer.trainEpochs(1)
        report_error(trainer, trndata, tstdata)

if __name__ == '__main__':
    main("/path/to/hyke/audio_data")


1 Azarias Reda, Saurabh Panjwani and Edward Cutrell: Hyke: A Low-cost Remote Attendance Tracking System for Developing Regions, The 5th ACM Workshop on Networked Systems for Developing Regions (NSDR).
Frank Zalkow
  • I switched to talkbox. The MFCC values look much better now, but I still get strange results. When I start training the network, the error stays more or less at the same value; it goes down a little and up a little, not in one direction. I don't get much better than 66% after 10 epochs, from something like 69% at the beginning. The results on the test set look a little better, but still most of the predictions are in one class. – cowhi Aug 31 '15 at 20:42
    Hard to say from an outsider's perspective. Some thoughts (you may have considered them yourself already...): **1.** For emotion detection your window size should be sufficiently large (1024 samples may be too short for emotions to reveal themselves). **2.** Maybe you should add some energy- and pitch-related audio features. **3.** And of course tweak the network parameters (number of hidden units, learning rate, possibly momentum). **4.** If all that doesn't work, read some papers about emotion detection. – Frank Zalkow Sep 01 '15 at 07:09
  • Yes, you are right, I tried most of these things: different window sizes and network parameters. I also discarded the first feature to see if that improves something, and tried to standardize and normalize the features. Nothing helped much. The only thing I didn't do is add more features. But if I am doing everything right, and I now think I am, then this means that the MFCC features alone are just not suited for this kind of classification. That is a result I can live with and which is sufficient for my report. Thanks for taking the time to think about this and for the feedback! – cowhi Sep 01 '15 at 13:38
  • You're welcome! What kind of dataset do you use? Is it publicly available? – Frank Zalkow Sep 01 '15 at 17:43
  • I am using the IEMOCAP dataset (http://sail.usc.edu/iemocap/index.html). It's not public, but I asked and they sent me a download link. I talked to my professor and he thinks that the subset of emotions I am using is probably too closely related (like sad and frustrated) and that I am using too narrow a window for the MFCCs. Haven't verified this yet, though. – cowhi Sep 02 '15 at 17:06