
My partner and I recently developed a chord recognition tool using a neural network for research. For input, we use the results of a pitch class profile.

There are 12 inputs, one for each pitch class, and 5 output nodes. We train the neural network on pairs such as:

For the chord C major, the input is `1 0 0 0 1 0 0 1 0 0 0 0` and the output is `1 0 0 0 0`.
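
Assuming the encoding above, here is a minimal sketch of how such template training pairs could be built. The names (`PITCH_CLASSES`, `template`) are illustrative, not the poster's actual code:

```python
# Build a binary pitch-class template as a training input, paired with
# the 5-node output code from the question. C major = C + E + G.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def template(notes):
    """Return a 12-element 0/1 vector with the given pitch classes set."""
    return [1 if pc in notes else 0 for pc in PITCH_CLASSES]

# (input, output) pair exactly as described above
c_major = (template({"C", "E", "G"}), [1, 0, 0, 0, 0])
print(c_major[0])  # [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
```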

When we test it with c major.wav, the actual input produced by the pitch class profile method looks good: the three basic notes of C major are more dominant than the other notes. But the values are quite small, i.e.:

c:  0.7123345
c#: 0.00024521
d:  0.00013312
d#: 0.009123
e:  0.445023
f:  0.0535852
f#: 0.000312
g:  0.51023
g#: 0.0002312
a:  0.1034
a#: 0.003122
b:  0.000102

If we check this manually, we can see that c, e, and g are dominant, as they should be. But when we run it through the neural network, the result is not what we want. What can we do to improve this?
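
For reference, the "manual check" can be reproduced in a few lines; this is only a sketch using the values reported above:

```python
# Sort the measured pitch class profile and take the three strongest bins.
pcp = {
    "c": 0.7123345, "c#": 0.00024521, "d": 0.00013312, "d#": 0.009123,
    "e": 0.445023, "f": 0.0535852, "f#": 0.000312, "g": 0.51023,
    "g#": 0.0002312, "a": 0.1034, "a#": 0.003122, "b": 0.000102,
}
top3 = sorted(pcp, key=pcp.get, reverse=True)[:3]
print(sorted(top3))  # ['c', 'e', 'g'] -- the C major triad dominates
```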

  • What exactly are the output nodes supposed to represent? Can you show some more training input? What is the actual output of the network (you only say it is not as desired)? Did you implement the neural network yourself or did you use a library? Did the network converge on the training data? What were the other parameters (number of hidden nodes/layers etc.)? My initial thought is that you should probably use more realistic training data (as in pitch profiles from actual files). –  Jan 14 '14 at 16:44
  • Here are some of my training inputs, as input code and output code:

        Chord      Input code (C C# D D# E F F# G G# A A# B)   Output code
        C major    1 0 0 0 1 0 0 1 0 0 0 0                     0 0 0 0 1
        C# major   0 1 0 0 0 1 0 0 1 0 0 0                     0 0 0 1 0
        D major    0 0 1 0 0 0 1 0 0 1 0 0                     0 0 0 1 1
        D# major   0 0 0 1 0 0 0 1 0 0 1 0                     0 0 1 0 0
        E major    0 0 0 0 1 0 0 0 1 0 0 1                     0 0 1 0 1
        F major    1 0 0 0 0 1 0 0 0 1 0 0                     0 0 1 1 0

    The actual output is the output code. I implemented the neural network myself, but I also tried it with Neuroph, and the result is the same. – wendy0402 Jan 14 '14 at 17:24
  • Continued: my parameters are: hidden layers: 1, hidden nodes: 13, learning rate: 0.1, momentum: 0.7, max error: 0.02. If we use realistic training data, does that mean we should do preprocessing first, such as an FFT? If so, there will be more than one frame, which means there will be very many inputs. – wendy0402 Jan 14 '14 at 17:28
  • It would probably work better if you used one output node per possible chord, or one continuous output node where value ranges correspond to chords (see the first sketch after these comments). By realistic training data I mean data such as the C major chord profile you showed above. The network has to be trained on the difficult-to-decide cases, not the obvious ones; for the obvious ones you wouldn't need the network. Many input cases are also better than overtraining on a few uninteresting ones. –  Jan 14 '14 at 17:47
  • The problem is that I get the C major chord profile shown above from a wav file, which means there is a lot of data related to it (because of the frames). How can I use that as training input? – wendy0402 Jan 14 '14 at 18:46
  • Well, you somehow calculated the pitch profile for the file (probably via FFT, as you mentioned), so you can do the same for different files, right? What should the input ultimately be, anyway? I also don't understand what you mean by `frame`. –  Jan 14 '14 at 19:38
  • When we process chord C Major.wav, for example, it goes through an STFT, whose result is stored as an array: each array block contains one frame, and each frame represents the output for a few milliseconds. So a C major chord file three seconds long produces approximately 60 frames. How do we use those frames for training when not all of them show C major output? – wendy0402 Jan 15 '14 at 03:43
  • You go through them and assign the expected chord to each frame manually, one by one. If a frame's chord is undecidable even for you, drop it from the training set (see the second sketch after these comments). –  Jan 15 '14 at 12:23
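
As suggested in the comments, one output node per chord usually trains more easily than the 5-bit binary code above (where C major = `0 0 0 0 1`, C# major = `0 0 0 1 0`, and so on). A minimal sketch of that target encoding, assuming a 12-chord major-only vocabulary (the chord list and names are illustrative assumptions):

```python
# One-hot target encoding: one output node per possible chord.
CHORDS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def one_hot(chord):
    """Target vector with a single 1 at the chord's output node."""
    vec = [0.0] * len(CHORDS)
    vec[CHORDS.index(chord)] = 1.0
    return vec

print(one_hot("C"))  # [1.0, 0.0, 0.0, ...] -- 12 values, one per chord
```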
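And a sketch of the frame-by-frame labelling workflow from the last two comments: compute a pitch class profile per STFT frame, assign a chord by hand, and drop undecidable frames. `pcp_from_frame` stands in for whatever STFT-based PCP computation is already in place; all names here are illustrative assumptions (`one_hot` is the encoder from the previous sketch):

```python
def build_training_set(frames, manual_labels, pcp_from_frame):
    """Pair each frame's pitch class profile with its hand-assigned chord.

    manual_labels[i] is the chord assigned to frame i, or None when even
    a human cannot decide the chord for that frame.
    """
    pairs = []
    for frame, label in zip(frames, manual_labels):
        if label is None:
            continue  # drop undecidable frames, as the last comment suggests
        pairs.append((pcp_from_frame(frame), one_hot(label)))
    return pairs
```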

0 Answers