
I am working on a convolutional neural network that takes audio spectrograms as input to discriminate between music and speech, using the GTZAN dataset.

If the individual samples are shorter, I get more samples overall. But if the samples are too short, they may lack important features.

How much data is needed for recognizing if a piece of audio is music or speech?

How long should the audio samples be ideally?

Androbin

2 Answers


The ideal audio length depends on a number of factors.

The basic idea is to capture just enough of the signal.
Since audio changes constantly, it is preferable to work on shorter segments. However, a very small frame may contain too few features to be useful, or none at all.

On the other hand, a very long sample captures too many features, adding complexity. In most use cases an audio length of around 25 seconds works well, but this is not a hard rule and you may adjust it as needed. Just make sure the frame is neither very short nor very long.
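As a minimal sketch of the tradeoff described above: splitting each clip into shorter fixed-length segments multiplies the number of training samples, at the cost of context per sample. The sample rate of 22,050 Hz matches GTZAN's clips; the 5-second segment length is just an illustrative choice.

```python
import numpy as np

def segment_audio(signal, sample_rate, segment_seconds):
    """Split a 1-D waveform into non-overlapping fixed-length segments.

    Shorter segments yield more training samples; longer ones
    capture more temporal context per sample.
    """
    seg_len = int(segment_seconds * sample_rate)
    n_segments = len(signal) // seg_len  # drop the trailing remainder
    return signal[:n_segments * seg_len].reshape(n_segments, seg_len)

# A 30 s GTZAN clip at 22,050 Hz split into 5 s segments
# gives 6 samples instead of 1.
clip = np.zeros(30 * 22050)
segments = segment_audio(clip, 22050, 5)
print(segments.shape)  # (6, 110250)
```

Overlapping segments (a hop smaller than the segment length) would increase the sample count further, at the risk of correlated training examples.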

Update: see this link for a dataset of 30-second clips.

user5722540

How much data is needed for recognizing if a piece of audio is music or speech?

If someone knew the answer to this question exactly, the problem would be solved already :) But seriously, it depends on what your downstream application will be. Imagine trying to discriminate between speech with background music vs a cappella singing (hard), or classifying orchestral music vs audio books (easy).

How long should the audio samples be ideally?

Like everything in machine learning, it depends on the application. For your task, I would test with segments of at least 10, 20, and 30 seconds, or something like that. You are correct that the spectral values can change rather drastically depending on the length!
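One concrete consequence of the segment length is the size of the spectrogram the network sees. A rough sketch, assuming non-centered framing with typical STFT parameters (2048-sample windows, 512-sample hop; both are illustrative choices, not values from the question):

```python
def stft_frames(segment_seconds, sample_rate=22050, n_fft=2048, hop=512):
    """Number of STFT frames (time axis of the spectrogram) for a segment.

    Assumed framing convention: non-centered frames,
    1 + (n_samples - n_fft) // hop.
    """
    n_samples = int(segment_seconds * sample_rate)
    return 1 + (n_samples - n_fft) // hop

for secs in (10, 20, 30):
    print(secs, stft_frames(secs))  # 10 -> 427, 20 -> 858, 30 -> 1288
```

So tripling the segment length roughly triples the CNN's input width, which affects both memory use and the receptive field needed to cover the whole segment.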

user3658307
  • Referring to the last section: is there significant information loss when normalizing every time segment with respect to spectral values? – Androbin Mar 05 '17 at 17:00
  • Of course, experimentation is important. I am just curious about common values used for such tasks. Like, is there some empirical head-start value? – Androbin Mar 05 '17 at 17:02
  • @Androbin depends how you are norming it. This is also a tradeoff between signal and noise; i.e. whether there is more signal or noise in what you are normalizing away. I don't know the details of your task, but keep in mind that long windows = better frequency resolution. So I'd aim for larger windows, to get better discrimination in your case. Keep in mind that CNNs are *designed* to take your ugly messy data and figure out how to deal with it, so don't worry about preprocessing too much... In computer vision, we now often feed raw images directly ... – user3658307 Mar 05 '17 at 19:03
  • @Androbin by the way, did you know there is a DSP stack exchange? e.g. see [here](http://dsp.stackexchange.com/questions/14003/normalizing-a-spectrogram-or-a-pitch-class-profile) and [here](http://dsp.stackexchange.com/questions/1262/creating-a-spectrogram?rq=1). You may get more useful info there. :) – user3658307 Mar 05 '17 at 19:04