
I am working on a convolutional neural network that takes audio spectrograms as input to discriminate between music and speech, using the GTZAN dataset.

If the individual samples are shorter, I get more samples overall. But if the samples are too short, they may lack important features.

How much data is needed for recognizing if a piece of audio is music or speech?

How long should the audio samples be ideally?

Androbin

2 Answers


The ideal audio length depends on a number of factors.

The basic idea is to capture just enough of the signal.
Since audio changes constantly, it is preferable to work on shorter segments. However, a very small frame may contain too few features to be useful, or none at all.

On the other hand, a very long sample captures too many features, adding complexity. In most use cases an audio length of around 25 seconds works well, but this is not a hard rule and you may adjust it as needed. Just make sure the frame is neither very short nor very long.
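As a minimal sketch of the tradeoff described above: splitting each clip into shorter fixed-length segments multiplies the number of training samples, at the cost of context per sample. The sample rate of 22,050 Hz matches GTZAN's clips; the 5-second segment length is just an illustrative choice.

```python
import numpy as np

def segment_audio(signal, sample_rate, segment_seconds):
    """Split a 1-D waveform into non-overlapping fixed-length segments.

    Shorter segments yield more training samples; longer ones
    capture more temporal context per sample.
    """
    seg_len = int(segment_seconds * sample_rate)
    n_segments = len(signal) // seg_len  # drop the trailing remainder
    return signal[:n_segments * seg_len].reshape(n_segments, seg_len)

# A 30 s GTZAN clip at 22,050 Hz split into 5 s segments
# gives 6 samples instead of 1.
clip = np.zeros(30 * 22050)
segments = segment_audio(clip, 22050, 5)
print(segments.shape)  # (6, 110250)
```

Overlapping segments (a hop smaller than the segment length) would increase the sample count further, at the risk of correlated training examples.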

Update: see this link for a dataset of 30-second clips.

user5722540

How much data is needed for recognizing if a piece of audio is music or speech?

If someone knew the answer to this question exactly, the problem would be solved already :) But seriously, it depends on what your downstream application will be. Imagine trying to discriminate between speech with background music vs a cappella singing (hard), or classifying orchestral music vs audio books (easy).

How long should the audio samples be ideally?

Like everything in machine learning, it depends on the application. For your task, I would test with segments of at least 10, 20, and 30 seconds, or something like that. You are correct that the spectral values can change rather drastically depending on the length!
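One concrete consequence of the segment length is the size of the spectrogram the network sees. A rough sketch, assuming non-centered framing with typical STFT parameters (2048-sample windows, 512-sample hop; both are illustrative choices, not values from the question):

```python
def stft_frames(segment_seconds, sample_rate=22050, n_fft=2048, hop=512):
    """Number of STFT frames (time axis of the spectrogram) for a segment.

    Assumed framing convention: non-centered frames,
    1 + (n_samples - n_fft) // hop.
    """
    n_samples = int(segment_seconds * sample_rate)
    return 1 + (n_samples - n_fft) // hop

for secs in (10, 20, 30):
    print(secs, stft_frames(secs))  # 10 -> 427, 20 -> 858, 30 -> 1288
```

So tripling the segment length roughly triples the CNN's input width, which affects both memory use and the receptive field needed to cover the whole segment.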

user3658307
  • Referring to the last section: is there significant information loss when normalizing every time segment with respect to spectral values? – Androbin Mar 05 '17 at 17:00
  • Of course, experimentation is important. I am just curious about common values used for such tasks. Like, is there some empirical head-start value? – Androbin Mar 05 '17 at 17:02
  • @Androbin depends how you are norming it. This is also a tradeoff between signal and noise; i.e. whether there is more signal or noise in what you are normalizing away. I don't know the details of your task, but keep in mind that long windows = better frequency resolution. So I'd aim for larger windows, to get better discrimination in your case. Keep in mind that CNNs are *designed* to take your ugly messy data and figure out how to deal with it, so don't worry about preprocessing too much... In computer vision, we now often feed raw images directly ... – user3658307 Mar 05 '17 at 19:03
  • @Androbin by the way, did you know there is a DSP stack exchange? e.g. see [here](http://dsp.stackexchange.com/questions/14003/normalizing-a-spectrogram-or-a-pitch-class-profile) and [here](http://dsp.stackexchange.com/questions/1262/creating-a-spectrogram?rq=1). You may get more useful info there. :) – user3658307 Mar 05 '17 at 19:04