I am working on a convolutional neural network that takes audio spectrograms as input and discriminates between music and speech, using the GTZAN dataset.
If the individual samples are shorter, I get more samples overall. But if they are too short, they may lack important features.
How much data is needed to recognize whether a piece of audio is music or speech?
How long should the audio samples be ideally?
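
To make the trade-off concrete, here is a minimal sketch of how the number of training samples changes with segment length. It assumes 30-second clips at 22050 Hz, which should be checked against the actual files; the helper `split_into_segments` and the synthetic `clip` are hypothetical and only for illustration.

```python
import numpy as np

def split_into_segments(signal, sample_rate, segment_seconds, hop_seconds=None):
    """Slice a 1-D signal into fixed-length segments.

    With hop_seconds=None the segments do not overlap. A trailing
    remainder shorter than one segment is dropped.
    """
    seg_len = int(round(segment_seconds * sample_rate))
    hop = seg_len if hop_seconds is None else int(round(hop_seconds * sample_rate))
    if seg_len <= 0 or hop <= 0:
        raise ValueError("segment and hop lengths must be positive")
    if len(signal) < seg_len:
        return np.empty((0, seg_len), dtype=signal.dtype)
    n_segments = 1 + (len(signal) - seg_len) // hop
    return np.stack([signal[i * hop : i * hop + seg_len] for i in range(n_segments)])

SAMPLE_RATE = 22050   # assumed sample rate, verify against the dataset
CLIP_SECONDS = 30     # assumed clip length, verify against the dataset

# Synthetic noise standing in for one audio clip.
rng = np.random.default_rng(0)
clip = rng.standard_normal(SAMPLE_RATE * CLIP_SECONDS).astype(np.float32)

for seconds in (0.5, 1, 3, 10, 30):
    segments = split_into_segments(clip, SAMPLE_RATE, seconds)
    print(f"{seconds:>4} s segments -> {segments.shape[0]:>2} samples per clip")
```

Overlapping windows (for example `hop_seconds` set to half of `segment_seconds`) give more samples again, but segments cut from the same clip are strongly correlated, so the train/test split should probably be made per clip rather than per segment.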