
Currently I am working on a project that requires me to pick out audio clips and compare them based on their FFT results (i.e. spectrogram). All of my audio clips are 0.200 s long, but when I process them through the transform, they are no longer the same length. The code I am using for the transform uses the numpy and librosa libraries:

import numpy as np
import librosa as lb

def extractFFT(audioArr):
    fourierArr = []
    for x in range(len(audioArr)):
        y, sr = lb.load(audioArr[x])  # load the clip at librosa's default sample rate
        fourier = np.fft.fft(y)       # complex spectrum
        fourier = fourier.real        # keep only the real part (see below)
        fourierArr.append(fourier)
    return fourierArr

I am only taking the real part of the transform because I also want to pass this through a PCA, which does not allow complex numbers. Regardless, I can perform neither LDA (linear discriminant analysis) nor PCA on this FFT array of audio clips, since some of the results are of different lengths.

The code I have for the LDA is as follows, where the labels are given for a frequencyArr of length 4:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def LDA(frequencyArr):
    splitMark = int(len(frequencyArr)*0.8)  # 80/20 train/validation split
    trainingData = frequencyArr[:splitMark]
    validationData = frequencyArr[splitMark:]
    labels = [1,1,2,2]

    lda = LinearDiscriminantAnalysis()
    lda.fit(trainingData, labels[:splitMark])

    print(f"prediction: {lda.predict(validationData)}")

This throws the following ValueError, coming from the `lda.fit(trainingData, labels[:splitMark])` line:

ValueError: setting an array element with a sequence.

I know this error stems from the array not having a consistent two-dimensional shape: I don't receive the error when the FFT results are all of equal length, and in that case the code works as intended.
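For illustration, the same error can be reproduced with nothing but a ragged list; a minimal sketch (the lengths here are hypothetical, chosen to match the clips described below):

import numpy as np

# Rows of unequal length cannot form a regular 2-D float array,
# which is what scikit-learn tries to build from the training data.
ragged = [np.ones(4410), np.ones(4409)]
np.asarray(ragged, dtype=float)  # ValueError: setting an array element with a sequence.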

Does this have something to do with the audio clips? After the transform, some audio clips are of equal length, others are not. If someone could explain why these same-length audio clips can return different-length FFTs, that would be great!

Note, they normally only differ by a few points: say, for 3 of the audio clips the FFT length is 4410, but for the 4th it is 4409. I know I could probably just trim the lengths down to the smallest in the group (sketched below), but I'd prefer a cleaner method that won't leave out any values.
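A minimal sketch of that trimming workaround, assuming the per-clip FFTs are already collected in fourierArr as above:

import numpy as np

minLen = min(len(f) for f in fourierArr)              # shortest FFT in the group
trimmed = np.array([f[:minLen] for f in fourierArr])  # regular 2-D shape: (n_clips, minLen)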

Andrew

1 Answer


First of all: do not take only the real part of the transform result. It won't do you any good. Use the power (r^2+i^2) or magnitude (sqrt(power)) to get the strength of the signal in each frequency bin.
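In numpy that is simply the following (a sketch, assuming y is a loaded signal as in your code):

import numpy as np

fourier = np.fft.fft(y)      # complex spectrum
magnitude = np.abs(fourier)  # sqrt(real^2 + imag^2)
power = magnitude ** 2       # real^2 + imag^2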

> Does this have something to do with the audio clips? After the transform, some audio clips are of equal length, others are not. If someone could explain why these same-length audio clips can return different-length FFTs, that would be great!

They are simply not the same length. I bet the sample counts of your clips aren't exactly identical.

After `y, sr = lb.load(audioArr[x])`, do `print('sample count = {}'.format(len(y)))` and you will most likely see different values (you've stated as much yourself).

As you already point out, you could of course simply cut the signal off at min(len(y)) and then feed it into the FFT. But typically, the way around this is to use an STFT, which has a fixed window size. This ensures same-length input to the FFT. You can use librosa's implementation as an easy starting point. The docs also explain how to get magnitude/power.

So instead of:

y, sr = lb.load(audioArr[x])
fourier = np.fft.fft(y)
fourier = fourier.real
fourierArr.append(fourier)

You do:

y, sr = lb.load(audioArr[x])
# get the magnitudes; librosa's stft returns shape (1 + n_fft//2, n_frames)
D = np.abs(lb.stft(y, n_fft=4096))  # use 4096 as the window length
fourierArr.append(D[:, 0])          # only use the first frame of the STFT

In essence, if you feed the Fourier transform different-length input, you will get different-length output, which is something LDA does not forgive when that output is used as training data. So you have to make sure your input has the same length. The easiest way to do this is to use the STFT (or simply cut all your input down to the minimum length). IMO there is nothing unclean about this, and it will not affect results much if you are missing a couple of samples.
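Put together, the fixed extraction might look like this (a sketch under the assumptions above; audioPaths is a stand-in name for your list of clip filenames):

import numpy as np
import librosa as lb

def extractSTFT(audioArr):
    features = []
    for path in audioArr:
        y, sr = lb.load(path)
        D = np.abs(lb.stft(y, n_fft=4096))  # magnitudes, shape (2049, n_frames)
        features.append(D[:, 0])            # first frame: always 2049 bins
    return np.array(features)               # regular 2-D shape: (n_clips, 2049)

X = extractSTFT(audioPaths)  # audioPaths: your list of clip filenames (assumption)
print(X.shape)               # e.g. (4, 2049) -- same length for every clip, so LDA/PCA will accept it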

Hendrik
  • Thank you for clarifying about the different audio lengths and how to resolve this situation. As for including the imaginary part of the transformation, does the code you included do this? Would I then be training on the magnitude/power of the audio clip instead of the FFT or STFT? Or does implementing the STFT eliminate the problem of imaginary numbers? Sorry if this question is unclear, I'm just not understanding how power/magnitude play a role after I switch over to the STFT. – Andrew Jul 30 '19 at 18:34
  • One other question/comment, what parameter is `n_fft=4096` when you are loading in the audio? I have looked at the librosa.core.load documentation and cannot find a parameter meeting that criteria. – Andrew Jul 30 '19 at 18:41
  • My apologies. That parameter was supposed to go into the `stft` call ([docs](https://librosa.github.io/librosa/generated/librosa.core.stft.html)). I updated the code sample. – Hendrik Jul 30 '19 at 19:14
  • No worries, appreciate the help/clarification! – Andrew Jul 30 '19 at 20:11
  • Sorry to bother again, but is there a reason I am only supposed to use the first frame, as done by `fourierArr.append(D[:, 0])`? Do none of the other data points matter? I looked at the values being printed and saw they were all different, so I don't understand how you concluded upon just picking the first frame and forgetting the rest of the data points. If it would be easier, I could open another question to expand on this if this doesn't give enough information. – Andrew Jul 31 '19 at 05:02
  • `D[:, 0]` corresponds to the first window, i.e. the first 4096 samples. `D[:, 1]` corresponds to samples `1*hop_length` to `1*hop_length+n_fft` (librosa's default is `hop_length=n_fft//4`). You can also use the second frame; in fact, if it suits you, you can do a per-bin average (using `np.mean` with the right axis) or simply dump more data into your LDA. However, anything that's not in the first 4096 samples will be zero-padded, because your signal is short, and may not increase the quality of whatever you want to do. – Hendrik Jul 31 '19 at 07:47
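A minimal sketch of the per-bin average mentioned in the last comment (path is a stand-in for one clip filename):

import numpy as np
import librosa as lb

y, sr = lb.load(path)               # path: one clip filename (assumption)
D = np.abs(lb.stft(y, n_fft=4096))  # magnitudes, shape (2049, n_frames)
avgSpectrum = np.mean(D, axis=1)    # average each frequency bin across frames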