
I have a database containing streaming videos. I want to calculate LBP features from the video frames and MFCC features from the audio, and for every frame in the video I have an annotation. The annotation is aligned with the video frames and the timeline of the video, so I want to map the annotation times onto the MFCC output. I know that the sample rate is 44100.

from python_speech_features import mfcc
from python_speech_features import logfbank
import scipy.io.wavfile as wav

audio_file = "sample.wav"
(rate, sig) = wav.read(audio_file)
mfcc_feat = mfcc(sig, rate)
print(len(sig))        # 2130912
print(len(mfcc_feat))  # 4831

Firstly, why is the length of the MFCC result 4831, and how can I map it to the annotations I have in seconds? The total duration of the video is 48 seconds, and the annotation is 0 everywhere except the 19-29 s window, where it is 1. How can I locate the samples within that window (19-29 s) in the MFCC output?

  • Just a comment: Librosa has various feature extraction methods. It may help your work. https://github.com/librosa/librosa/blob/master/examples/LibROSA%20demo.ipynb – dkato Dec 06 '17 at 10:45

1 Answer


Run

 mfcc_feat.shape

You should get (4831, 13). 13 is the MFCC length (the default numcep is 13), and 4831 is the number of windows (frames). The default winstep is 10 ms, so 4831 frames cover about 48.3 seconds, which matches your sound file's duration (2130912 samples / 44100 Hz ≈ 48.3 s). To get the windows corresponding to 19-29 sec, just slice

mfcc_feat[1900:2900,:]
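
To generalize that indexing, here is a minimal sketch, assuming the default winstep of 0.01 s; the helper name annotation_to_frames is just for illustration:

WINSTEP = 0.01  # seconds between successive MFCC frames (the default winstep)

def annotation_to_frames(t_start, t_end, winstep=WINSTEP):
    # Map an annotation window given in seconds to MFCC frame indices.
    return int(t_start / winstep), int(t_end / winstep)

beg, end = annotation_to_frames(19, 29)   # -> (1900, 2900)
window_feats = mfcc_feat[beg:end, :]      # MFCC frames for the 19-29 s annotation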

Remember that you cannot listen to the MFCC. Each frame just represents a 0.025 s slice of audio (the default value of the winlen parameter).

If you want to get to the audio itself, it is

sig[time_beg_in_sec*rate:time_end_in_sec*rate]
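
For example, the raw samples for the annotated 19-29 s window would be:

audio_window = sig[19 * rate : 29 * rate]  # raw audio samples for the 19-29 s annotation
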
  • But isn't there also some overlap? This is what confused me. – konstantin Nov 29 '17 at 14:55
  • I want to take the features of the signal and use them for some further analysis. What do you mean by listening to the MFCC (they are coefficients in the frequency domain; how could I listen to them)? – konstantin Nov 29 '17 at 15:03
  • re: overlap: Of course. MFCCs are calculated over `winlen`, once every `winstep`. So the number of windows depends on `winstep`, and the quality of the features depends on `winlen`. For different applications you use different window sizes (see the sketch after these comments). – igrinis Nov 29 '17 at 15:09
  • re: listening. I am glad you understand that. – igrinis Nov 29 '17 at 15:12
  • Is it easy to calculate jitter and pitch for the same windows, in order to enrich my MFCC features? Is it straightforward to do so? – konstantin Nov 29 '17 at 15:13
  • You can do it. Pitch is not so trivial or reliable. You can't calculate pitch for the *same* windows (you need about half a second), but you can calculate pitch over whole signal and take value corresponding to the specific time window. – igrinis Nov 29 '17 at 15:18
  • Any recommended python code for jitter and pitch? I want to check if they are useful for classification. – konstantin Nov 29 '17 at 15:25
  • Also, you mean mfcc_feat.shape, not sig.shape, right? – konstantin Nov 29 '17 at 15:31
  • Also, is it possible that I am receiving results from both channels? During debugging I checked that the MFCC indeed has length 4831, but there are two vectors inside it, one of size 13 and one of size 26. Is there a chance that this is a representation of the two channels? – konstantin Nov 29 '17 at 15:39
  • I can't answer that. It depends on `python_speech_features` implementation. It might be energies of the filter banks, but this you should check. re code, I did not use any Python implementations, sorry. You might search for Yin pitch detector. – igrinis Nov 29 '17 at 15:48
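
A minimal sketch of how winlen and winstep affect the frame count, and of logfbank (imported in the question), whose default nfilt of 26 may explain the 26-wide vectors mentioned above; the parameter values here are illustrative assumptions, not recommendations from the thread:

import scipy.io.wavfile as wav
from python_speech_features import mfcc, logfbank

rate, sig = wav.read("sample.wav")

# Defaults: 25 ms windows (winlen) hopped every 10 ms (winstep),
# i.e. roughly 100 overlapping frames per second of audio.
feat_default = mfcc(sig, rate)                             # shape ~ (4831, 13)

# Larger hop -> fewer frames; windows still overlap while winlen > winstep.
feat_coarse = mfcc(sig, rate, winlen=0.025, winstep=0.02)  # roughly half as many frames

# logfbank returns log filter-bank energies (default nfilt=26), which may be
# the 26-wide vectors mentioned in the comments above.
fbank_feat = logfbank(sig, rate)                           # shape ~ (4831, 26)

print(feat_default.shape, feat_coarse.shape, fbank_feat.shape)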