I am very new to speech processing. I am trying to do noise reduction using the spectral subtraction method. Many papers and algorithm descriptions say to split the audio signal into frames.
For that, I made each frame 20 ms long, i.e. for a sampling frequency of 16 kHz, each frame has 16 kHz * 20 ms = 320 samples. For each frame I then compute:
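frame = frame(:);                                  % force a column vector to match hamming()'s output
windowed_frame = frame .* hamming(length(frame));  % apply a 320-point Hamming window
complex_spec = fft(windowed_frame, 512);           % zero-pad to a 512-point FFT
mag_spec = abs(complex_spec);                      % magnitude spectrum (512 bins)
phase_spec = angle(complex_spec);                  % phase spectrum (512 bins)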
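A minimal sketch of the framing step described above. It assumes non-overlapping frames and uses a placeholder signal `x`; neither the overlap nor the real input is specified in this post, so both are assumptions for illustration only.

```matlab
% Sketch: cut a signal into 20 ms frames at 16 kHz.
% Assumptions: non-overlapping frames; x is a placeholder, not real speech.
fs        = 16000;                         % sampling frequency in Hz
frame_len = round(0.020 * fs);             % 20 ms -> 320 samples per frame
x         = randn(fs, 1);                  % placeholder: 1 s of random samples
n_frames  = floor(length(x) / frame_len);  % ignore any incomplete tail frame

% Each column of `frames` is one 320-sample frame.
frames = reshape(x(1:n_frames * frame_len), frame_len, n_frames);
```

With this layout, `frames(:, k)` plays the role of `frame` in the snippet above.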
Now, for the noise signal, it says:
Assume initial few non-speech frames as noise.
So, to get a noise estimate, it states:
Take the mean of the first 3 or so frames.
Each frame is 320 samples long. So what does it mean to take the mean/average of the first 3 frames?
The 3 frames contain a total of 3*320 = 960 samples. Does it mean I should take the mean of those 960 values? That would give only a single value, whereas I need a frame-sized (i.e. 20 ms) noise_estimate.
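One common reading in spectral subtraction, sketched here as an assumption rather than as what any particular paper prescribes, is that the average is taken per frequency bin across the magnitude spectra of those frames, not across all 960 time samples. The names `noise_frames`, `noise_mag` and `n_noise` are hypothetical, `x` is a placeholder signal, and the frames are assumed non-overlapping.

```matlab
% Sketch: noise estimate as the bin-wise mean of the first 3 magnitude spectra.
% Assumptions: non-overlapping frames; x is a placeholder, not real speech.
fs = 16000;  frame_len = 320;  nfft = 512;  n_noise = 3;
x  = randn(fs, 1);                                   % placeholder signal

% Symmetric Hamming window, written out so no toolbox is needed.
n   = (0:frame_len - 1)';
win = 0.54 - 0.46 * cos(2 * pi * n / (frame_len - 1));

% Columns are the first 3 frames (assumed to contain no speech).
noise_frames = reshape(x(1:n_noise * frame_len), frame_len, n_noise);

% Window each column, then take a 512-point FFT of each column.
noise_mag = abs(fft(noise_frames .* repmat(win, 1, n_noise), nfft));

% Average across the 3 frames, separately for each frequency bin.
noise_estimate = mean(noise_mag, 2);                 % 512-by-1 vector
```

Under this reading, `noise_estimate` has one value per FFT bin (512 here, not 320), so it has the same shape as `mag_spec` and can be subtracted from it bin by bin. Some descriptions average the power spectra (`noise_mag.^2`) instead of the magnitudes.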
Any help?