
I am very new to speech processing. I am trying to do noise reduction using the spectral subtraction method. Many theory papers and algorithm descriptions say to split the audio signal into frames.

For that, I made each frame 20 ms long, i.e. for a sampling frequency of 16 kHz each frame contains 16 kHz * 20 ms = 320 samples. I then window each frame and take a 512-point FFT:

windowed_frame = frame .* hamming(length(frame));
complex_spec = fft(windowed_frame,512);        
mag_spec = abs(complex_spec);
phase_spec = angle(complex_spec);
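
As a rough sketch, the per-frame step above could be applied to a whole signal at once by arranging the frames as columns of a matrix. This assumes non-overlapping frames and uses white noise as a placeholder for real audio; the names `x`, `frames`, `win`, and `num_frames` are made up for illustration. The Hamming window is written out by hand so that no toolbox is required; it is meant to match `hamming(frame_len)`.

```matlab
fs = 16000;                       % sampling frequency in Hz (from the question)
frame_len = round(0.020 * fs);    % 20 ms -> 320 samples per frame
nfft = 512;                       % FFT size used above

% Placeholder input: one second of white noise stands in for real audio.
x = randn(fs, 1);

% Non-overlapping frames for simplicity; any incomplete last frame is dropped.
num_frames = floor(length(x) / frame_len);
frames = reshape(x(1:num_frames * frame_len), frame_len, num_frames);

% Symmetric Hamming window as a column vector.
n = (0:frame_len - 1)';
win = 0.54 - 0.46 * cos(2 * pi * n / (frame_len - 1));

% Window every column (frame), then take the FFT along each column.
windowed = frames .* repmat(win, 1, num_frames);
complex_spec = fft(windowed, nfft);   % nfft-by-num_frames, zero-padded to 512
mag_spec = abs(complex_spec);
phase_spec = angle(complex_spec);
```

With this layout, column `k` of `mag_spec` is the magnitude spectrum of frame `k`.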

Now, for the noise signal, the literature says:

Assume initial few non-speech frames as noise.

So, to get a noise estimate, it states:

Take the mean of the first 3 or so frames.

Each frame is 320 samples long. What does it mean to take the mean/average of the first 3 frames?

The 3 frames contain a total of 3 * 320 = 960 samples. Does this mean taking the mean of those 960 values? That would result in only a single value, but I need a noise_estimate that is one window long, i.e. 20 ms.

Any help?

Nikolay Shmyrev
Sagaryal
  • there is a sister site which focuses on digital signal processing ... if no help here you may want to move your question ... see https://dsp.stackexchange.com/search?q=+sound+frames+in+Speech+Processing – Scott Stensland Jul 21 '17 at 00:11
  • I would suspect this means to take the element wise average of the spectrum from each of the first three frames, giving you an average power spectrum from the first 960 samples. – Tom Wyllie Jul 23 '17 at 13:42

1 Answer


You need a noise spectrum estimate, so you average mag_spec over the first 3 frames, bin by bin, not the signal samples.

 noise_spec = (mag_spec_1 + mag_spec_2 + mag_spec_3) / 3

The result will be 512 numbers: the average noise magnitude for each frequency bin.
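
A minimal sketch of that averaging, assuming the recording really does begin with noise only and that frames do not overlap. The input `x` is placeholder low-level white noise, and `num_noise_frames`, `noise_frames`, and `win` are illustrative names, not part of the question's code. The window is computed by hand and is meant to match `hamming(frame_len)`.

```matlab
fs = 16000;             % sampling frequency in Hz
frame_len = 320;        % 20 ms at 16 kHz
nfft = 512;             % FFT size from the question
num_noise_frames = 3;   % leading frames assumed to contain only noise

% Placeholder signal: low-level white noise stands in for the recording.
x = 0.01 * randn(fs, 1);

% First three frames as columns of a 320-by-3 matrix.
noise_frames = reshape(x(1:num_noise_frames * frame_len), ...
                       frame_len, num_noise_frames);

% Symmetric Hamming window as a column vector.
n = (0:frame_len - 1)';
win = 0.54 - 0.46 * cos(2 * pi * n / (frame_len - 1));

% Magnitude spectrum of each windowed frame: 512-by-3.
mag_spec = abs(fft(noise_frames .* repmat(win, 1, num_noise_frames), nfft));

% Average across frames (dimension 2): one value per frequency bin.
noise_spec = mean(mag_spec, 2);   % 512-by-1
```

Here `mean(mag_spec, 2)` is the same as adding the three columns and dividing by 3, so `noise_spec` has one entry per FFT bin rather than collapsing to a single number.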

Nikolay Shmyrev