2

I was wondering if someone could point me to a good tutorial or show me how to graph the amplitude from a byte array. The audio format I am using is: U LAW 8000.0 Hz, 8 bit, mono, 1 bytes/frame.

John Kane
  • It depends on what you mean by *amplitude*. Do you want *instantaneous* amplitude, or smoothed RMS/peak amplitude? Or perhaps even frequency domain amplitude versus frequency (power spectrum, spectrogram, etc.)? – Paul R Mar 15 '10 at 16:30
  • I'm not sure which I need. Basically, I need to try to detect when someone starts and stops talking. – John Kane Mar 15 '10 at 16:36
  • OK - the algorithm for this kind of thing is known as Voice Activity Detection (VAD) - it's used in echo cancellation and various other telecomms applications. I'll add more in an answer below... – Paul R Mar 15 '10 at 17:10

2 Answers

6

It sounds like you are interested in a short-term smoothed amplitude measurement (an approximation to RMS). Usually, to do this, you take a rectified version of the input signal and then apply a low-pass filter to it, e.g.

x1 = abs(x); // x1 = rectified input signal
x2 = k * x2 + (1 - k) * x1; // simple single-pole recursive low-pass filter; x2 carries over from the previous sample (start it at 0)

x2 is the amplitude of the signal x. k is a factor between 0 and 1.0 that determines the time constant of the smoothing filter: the closer k is to 1.0, the slower and smoother the response.

You will then have some kind of threshold value which you use to decide whether you are in silence (x2 < threshold) or speech (x2 >= threshold).
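
As a rough sketch of the two steps above in Python: the function names `smoothed_amplitude` and `speech_segments` are illustrative, the default `k` is only a starting point, and the samples are assumed to be linear (already decoded) values rather than raw encoded bytes.

```python
def smoothed_amplitude(samples, k=0.99):
    """Rectify each sample and smooth it with a single-pole low-pass filter.

    samples: iterable of linear sample values (assumed already decoded).
    k: smoothing factor, 0 < k < 1; values closer to 1 smooth more heavily.
    """
    x2 = 0.0  # filter state, carried over from one sample to the next
    envelope = []
    for x in samples:
        x1 = abs(x)                   # rectified input sample
        x2 = k * x2 + (1.0 - k) * x1  # single-pole recursive low-pass filter
        envelope.append(x2)
    return envelope


def speech_segments(envelope, threshold):
    """Return (start, end) index pairs where envelope >= threshold.

    The end index is exclusive: it is the first sample back below threshold.
    """
    segments = []
    start = None  # index where the current above-threshold run began
    for i, level in enumerate(envelope):
        if level >= threshold and start is None:
            start = i                     # silence -> speech
        elif level < threshold and start is not None:
            segments.append((start, i))   # speech -> silence
            start = None
    if start is not None:
        # The signal ended while still above the threshold.
        segments.append((start, len(envelope)))
    return segments
```

The list returned by `smoothed_amplitude` is also what you would graph: plotting each value against its sample index divided by the sample rate (8000 here) gives amplitude versus time in seconds. The threshold depends entirely on your signal levels and has to be found by experiment.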

Paul R
  • Yes, this is what I am looking for, thank you. When you say "x2 is the amplitude of the signal x", what is 'x'? (Sorry, I have been working on this for far too long.) Also, is there a good way to calculate what value k should have, or are there commonly used values? – John Kane Mar 15 '10 at 18:24
  • x is the input value at the current sample time (you can consider your stream of input data to be an array x[] if that helps). Typically k will be between 0.9 and 0.99 but you will want to experiment with this and the threshold etc to get the behaviour you want in terms of how quickly you switch between "silence" and "speech", how many false positives/negatives you want, etc. – Paul R Mar 15 '10 at 18:45
  • Once again, thank you, this helps a lot. Do I need to do anything differently because of the encoding? – John Kane Mar 15 '10 at 19:01
  • @john: you'll need to convert your u-law samples to linear before processing, but this is pretty trivial to do. – Paul R Mar 15 '10 at 20:33
  • When you say linear, do you mean PCM? Also, do you know where I can find a good tutorial on doing the conversion from u-law to PCM? I have looked, but I have not found anything explaining how to do this. – John Kane Mar 16 '10 at 12:37
  • Yes, you just need to convert the 8 bit µ-law samples to 16 bit (linear) signed integers. If you look at the Wikipedia entry for µ-law: and scroll to the bottom, the last link takes you to example C code for µ-law coding/decoding: – Paul R Mar 16 '10 at 15:25
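
For reference, the µ-law to linear conversion discussed in the comments above can be sketched in a few lines of Python. This is a minimal sketch assuming the standard G.711 µ-law byte layout (all bits inverted, then a sign bit, a 3-bit segment and a 4-bit mantissa); it is not the C code the comment links to, and the names `ulaw_to_linear` and `decode_ulaw` are illustrative.

```python
BIAS = 0x84  # bias added during µ-law encoding (132), removed again on decode


def ulaw_to_linear(u_val):
    """Decode one 8-bit µ-law value (0..255) to a signed 16-bit linear sample."""
    u = ~u_val & 0xFF              # µ-law bytes are stored with all bits inverted
    mantissa = u & 0x0F            # low 4 bits: position within the segment
    segment = (u & 0x70) >> 4      # next 3 bits: segment (exponent) number
    t = ((mantissa << 3) + BIAS) << segment
    # The top bit is the sign: set means negative.
    return BIAS - t if (u & 0x80) else t - BIAS


def decode_ulaw(data):
    """Decode a bytes-like object of µ-law samples to a list of linear ints."""
    return [ulaw_to_linear(b) for b in data]
```

For example, `decode_ulaw(bytes([0xFF, 0x80, 0x00]))` gives silence, the largest positive value and the largest negative value. If your language hands you signed bytes, mask each one with `0xFF` before decoding. The decoded values are what should be fed to the rectify-and-smooth filter from the answer.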
0

Read about the Fourier transform. But it's only a part of what you need to do in order to achieve what you want.

Roman
  • Poor answer - it doesn't really tell the guy *anything* about graphing amplitude. – Paul R Mar 15 '10 at 16:31
  • @Paul R: when I needed to do something similar, I had to read an article of about 50 pages just to understand the principle. It's not an easy problem. – Roman Mar 15 '10 at 16:33
  • I know what you are saying; I have done research on that and on the DFT and the FFT. It is a fun problem. – John Kane Mar 15 '10 at 16:41
  • It doesn't look like he is interested in frequency domain amplitude information, so the FFT is somewhat irrelevant. – Paul R Mar 15 '10 at 17:15