
I am building a tool that is supposed to run on a server and analyze sound files. I want to do this in Ruby, as all my other tools are written in Ruby as well, but I am having trouble finding a good way of accomplishing this.

A lot of the examples I've found have been doing visualizers and graphical stuff. I just need the FFT data, nothing more. I need to both get the audio data and do an FFT on it. My end goal is to calculate things like the mean/median/mode, 25th percentile, and 75th percentile over all frequencies (weighted by amplitude), the BPM, and perhaps some other good characteristics, so that I can later cluster similar sounds together.

First I tried to use ruby-audio and fftw3, but I never got the two to really work together. The documentation wasn't good either, so I didn't really know what data was being shuffled around. Next I tried to use bplay/brec and limit my Ruby script to just reading STDIN and performing an FFT on that (still using fftw3). But I couldn't get bplay/brec to work since the server doesn't have a sound card, and I didn't manage to get the audio directly to STDOUT without it going through an audio device first.

Here's the closest I've gotten:

# extracting audio from wav with ruby-audio
buf = RubyAudio::Buffer.float(1024)
RubyAudio::Sound.open(fname) do |snd|
    while snd.read(buf) != 0
        # ???
    end
end

# performing FFT on audio
def get_fft(input, window_size)
    data = input.read(window_size).unpack("s*")
    na = NArray.to_na(data)
    fft = FFTW3.fft(na).to_a[0, window_size/2]
    return fft
end

So now I'm stuck and can't find any more good results on Google. Perhaps you SO guys can help me out?

Thanks!

Christoffer Reijer
  • Perhaps this previous discussion might be helpful: http://stackoverflow.com/questions/2834548/ruby-play-pause-resume-aac-audio-files – fmendez Feb 22 '13 at 19:49
  • Could you elaborate on why you are stuck? Please include error messages or gaps in your understanding of how things should work. – Randall Cook Feb 23 '13 at 07:53
  • I have added my code so far. I have a huge gap between reading the data using ruby-audio and extracting the FFT using fftw3. See the comment with three question marks. I have the wav data inside buf but I don't know what the data really is/represent. Are there headers in there? Is it compressed/encoded? etc, etc. I want to get the data into get_fft (which is taken almost verbatim from another SO post). – Christoffer Reijer Feb 23 '13 at 14:01

2 Answers


I think there are two problems here. One is getting the samples, the other is performing the FFT.

To get the samples, there are two main steps: decoding and downmixing. To decode wav files, you just need to parse the header so you know how to interpret the samples. For mp3 files, you'll need to do a full decode. Once the audio has been decoded, if you are not interested in processing the stereo channels separately, you may need to downmix it into mono, since the FFT expects a single channel as input. If you don't mind venturing outside of Ruby, the sox tool makes this easy. For example, sox song.mp3 -b 16 song.raw channels 1 should convert an mp3 to a mono file of pure PCM samples (i.e. 16-bit integers). BTW, a quick search revealed the ruby/audio library (perhaps it is the one mentioned in your post). It looks pretty good, especially since it wraps libsndfile.
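
If you go the sox route from within Ruby, one way is to shell out to sox and read the raw samples back in. Here's a rough sketch, assuming sox is installed on the server and using hypothetical file names:

raw_path = "song.raw"
system("sox", "song.mp3", "-b", "16", raw_path, "channels", "1") or abort("sox failed")

# read the decoded file back in as signed 16-bit integers
# ("s*" assumes native endianness matches what sox wrote)
samples = File.binread(raw_path).unpack("s*")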

To perform the FFT, I see three options. One is to use this snippet of code that performs an FFT. I'm no Ruby expert, but it looks like it might be OK. The second option is to use NArray. It has a ton of mathematical methods, and FFTW support is available in a separate module (a tarball for it is linked in the middle of the NArray page). The third option is to write your own FFT code. It's not an especially complicated algorithm, and writing it yourself could give you great experience with numerical processing in Ruby (if you need that).
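
If you want to try the third option, here is a rough sketch of what a hand-rolled FFT can look like in Ruby: a recursive radix-2 Cooley-Tukey transform built on the standard Complex class. It assumes the input length is a power of two and is written for clarity, not speed:

def simple_fft(samples)
    n = samples.length
    return samples.map { |s| Complex(s) } if n == 1

    # split into even- and odd-indexed samples and transform each half recursively
    even = simple_fft(samples.each_slice(2).map(&:first))
    odd  = simple_fft(samples.each_slice(2).map(&:last))

    # combine: X[k] = E[k] + w^k * O[k] and X[k + n/2] = E[k] - w^k * O[k]
    pairs = (0...n / 2).map do |k|
        twiddle = Complex.polar(1.0, -2 * Math::PI * k / n) * odd[k]
        [even[k] + twiddle, even[k] - twiddle]
    end
    pairs.transpose.flatten
end

# e.g. simple_fft([0.0, 1.0, 0.0, -1.0]) puts all the energy in bins 1 and 3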

You are probably aware of this, but the FFT expects complex input and generates complex output. Audio signals are real, of course, so the imaginary component of the input should always be zero (a + 0*i). Since your input is real, the output will be symmetrical about the midpoint of the output array. You can safely ignore the upper half. If you want the energy in a particular frequency bin (they are spaced linearly up to half the sample rate), you'll need to compute the magnitude of the complex value (sqrt(real*real + imag*imag)).
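
In terms of the fftw3 gem from your code, that might look roughly like this (a sketch only; sample_rate is an assumed value and na is a mono window of samples as an NArray, like in your snippet):

sample_rate = 44100
window_size = 1024

# lower half of the complex spectrum (the upper half is its mirror image)
spectrum = FFTW3.fft(na).to_a[0, window_size / 2]

# magnitude per bin, i.e. sqrt(real*real + imag*imag)
magnitudes = spectrum.map { |c| Math.sqrt(c.real**2 + c.imag**2) }

# bin k corresponds to k * sample_rate / window_size Hz
frequencies = (0...window_size / 2).map { |k| k * sample_rate.to_f / window_size }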

One more thing: Because frequency zero (the DC offset of the signal) and the Nyquist frequency (half the sample rate) have no phase components, some FFT implementations put them together into the same complex bin (one in the real component, one in the imaginary component, typically of the first bin). You can create some simple signals (all 1s for just a DC signal, and alternating +1, -1 for a Nyquist signal) and see what the FFT output looks like.
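
For example, a quick experiment along those lines with the fftw3 gem (an 8-sample window chosen arbitrarily, and the same single-argument FFTW3.fft call used elsewhere on this page):

require "narray"
require "fftw3"

dc      = NArray.to_na([1.0] * 8)        # constant signal: the energy should land in bin 0
nyquist = NArray.to_na([1.0, -1.0] * 4)  # fastest possible oscillation: energy at N/2

p FFTW3.fft(dc).to_a
p FFTW3.fft(nyquist).to_a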

Randall Cook
  • Thanks for the long answer. This is pretty much how I've been thinking, but I've not been able to really put all of this together. I added some code to show the furthest I got when using ruby-audio (the one you linked) and the fftw3 gem. – Christoffer Reijer Feb 23 '13 at 06:50
  • Often when I am having trouble putting things together, I start very small and just add one step at a time, adding lots of diagnostic code (or checking variables closely in the debugger) to make sure things are working as expected: can I open the file? can I read data? is the format of the data what I expect? can I transform the data? does it still look right? etc. – Randall Cook Feb 23 '13 at 07:51
  • Yes, but I am stuck with: what is this data that I'm looking at and how should I feed it into the FFT function? Should I just give it the content of the buffer (call to_a on buf) or do I need to process it beforehand? I am not sure what the data that I get from ruby-audio represents. – Christoffer Reijer Feb 23 '13 at 13:58
  • Got it. Uncompressed digital audio, ready for an FFT, is typically an array of 16-bit integers, each representing (not exactly, but a good conceptual model) a voltage on an analog audio cable carrying the signal. I recommend printing out the data as an array of integers to see what you get. You should see a lot of numbers close to zero. You could even take these numbers (as text) and load them into a spreadsheet/Matlab/Octave and graph them. You should see a sound wave. You can use a digital audio editor like Audacity (free and open source) to view the source audio and . . . – Randall Cook Feb 24 '13 at 02:05
  • . . . compare it with what you extract. If they match, you are ready to proceed. If not, you'll have to look closely at the signal path and see what is going on at each step. – Randall Cook Feb 24 '13 at 02:06
  • Ok, I've extracted parts of the buffer and plotted it in Excel, which gives me the same curve as I get in Audacity. So calling buf.to_a will give me the sound wave. My question now is, given the code above for doing the FFT: can I just set data to buf.to_a, and does it matter what I choose for window_size (I currently use 1024)? – Christoffer Reijer Feb 24 '13 at 09:12
  • ... or should I read the _whole_ sound wave into an array and _then_ feed it into FFTW3.fft as an NArray? – Christoffer Reijer Feb 24 '13 at 09:53
  • Either way. The FFT provides the energy and phase of all frequency components of its input. If you pass the entire song, you'll get the transform of the entire song, as if the entire song played in one moment. Sometimes this is useful (for measuring the overall frequency profile, for instance), but more often people break the audio into chunks (called windows, or frames) and pass them sequentially to the FFT. This gives a frequency profile that changes over time. – Randall Cook Feb 24 '13 at 21:02
  • The FFT generally requires that its input size be a power of two. Sometimes it pads the input with zeroes to bring it up to the correct size. Chunks of 1024 samples is a good choice. Sometimes people overlap the windows by 50%, especially if they are planning on doing filtering in the frequency domain. With size 1024, this implies a "hop size" of 512 samples. Often people "window" the input before passing it to the FFT (i.e. fade it in and out), which can reduce noise and artifacts in the FFT. Look up the Hanning and Hamming window functions for more information. – Randall Cook Feb 24 '13 at 21:08
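
To make the windowing idea from the comments above concrete, here is a rough sketch of Hann-windowed frames with 50% overlap, reusing the NArray/FFTW3 calls from the other code on this page (samples is assumed to be the full mono signal as an array of floats):

window_size = 1024
hop_size = window_size / 2
hann = (0...window_size).map { |n| 0.5 * (1 - Math.cos(2 * Math::PI * n / (window_size - 1))) }

spectra = []
(0..samples.length - window_size).step(hop_size) do |offset|
    # fade each frame in and out before transforming it
    frame = samples[offset, window_size].each_with_index.map { |s, i| s * hann[i] }
    na = NArray.to_na(frame)
    spectra << FFTW3.fft(na).to_a[0, window_size / 2]
end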

Here's the final solution to what I was trying to achieve, thanks in large part to Randall Cook's helpful advice. The code to extract the sound wave and FFT from a wav file in Ruby:

require "ruby-audio"
require "fftw3"

fname = ARGV[0]
window_size = 1024
wave = Array.new                         # the full time-domain signal, window by window
fft = Array.new(window_size/2) { [] }    # one array per bin; Array.new(n, []) would make every bin share a single array

begin
    buf = RubyAudio::Buffer.float(window_size)
    RubyAudio::Sound.open(fname) do |snd|
        while snd.read(buf) != 0
            wave.concat(buf.to_a)
            # FFT of this window; keep only the lower half (the upper half mirrors it)
            na = NArray.to_na(buf.to_a)
            fft_slice = FFTW3.fft(na).to_a[0, window_size/2]
            # append each bin's complex value to that bin's list, one entry per window
            j = 0
            fft_slice.each { |x| fft[j] << x; j += 1 }
        end
    end

rescue => err
    warn "error reading audio file: #{err.message}"
    exit
end

# now I can work on analyzing the "fft" and "wave" arrays...
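
For instance, one possible next step (a sketch only; sample_rate is an assumption about the input file) is to average each bin's magnitude across all windows and then compute an amplitude-weighted mean frequency from the fft array built above:

sample_rate = 44100

# average magnitude of each frequency bin across all windows
avg_magnitudes = fft.map do |bin|
    bin.map { |c| c.abs }.reduce(:+) / bin.length
end

# bin k corresponds to k * sample_rate / window_size Hz
frequencies = (0...window_size / 2).map { |k| k * sample_rate.to_f / window_size }

# amplitude-weighted mean frequency over the whole file
total = avg_magnitudes.reduce(:+)
mean_freq = frequencies.zip(avg_magnitudes).map { |f, m| f * m }.reduce(:+) / total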
Christoffer Reijer
  • That looks about right. +1 for posting your code. I'm glad you got unblocked and could create something that works. BTW, a great way to say thanks on Stack Overflow is to upvote and/or accept an answer, if you haven't done so already. ;) – Randall Cook Feb 27 '13 at 01:37
  • I upvoted your post but had to wait a while before I could accept my own answer. :) – Christoffer Reijer Feb 27 '13 at 08:12
  • @ChristofferBrodd-Reijer your code works great to fingerprint wav files, but the fingerprint is too big. Did you find a solution to improve speed and shrink the fingerprint? – Rafael Fragoso Feb 24 '14 at 12:38
  • Yes, I did. I only did a fingerprint on a small section (3-10 seconds) in the beginning, middle, and end of the song. This proved good enough for solving my problem. – Christoffer Reijer Feb 24 '14 at 19:11