The OP's problem can be summarized as follows:
In the generalized audio stream of a video, try to detect "music" versus "everything else".
Where "music" is not likely to exist in fingerprint databases.
And where "everything else" in this context must include:
- speech
- silence
- synthetic sounds
- Foley sounds (explosions, gunshots, footfalls, etc.)
We must also assume that the audio soundtrack of a generalized video is highly processed with echo, reverb, multichannel panning, etc.
In the general video case, all of the above audio elements would be mixed together into the final audio, making the problem domain absolutely immense.
This is a very challenging problem, with most likely no simple or robust solution.
In support of this premise: even a general music classifier (let's call it MuCLAS), where the unknown music sample is guaranteed to be a member of the classifier's training set, is a very difficult problem, due to the significant expense involved in creating the training set and in building and tuning the classifier index.
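To make the classifier side of this concrete, here is a minimal illustrative sketch of one low-level feature such a system might compute: spectral flatness, which tends to be low for tonal (often musical) audio and high for noise-like audio. This is a hypothetical toy, not the OP's solution, and nowhere near a real MuCLAS; a production system would combine many features over long windows with a trained model.

```python
import numpy as np

def spectral_flatness(frame, eps=1e-10):
    """Geometric mean / arithmetic mean of the power spectrum.
    Near 1.0 for noise-like audio, near 0.0 for tonal audio."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + eps
    return np.exp(np.mean(np.log(power))) / np.mean(power)

def frame_features(signal, frame_len=2048, hop=1024):
    """Per-frame spectral flatness for a mono signal."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([spectral_flatness(f * window) for f in frames])

# Toy comparison: a steady 440 Hz tone (tonal) vs. white noise
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
noise = np.random.default_rng(0).standard_normal(sr)

# The tone's flatness should be far below the noise's flatness
assert frame_features(tone).mean() < frame_features(noise).mean()
```

Even this trivial feature illustrates the core difficulty: a heavily mixed soundtrack (speech over music over Foley) lands somewhere in the middle of the flatness range, and no single threshold separates the classes.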
But the OP's problem domain is much larger than the MuCLAS problem domain, due to the much higher entropy of the OP's unknown data set. This implies much higher complexity and cost, relative to MuCLAS.
Another argument supporting the above premise is that the state of the art in general speech recognition assumes, and insists upon, much lower entropy in the unknown data set than the entropy implied by the OP's data set.
And speech recognition is one of the best funded problems in the general field of autonomous pattern recognition.