
My friend Prasad Raghavendra and I were experimenting with machine learning on audio.

We were doing it to learn and to explore interesting possibilities for upcoming get-togethers. I wanted to see whether deep learning, or any machine learning, could be fed audio clips that had been rated by humans (an evaluation task).

To our dismay, we found that the problem had to be split up to keep the input dimensionality manageable. So we decided to discard the vocals and assess only the accompaniment, under the assumption that vocals and instruments are always correlated.

We looked for an mp3/wav-to-MIDI converter. Unfortunately, the ones on SourceForge and GitHub handle only a single instrument, and the other options are paid (Ableton Live, Fruity Loops, etc.). So we decided to treat this as a sub-problem.

We thought of using an FFT, band-pass filters and a moving window to handle this; a rough sketch of that idea is below.
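This is roughly what we have in mind, in Python (a sketch only: the file name is a placeholder, a mono 16-bit WAV is assumed, and the frame sizes are arbitrary):

```python
# Sketch of the moving-window FFT (STFT) plus crude band-pass idea.
# "song.wav" is a placeholder; a mono 16-bit WAV file is assumed.
import numpy as np
from scipy import signal
from scipy.io import wavfile

rate, audio = wavfile.read("song.wav")
audio = audio.astype(np.float64)

# Short-time Fourier transform: a Hann window slides over the signal,
# giving one FFT per ~46 ms frame (2048 samples at 44.1 kHz).
freqs, times, Zxx = signal.stft(audio, fs=rate, window="hann",
                                nperseg=2048, noverlap=1024)
magnitude = np.abs(Zxx)                 # shape: (freq_bins, time_frames)

# Crude band-pass: keep only bins between 80 Hz and 5 kHz.
band = (freqs >= 80) & (freqs <= 5000)
magnitude_band = magnitude[band, :]
```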

But we do not understand how to go about separating the instruments when chords are played and there are 5-6 instruments in the file.

  1. What algorithms should I look into?

  2. My friend can play keyboard, so I will be able to record MIDI data. But are there any datasets meant for this?

  3. How many instruments can these algorithms detect?

  4. How do we split the audio? We do not have multiple recordings or the mixing matrix.

  5. We were also thinking of learning the patterns of the accompaniment and using that accompaniment in real time while singing along. I guess we will be able to think about that once we have answers to 1, 2, 3 and 4. (We are considering both chord progressions and Markovian dynamics; a toy sketch of the Markov idea follows this list.)
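To show the shape of the Markov idea in item 5: a first-order transition matrix estimated by counting chord-to-chord moves. The chord names and the progression here are made up purely for illustration.

```python
# Toy first-order Markov model over chords; the progression is made up.
import numpy as np

chords = ["C", "Am", "F", "G"]
index = {c: i for i, c in enumerate(chords)}
progression = ["C", "Am", "F", "G", "C", "F", "G", "C"]

# Count transitions between consecutive chords.
counts = np.zeros((len(chords), len(chords)))
for current, nxt in zip(progression, progression[1:]):
    counts[index[current], index[nxt]] += 1

# Row-normalise the counts into transition probabilities P(next | current).
transitions = counts / counts.sum(axis=1, keepdims=True)
print(transitions[index["G"]])   # distribution over chords that follow G
```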

Thanks for all the help!

P.S.: We also tried an FFT and we are able to see some harmonics. Is that due to the sinc() in the FFT when a rectangular wave is the input in the time domain? Can that be used to determine timbre?

[Figure: FFT of the signals considered]
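A small toy check on the sinc question (sketch only; the 443 Hz tone and frame length are arbitrary): with a rectangular window, a tone that does not land exactly on an FFT bin smears into sinc-shaped sidelobes, whereas real instrument harmonics sit at integer multiples of the fundamental.

```python
# Rectangular vs. Hann window on a pure tone that falls between FFT bins.
import numpy as np

rate = 44100
n = 22050                                    # 0.5 s of audio, 2 Hz bin spacing
t = np.arange(n) / rate
sine = np.sin(2 * np.pi * 443 * t)           # 443 Hz: deliberately off-bin

rect_spec = np.abs(np.fft.rfft(sine))                  # rectangular window
hann_spec = np.abs(np.fft.rfft(sine * np.hanning(n)))  # Hann window
freqs = np.fft.rfftfreq(n, d=1.0 / rate)

peak = np.argmax(rect_spec)
print(freqs[peak])                              # ~443 Hz
print(rect_spec[peak + 10] / rect_spec[peak])   # noticeable sinc sidelobes
print(hann_spec[peak + 10] / hann_spec[peak])   # far smaller with a Hann window
```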

We were able to formulate the problem roughly, but we are still finding it difficult to pin it down. If we work in the frequency domain, the instruments are indistinguishable at a given frequency: a trombone playing 440 Hz and a guitar playing 440 Hz have the same frequency and differ only in timbre, and we still do not know how to determine timbre (one crude idea is sketched below). So we decided to work in the time domain, considering notes. If a note crosses into another octave, we would use a separate dimension for that: +1 for the next octave, 0 for the current octave and -1 for the previous octave.
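The crude timbre idea (our own assumption, not an established method): for a note of known pitch, measure the relative strengths of its first few harmonics; two instruments playing the same pitch differ mainly in this profile.

```python
# Relative harmonic strengths of one short frame as a rough timbre descriptor.
import numpy as np

def harmonic_profile(frame, rate, f0=440.0, n_harmonics=8):
    """Return amplitudes of f0, 2*f0, ... normalised by the fundamental."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)
    amps = []
    for k in range(1, n_harmonics + 1):
        bin_idx = np.argmin(np.abs(freqs - k * f0))   # nearest FFT bin
        amps.append(spectrum[bin_idx])
    amps = np.array(amps)
    return amps / (amps[0] + 1e-12)
```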

If notes are represented by letters such as 'A', 'B', 'C', etc., then the problem reduces to one of mixing matrices.

O = M I during training.
M is the mixing matrix that has to be estimated from the known output O and the input I taken from the MIDI file.

During prediction, though, M must be replaced by a probability matrix P, which would be generated from the previously estimated M matrices.

The problem then reduces to I_predicted = P^-1 O. The error to minimise would be the LMSE (least mean squared error) between I_predicted and the true I. We could use a DNN to adjust P via back-propagation (a toy sketch of the training step follows).
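A toy version of the training step described above (the shapes are made up, and plain least squares stands in for the DNN just to show the O = M I relationship):

```python
# Estimate the mixing matrix M from known note activations I and output O.
import numpy as np

n_notes, n_frames = 12, 500
I = np.random.rand(n_notes, n_frames)          # note activations from MIDI
M_true = np.random.rand(n_notes, n_notes)      # "unknown" mixing matrix
O = M_true @ I                                 # observed output, O = M I

# Least squares: solve I.T @ M.T = O.T for M.T, then transpose back.
M_est, *_ = np.linalg.lstsq(I.T, O.T, rcond=None)
M_est = M_est.T
print(np.allclose(M_est, M_true))              # True in this noiseless toy case
```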

But this approach assumes that the notes 'A', 'B', 'C', etc. are already known. How do we detect them instantaneously, or within a small window such as 0.1 seconds? Template matching may not work because of the harmonics. Any suggestions would be much appreciated (one quick idea we might try is sketched below).
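One quick experiment for the note-detection step (a sketch only: it finds a single dominant pitch per frame, not polyphony): the harmonic product spectrum, which multiplies decimated copies of the spectrum so the fundamental stands out even when higher harmonics are louder.

```python
# Harmonic product spectrum pitch estimate for one ~0.1 s frame.
import numpy as np

def hps_pitch(frame, rate, n_products=4):
    """Estimate the dominant fundamental frequency of a short frame."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    hps = spectrum.copy()
    for h in range(2, n_products + 1):
        decimated = spectrum[::h]              # spectrum resampled at h * f
        hps[:len(decimated)] *= decimated
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)
    limit = len(spectrum) // n_products        # bins multiplied by every copy
    return freqs[np.argmax(hps[1:limit]) + 1]  # skip the DC bin
```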

Akshay Rathod
    Polyphonic decomposition seems to still be research topic. There have been hundreds of research papers on this topic presented at this conference: http://www.music-ir.org/mirex/wiki/MIREX_HOME – hotpaw2 Oct 19 '17 at 20:15
  • @hotpaw2 Thanks! How do I begin? Can I assume a varying tempo with multiple instruments? That may be physically impossible to distinguish (given the harmonics). But if I trivialise it so that, say, none of the instruments overlap in notes or with each other, the problem loses its significance. How do I construct the input and output for this? I have read about 'classes of instruments' classified manually, but I would like to automate the entire process (it will probably use LMS in the evaluation function). Any insights will be appreciated! – Akshay Rathod Oct 20 '17 at 17:23
  • What assumptions are needed for the first “baby steps” may themselves be your first research problem. – hotpaw2 Oct 20 '17 at 17:33
  • This is a current, open area of research that is incredibly complex, and may not even be reliably possible. Rethink why you need this for your original problem, or be prepared to spend many years just working on this. – Linuxios Oct 23 '17 at 19:56
  • @Linuxios Thanks. That is what I came to understand. Every time I try to find a solution, more questions pop up. But I am learning a lot of deep learning by trying. Thank you again – Akshay Rathod Oct 24 '17 at 05:10

2 Answers


Splitting out the different parts is a machine learning problem all its own. Unfortunately, you can't look at this problem in audio land only. You must consider the music.

You need to train something to understand musical patterns and progressions in the context of the type of music you give it. It needs to understand what the different instruments sound like, both mixed and not mixed. It needs to understand how these instruments are often played together, if it's going to have any chance at all at separating what's going on.

This is a very, very difficult problem.

Brad
  • Thank you. Yes, I know it is a hard problem. But how do I get a baby-step start (maybe with two instruments: one playing melody and a guitar playing chords)? Humans seem to recognise many instruments and can approximate chords like E7 to E. We are really interested, as we will learn a lot about machine learning and will find the right friends too. Thanks again – Akshay Rathod Oct 20 '17 at 02:52
  • @AkshayRathod You might be surprised about what humans can and can't do, and how much musical context matters. I could play the top half of a CMaj chord, but if I was previously playing an A in the bass, you're going to hear that as Amin7. Good luck getting musical context out of FFT. :-) There are even situations where humans hear things that don't exist. Play a downward scale on a trombone, all the way to the lowest it will go. For the last 5 or 6 notes, there will be no fundamental! Humans will perceive it though due to musical context. – Brad Oct 20 '17 at 03:00
  • @AkshayRathod Depending on what you're trying to do, you're probably better off getting stem files for songs you want to analyze. – Brad Oct 20 '17 at 03:01
  • These insights are really helpful. I am primarily interested in splitting a wav/mp3 file (I will think about chord progressions later). If the FFT is a bad idea (I have seen in research papers that people go for time-series belief networks instead), how can I do a quick hack in, say, Python for two or three instruments with polyphonic sounds and varying tempo? Should the input be the MIDI stem file? I guess the machine-learning input will be a 2D array and the output a 2D array, with instruments x notes as dimensions. The 2D arrays would be merged for the final output. Is that correct? Many thanks – Akshay Rathod Oct 20 '17 at 03:15

This is a very hard problem, mainly because converting audio to pitch isn't simple: Nyquist aliasing folds harmonics above 22 kHz back down, and saturators, distortion and other analogue equipment introduce extra harmonics of their own.

The fundamental isn't always the loudest harmonic, which is why your plan will not work (see the small demonstration below).

The hardest thing to measure would be a distorted guitar; the harmonics some pedals/plugins can generate are crazy.
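A tiny synthetic illustration of that point (my own toy example, not from the question): a tone whose second harmonic is louder than its fundamental makes a naive "loudest FFT bin" detector report the wrong octave.

```python
# 440 Hz fundamental at low level, 880 Hz second harmonic at high level.
import numpy as np

rate = 44100
t = np.arange(rate) / rate                     # one second of audio
tone = 0.3 * np.sin(2 * np.pi * 440 * t) + 0.9 * np.sin(2 * np.pi * 880 * t)

spectrum = np.abs(np.fft.rfft(tone * np.hanning(len(tone))))
freqs = np.fft.rfftfreq(len(tone), d=1.0 / rate)
print(freqs[np.argmax(spectrum)])              # ~880 Hz, an octave above the true pitch
```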

Definity