I'm looking to write a function that takes an audio signal (assumed to contain a single instrument playing) and extracts instrument-like (timbral) features from the audio into a vector space. So in theory, if I had two signals with similar-sounding instruments (such as two pianos), their respective vectors should be fairly similar (by Euclidean distance, cosine similarity, etc.). How would one go about doing this?
What I've tried: I'm currently extracting (and temporally averaging) the chroma energy, the spectral contrast, the MFCCs (plus their 1st and 2nd derivatives), and the mel spectrogram, then concatenating them into a single representation vector:
import librosa
import torch
import torchaudio


# expects a torch tensor (dimensions: [1, num_samples],
# similar to torchaudio.load() output).
# assume all signals contain a constant number of samples and are sampled at 44.1 kHz
def extract_instrument_features(signal, sr):
    # define hyperparameters:
    FRAME_LENGTH = 1024
    HOP_LENGTH = 512
    # librosa expects a 1-D numpy array, so drop the channel dimension:
    signal_np = signal.squeeze(0).numpy()
    # compute and temporally average the chroma energy (CENS):
    ce = librosa.feature.chroma_cens(y=signal_np, sr=sr, hop_length=HOP_LENGTH)
    ce = torch.from_numpy(ce).float().mean(dim=1)
    # compute and temporally average the spectral contrast:
    spc = librosa.feature.spectral_contrast(y=signal_np, sr=sr,
                                            n_fft=FRAME_LENGTH, hop_length=HOP_LENGTH)
    spc = torch.from_numpy(spc).float().mean(dim=1)
    # extract the MFCCs and their first & second derivatives
    # (the deltas are computed on the numpy array, before converting to a tensor):
    mfcc = librosa.feature.mfcc(y=signal_np, sr=sr, n_mfcc=13,
                                n_fft=FRAME_LENGTH, hop_length=HOP_LENGTH)
    mfcc_1st = librosa.feature.delta(mfcc)
    mfcc_2nd = librosa.feature.delta(mfcc, order=2)
    # temporal averaging of the MFCCs:
    mfcc = torch.from_numpy(mfcc).float().mean(dim=1)
    mfcc_1st = torch.from_numpy(mfcc_1st).float().mean(dim=1)
    mfcc_2nd = torch.from_numpy(mfcc_2nd).float().mean(dim=1)
    # define the mel spectrogram transform:
    mel_spectrogram = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr,
        n_fft=FRAME_LENGTH,
        hop_length=HOP_LENGTH,
        n_mels=64
    )
    # extract the mel spectrogram and average over time
    # (output shape is [1, n_mels, num_frames], so time is the last axis):
    ms = mel_spectrogram(signal)
    ms = torch.mean(ms, dim=-1)[0]
    # concatenate and return the feature vector:
    features = [ce, spc, mfcc, mfcc_1st, mfcc_2nd, ms]
    return torch.cat(features)
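
For reference, this is roughly how I intend to compare two extracted vectors (a minimal sketch; the .wav file names are just placeholders, and I'm assuming both files are mono and 44.1 kHz as stated above):

import torch
import torchaudio

# load two recordings of similar-sounding instruments (placeholder file names):
sig_a, sr_a = torchaudio.load("piano_a.wav")
sig_b, sr_b = torchaudio.load("piano_b.wav")
vec_a = extract_instrument_features(sig_a, sr_a)
vec_b = extract_instrument_features(sig_b, sr_b)
# cosine similarity (closer to 1.0 = more similar) and Euclidean (L2) distance:
similarity = torch.nn.functional.cosine_similarity(vec_a, vec_b, dim=0)
distance = torch.dist(vec_a, vec_b, p=2)
print(f"cosine similarity: {similarity.item():.3f}, L2 distance: {distance.item():.3f}")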