I'm looking to write a function that takes an audio signal (assumed to contain a single instrument playing) and extracts instrument-like (timbral) features from the audio into a vector space. So in theory, if I had two signals with similar-sounding instruments (such as two pianos), their respective vectors should be fairly similar (by Euclidean distance, cosine similarity, etc.). How would one go about doing this?
What I've tried: I'm currently extracting (and temporally averaging) the chroma energy, the spectral contrast, the MFCCs (plus their 1st and 2nd derivatives), and the mel spectrogram, then concatenating them into a single representation vector:
import librosa
import torch
import torchaudio


# expects a torch tensor (dimensions: [1, num_samples],
# similar to torchaudio.load() output).
# assume all signals contain a constant number of samples and are sampled at 44.1 kHz
def extract_instrument_features(signal, sr):
    # define hyperparameters:
    FRAME_LENGTH = 1024
    HOP_LENGTH = 512
    # librosa expects a 1-D numpy array, so drop the channel dimension:
    signal_np = signal.squeeze(0).numpy()
    # compute and temporally average the chroma energy (CENS):
    ce = librosa.feature.chroma_cens(y=signal_np, sr=sr, hop_length=HOP_LENGTH)
    ce = torch.from_numpy(ce).float().mean(dim=1)
    # compute and temporally average the spectral contrast:
    spc = librosa.feature.spectral_contrast(y=signal_np, sr=sr,
                                            n_fft=FRAME_LENGTH, hop_length=HOP_LENGTH)
    spc = torch.from_numpy(spc).float().mean(dim=1)
    # extract the MFCCs and their first & second derivatives
    # (the deltas are computed on the numpy array, before converting to a tensor):
    mfcc = librosa.feature.mfcc(y=signal_np, sr=sr, n_mfcc=13,
                                n_fft=FRAME_LENGTH, hop_length=HOP_LENGTH)
    mfcc_1st = librosa.feature.delta(mfcc)
    mfcc_2nd = librosa.feature.delta(mfcc, order=2)
    # temporal averaging of the MFCCs:
    mfcc = torch.from_numpy(mfcc).float().mean(dim=1)
    mfcc_1st = torch.from_numpy(mfcc_1st).float().mean(dim=1)
    mfcc_2nd = torch.from_numpy(mfcc_2nd).float().mean(dim=1)
    # define the mel spectrogram transform:
    mel_spectrogram = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr,
        n_fft=FRAME_LENGTH,
        hop_length=HOP_LENGTH,
        n_mels=64
    )
    # extract the mel spectrogram and average over time
    # (output shape is [1, n_mels, num_frames], so time is the last axis):
    ms = mel_spectrogram(signal)
    ms = torch.mean(ms, dim=-1)[0]
    # concatenate and return the feature vector:
    features = [ce, spc, mfcc, mfcc_1st, mfcc_2nd, ms]
    return torch.cat(features)
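
For reference, this is roughly how I intend to compare two extracted vectors (a minimal sketch; the .wav file names are just placeholders, and I'm assuming both files are mono and 44.1 kHz as stated above):

import torch
import torchaudio

# load two recordings of similar-sounding instruments (placeholder file names):
sig_a, sr_a = torchaudio.load("piano_a.wav")
sig_b, sr_b = torchaudio.load("piano_b.wav")
vec_a = extract_instrument_features(sig_a, sr_a)
vec_b = extract_instrument_features(sig_b, sr_b)
# cosine similarity (closer to 1.0 = more similar) and Euclidean (L2) distance:
similarity = torch.nn.functional.cosine_similarity(vec_a, vec_b, dim=0)
distance = torch.dist(vec_a, vec_b, p=2)
print(f"cosine similarity: {similarity.item():.3f}, L2 distance: {distance.item():.3f}")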