
I'm working on a project to compare how similar someone's singing is to the original artist. I'm mostly interested in the pitch of the voice, to see whether they're in tune.

The audio files are in .wav format and I've been able to load them with the wave module and convert them to Numpy arrays. Then I built a frequency and a time vector to plot the signal.

import wave
import numpy as np

# Load the recording and convert the raw frames to a NumPy array
raw_audio = wave.open("myAudio.WAV", "r")
audio = raw_audio.readframes(-1)              # read all frames
signal = np.frombuffer(audio, dtype='int16')
fs = raw_audio.getframerate()                 # sampling frequency in Hz
timeDelta = 1/fs                              # time between consecutive samples

# Get time and frequency vectors
start = 0
end = len(signal)*timeDelta
points = len(signal)

t = np.linspace(start, end, points)
f = np.linspace(0, fs, points)

If I have another signal of the same duration (both land at approximately 5-10 seconds), what would be the best way to compare these two signals for similarity?

I've thought of comparing the frequency domains and using autocorrelation, but I feel that both of those methods have significant drawbacks.


1 Answer


I am faced with a similar problem: evaluating the similarity of two audio signals (one real, one generated by a machine learning pipeline). I have signal parts where the comparison is very time-critical (time differences between peaks representing the arrival of different early reflections), and for those I will try calculating the cross-correlation between the signals (more on that here: https://www.researchgate.net/post/how_to_measure_the_similarity_between_two_signal )
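As an illustration, here is a minimal sketch of such a cross-correlation check with NumPy/SciPy (the noise signal and the 500-sample delay are just stand-ins for two real recordings):

import numpy as np
from scipy import signal as sps

# Stand-in signals: one second of noise and a copy delayed by 500 samples
fs = 44100
rng = np.random.default_rng(0)
sig_a = rng.standard_normal(fs)
delay = 500
sig_b = np.concatenate([np.zeros(delay), sig_a[:-delay]])

# Cross-correlate and find the lag with the strongest correlation
corr = sps.correlate(sig_a, sig_b, mode="full")
lags = sps.correlation_lags(len(sig_a), len(sig_b), mode="full")
best_lag = lags[np.argmax(np.abs(corr))]
print("estimated offset:", abs(best_lag) / fs, "seconds")

# Normalized peak correlation as a rough similarity score (close to 1 for near-copies)
score = np.max(np.abs(corr)) / (np.linalg.norm(sig_a) * np.linalg.norm(sig_b))
print("similarity score:", score)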

Since natural recordings of two different voices will be quite different in the time domain, this would probably not be ideal for your problem.

For signals where frequency information (like pitch and timbre) is of greater interest, I would work in the frequency domain. You can, for example, calculate short-time FFTs (STFT) or the CQT (constant-Q transform, a more musical representation of the spectrum, as it is mapped to octaves) for the two signals and then compare them, for example by calculating the mean squared error (MSE) between corresponding time windows of the two signals. Before transforming, you should of course normalize the signals. STFT, CQT and normalization can easily be done and visualized with librosa

see here: https://librosa.org/doc/latest/generated/librosa.util.normalize.html

here: https://librosa.org/doc/latest/generated/librosa.cqt.html?highlight=cqt

here: https://librosa.org/doc/latest/generated/librosa.stft.html

and here: https://librosa.org/doc/main/generated/librosa.display.specshow.html
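
For illustration, a small sketch of that workflow with librosa (the file names and the n_fft/hop_length values are only placeholders, see point 1 below):

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load and peak-normalize the two recordings (librosa resamples both to the same rate)
y1, sr = librosa.load("vocal_1.wav", sr=22050)   # placeholder file names
y2, _  = librosa.load("vocal_2.wav", sr=22050)
y1 = librosa.util.normalize(y1)
y2 = librosa.util.normalize(y2)

# Short-time Fourier transform (magnitude spectrograms)
S1 = np.abs(librosa.stft(y1, n_fft=2048, hop_length=512))
S2 = np.abs(librosa.stft(y2, n_fft=2048, hop_length=512))

# Constant-Q transform as a more musical (octave-mapped) alternative
C1 = np.abs(librosa.cqt(y1, sr=sr))
C2 = np.abs(librosa.cqt(y2, sr=sr))

# Visual check of one of the spectrograms
img = librosa.display.specshow(librosa.amplitude_to_db(S1, ref=np.max),
                               sr=sr, hop_length=512, x_axis="time", y_axis="log")
plt.colorbar(img, format="%+2.0f dB")
plt.show()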

Two things about this approach:

  1. Don't make the time windows of your STFTs too short. Spectra of human voices start somewhere in the hundred-hertz range (https://av-info.eu/index.html?https&&&av-info.eu/audio/speech-level.html here 350 Hz is given as the low end). So the number of samples in (or length of) your STFT time windows should be at least:

    (1 / 350 Hz) * sampling frequency

    So if your recordings have 44100 Hz sampling frequency, your time window must be at least

    (1 / 350 Hz) * 44100 Hz = 0.002857... sec * 44100 samples / second = 126 samples long.

    Make it 128, that's a nicer number. That way you guarantee that a sound wave with a fundamental frequency of 350 Hz can still be "seen" for at least one full period in a single window. Of course, bigger windows will give you a more exact spectral representation.

  2. Before transforming, you should make sure that the two signals you are comparing represent the same sound events at the same time. So all of this doesn't work if the two singers didn't sing the same thing, or not at the same speed, or if there are different background noises in the signals. Provided that you have dry recordings of only the voices and these voices sing the same thing at the same speed, you just need to make sure that the signals' starts align (one simple way to do this is sketched right after this list). In general, you need to make sure that sound events (e.g. transients, silence, notes) align: when there is a long AAAH-sound in one signal, there should also be a long AAAH-sound in the other signal. You can make your evaluation somewhat more robust by increasing the STFT windows even further; this reduces time resolution (you will get fewer spectral representations of the signals), but more sound events are evaluated together in one time window.
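
One simple way to align the starts, for example, is to trim leading and trailing silence before computing the spectrograms (top_db is just an example threshold; y1 and y2 are the normalized signals from the sketch above):

import librosa

# Strip leading/trailing silence so both recordings start at the first sound event
y1_trimmed, _ = librosa.effects.trim(y1, top_db=30)
y2_trimmed, _ = librosa.effects.trim(y2, top_db=30)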

You could of course just generate one FFT for each signal over its entire length, but the results will be more meaningful if you generate STFTs or CQTs (or some other transform better suited to human hearing) over equal-length, short time windows, then calculate the MSE for each pair of time windows (first time window of signal 1 and first window of signal 2, then the second window pair, then the third, and so on).
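
A minimal sketch of that window-by-window comparison, assuming S1 and S2 are the two magnitude spectrograms from the earlier sketch (computed with identical parameters; truncating to the shorter one is just a simple way of handling slightly different lengths):

import numpy as np

# S1, S2: magnitude spectrograms of shape (n_bins, n_frames), identical STFT parameters
n_frames = min(S1.shape[1], S2.shape[1])      # truncate to the shorter recording
S1c, S2c = S1[:, :n_frames], S2[:, :n_frames]

# MSE per time window: first frame of signal 1 vs first frame of signal 2, and so on
mse_per_window = np.mean((S1c - S2c) ** 2, axis=0)

# One overall (dis)similarity number: lower means more similar spectra
print("mean per-window MSE:", mse_per_window.mean())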

Hope this helps.

  • Thanks for the great information. I'm just wondering, since we're directly measuring the error at each point in time, wouldn't this approach be absolutely sensitive to the slightest difference in timing? So if one of the singers is off by 20ms, the entire song is thus phased to the right and we will get bad results? What do you think about incorporating dynamic time warping? – ela16 Aug 07 '22 at 11:27
  • Also, the similarity of two singing voices could depend on several audio features (timbre, tone, etc.), how do you differentiate between the different features? And how can you compare the similarity for each of the features (e.g. when the two voices are very similar in one feature but dissimilar in another)? – ela16 Aug 07 '22 at 11:30
  • Hello :) On your first comment: Yes, in the time domain even an exact copy of a recording that is just slightly out of phase (a slight time delay, as you describe) would lead to differences when comparing the two. Imagine a sine wave with 1 Hz frequency. If you play it twice and the second playback is delayed by 0.5 s (exactly half the wavelength of the sine wave), the two waves cancel each other out and you would get errors of up to 2 * max_amplitude of the waves. However, in the frequency domain the signals would be exactly the same (only one frequency component, at 1 Hz) no matter the phase difference (a small numpy sketch of this follows below). – 2FingerTyper Aug 19 '22 at 09:37
  • On your second comment: This is actually a very interesting problem. I would probably try to compare the different features separately and then form an overall estimate of similarity by weighting the different metrics depending on which features I deem most important. For example, I would try to estimate similarities in timbre (by evaluating frequency components), maybe estimate the differences in pitch in another metric, and maybe even form another metric that compares the onsets of the notes and transients. Then I can form a mean of these three metrics (weighted equally or in different proportions). – 2FingerTyper Aug 19 '22 at 09:47
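
As an illustration of the phase argument in the comment above, a small numpy sketch (the 1 Hz sine and the 0.5 s delay are taken from the example; the printed numbers are approximate):

import numpy as np

fs = 1000                                  # samples per second
t = np.arange(0, 2.0, 1 / fs)              # two full periods of a 1 Hz sine
a = np.sin(2 * np.pi * 1.0 * t)
b = np.sin(2 * np.pi * 1.0 * (t - 0.5))    # same sine, delayed by half a period

# Time domain: the delayed copy is the negated signal, so the error is large
print("time-domain MSE:", np.mean((a - b) ** 2))           # about 2.0

# Frequency domain: the magnitude spectra are identical, the phase difference vanishes
A, B = np.abs(np.fft.rfft(a)), np.abs(np.fft.rfft(b))
print("magnitude-spectrum MSE:", np.mean((A - B) ** 2))    # about 0.0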