Source, with further explanation: https://dev.to/hiisi13/find-an-audio-within-another-audio-in-10-lines-of-python-1866
First, decode both files to PCM and make sure they share a specific sample rate, which you can choose beforehand (e.g. 16 kHz). You'll need to resample any file that has a different sample rate. A high sample rate is not required, since you need a fuzzy comparison anyway, but too low a sample rate will lose too much detail.
You can use the following ffmpeg commands for that (the -ar option sets the output sample rate, here 16 kHz):
ffmpeg -i audio1.mkv -ar 16000 -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -ar 16000 -c:a pcm_s24le output2.wav
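If there are many files, the decoding step can be scripted. The following is a minimal sketch, assuming ffmpeg is on the PATH; the helper names build_ffmpeg_cmd and decode_to_wav are hypothetical, and the -ar option is used to resample to the 16 kHz rate given as an example above.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src, dst, sample_rate=16000):
    """Return the ffmpeg argument list that decodes src to a PCM WAV file."""
    return [
        "ffmpeg",
        "-y",                      # overwrite dst if it already exists
        "-i", str(src),            # input file (any container ffmpeg can read)
        "-ar", str(sample_rate),   # resample to the chosen common rate
        "-c:a", "pcm_s24le",       # 24-bit little-endian PCM, as in the commands above
        str(dst),
    ]


def decode_to_wav(src, sample_rate=16000):
    """Decode src next to itself as a .wav file and return the new path."""
    dst = Path(src).with_suffix(".wav")
    subprocess.run(build_ffmpeg_cmd(src, dst, sample_rate), check=True)
    return dst
```

Calling decode_to_wav("audio1.mkv") would write audio1.wav in the same directory; check=True makes the call raise if ffmpeg exits with an error.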
Then you can use the following code. It loads the PCM data as floating-point samples (librosa scales them to the range [-1, 1]), computes the cross-correlation of the two signals via FFT, finds its peak, and returns the corresponding offset in seconds:
import argparse

import librosa
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal


def find_offset(within_file, find_file, window):
    # Load the longer recording at its native sample rate.
    y_within, sr_within = librosa.load(within_file, sr=None)
    # Load the clip to search for, resampled to the same rate.
    y_find, _ = librosa.load(find_file, sr=sr_within)

    # Cross-correlate using only the first `window` seconds of the clip.
    c = signal.correlate(y_within, y_find[:sr_within * window], mode='valid', method='fft')
    # The index of the correlation peak is the offset in samples.
    peak = np.argmax(c)
    offset = round(peak / sr_within, 2)

    # Save a plot of the cross-correlation for visual inspection.
    fig, ax = plt.subplots()
    ax.plot(c)
    fig.savefig("cross-correlation.png")

    return offset


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--find-offset-of', metavar='audio file', type=str, required=True, help='Find the offset of file')
    parser.add_argument('--within', metavar='audio file', type=str, required=True, help='Within file')
    parser.add_argument('--window', metavar='seconds', type=int, default=10, help='Only use first n seconds of a target audio')
    args = parser.parse_args()

    offset = find_offset(args.within, args.find_offset_of, args.window)
    print(f"Offset: {offset}s")


if __name__ == '__main__':
    main()
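To see why the peak of the cross-correlation gives the offset, it can help to try the same idea on synthetic data where the answer is known. The following is a minimal sketch, not part of the original script: the signals are random noise, the function name find_offset_samples is hypothetical, and the 16 kHz rate is only the example value from above.

```python
import numpy as np
from scipy import signal


def find_offset_samples(haystack, needle):
    """Return the sample index in haystack where needle correlates best."""
    # mode='valid' only evaluates positions where needle fits entirely
    # inside haystack, so index k of the result means "needle starts at k".
    c = signal.correlate(haystack, needle, mode="valid", method="fft")
    return int(np.argmax(c))


sr = 16000                                   # assumed sample rate in Hz
rng = np.random.default_rng(0)               # fixed seed for repeatability

needle = rng.standard_normal(sr)             # 1 s clip to search for
haystack = 0.1 * rng.standard_normal(5 * sr)  # 5 s of quiet background noise

true_offset = 2 * sr + 4000                  # embed the clip at 2.25 s
haystack[true_offset:true_offset + len(needle)] += needle

found = find_offset_samples(haystack, needle)
offset_seconds = round(found / sr, 2)        # same conversion as in find_offset
print(f"Offset: {offset_seconds}s")
```

At the true offset the clip lines up with its own copy, so the products of samples all add up with the same sign; at any other position they largely cancel, which is what makes the peak stand out.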