
I have a Python project where I have to make sure there are no repeated songs across different folders, even if they have different extensions (I'd prefer to convert all m4a or wav files into mp3, but it's not a must). I don't need to rebuild Shazam or any other machine learning algorithm for this project, since I already have the files.

Things I have tried: compare_mp3 is a library I've found that should be able to do this job for me, but it needs the files in mp3 format, so I tried to convert my files using pydub and kept getting FileNotFoundError even though I've installed ffmpeg.
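A common cause of that FileNotFoundError is that pydub shells out to an ffmpeg binary that is not on PATH, so the error comes from the missing executable rather than the audio file. A minimal sketch that makes the dependency explicit (the file names and the build_convert_cmd/convert helpers are placeholders for illustration, not part of any library):

```python
import shutil
import subprocess

def build_convert_cmd(src: str, dst: str) -> list:
    # ffmpeg overwrites dst (-y) and infers the codecs from the extensions
    return ["ffmpeg", "-y", "-i", src, dst]

def convert(src: str, dst: str) -> None:
    if shutil.which("ffmpeg") is None:
        # the same missing-binary situation that surfaces as
        # FileNotFoundError inside pydub
        raise FileNotFoundError("ffmpeg is not on PATH")
    subprocess.run(build_convert_cmd(src, dst), check=True)

print(build_convert_cmd("song.m4a", "song.mp3"))
```

Checking `shutil.which("ffmpeg")` up front gives a clearer error than letting the subprocess call fail.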

Ahmed Yasser
  • PCM format is the lingua franca of digital audio, and any audio comparison is performed when both files are in that format ... of course this detail may be hidden behind any library which performs an audio diff ... there are many gotchas when doing sound file diffs, e.g. the start of the audio may be offset between different files, so the data diff will need to take this into account – Scott Stensland Feb 12 '21 at 13:46
  • so how can I use this format? Should I convert all my files (1000 files) into some temp folder, keep track of the source files in a dictionary or something, and then compare them together? @ScottStensland – Ahmed Yasser Feb 12 '21 at 14:22
  • that sounds reasonable ... do you need to compare each file with all other files? An N by N comparison? If you have 1000 files that would be 1000 * 1000 file pairs, in which case you might want to first think about alternative strategies ... are all files the same audio duration, or different? If the lengths differ, it will be quicker to only diff files with similar durations – Scott Stensland Feb 12 '21 at 14:53
  • actually I wasn't going to try that 1000*1000 process; I was going to assume there should be some similarity in the names of the songs, so I was only going to compare files that have at least one word in common in their names. Your idea seems very good though, I'll take it into consideration. @ScottStensland – Ahmed Yasser Feb 12 '21 at 15:11
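The name-based pruning discussed in the comments can be sketched in a few lines of pure Python (the exact grouping rule here, sharing at least one name word, is an assumption taken from the comment, not code from the question):

```python
import itertools
import os

def name_words(path):
    # lower-cased words of the file name without its extension
    stem = os.path.splitext(os.path.basename(path))[0]
    return set(stem.lower().replace("-", " ").replace("_", " ").split())

def candidate_pairs(files):
    # only pair files whose names share at least one word,
    # avoiding the full N*N comparison
    for a, b in itertools.combinations(files, 2):
        if name_words(a) & name_words(b):
            yield a, b

files = ["rock/song_one.mp3", "pop/one.m4a", "pop/other.wav"]
print(list(candidate_pairs(files)))  # only the pair sharing the word "one"
```

Filtering by similar duration, as suggested in the comments, could be combined with this in the same generator.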

1 Answer


First you need to decode them into PCM and ensure they share a specific sample rate, which you can choose beforehand (e.g. 16 kHz). You'll need to resample songs that have a different sample rate. A high sample rate is not required since you need a fuzzy comparison anyway, but too low a sample rate will lose too much detail.

You can use the following ffmpeg commands for that (-ar resamples to the chosen rate):

ffmpeg -i audio1.mkv -ar 16000 -c:a pcm_s24le output1.wav
ffmpeg -i audio2.mkv -ar 16000 -c:a pcm_s24le output2.wav
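As a toy illustration of what that resampling step does, here is naive integer-factor decimation in pure Python (real resamplers such as ffmpeg's also apply a low-pass filter first to avoid aliasing, so do not use this on real audio):

```python
def decimate(samples, factor):
    # keep every factor-th PCM sample, e.g. 48 kHz -> 16 kHz with factor 3
    return samples[::factor]

pcm_48k = list(range(12))        # stand-in for 12 samples at 48 kHz
pcm_16k = decimate(pcm_48k, 3)   # 4 samples at 16 kHz
print(pcm_16k)                   # -> [0, 3, 6, 9]
```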

Below is Python code that computes a similarity score for two audio files. It works by generating fingerprints from the audio files and comparing them using cross correlation.

It requires Chromaprint (for the fpcalc tool) and FFmpeg to be installed. It also doesn't work for short audio files; if that is a problem, you can always reduce the speed of the audio like in this guide, but be aware this will add a little noise.
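To make the comparison concrete: each Chromaprint fingerprint value is a 32-bit integer, and two values are scored by XOR-ing them and counting how many bits agree, so identical values score 1.0 and unrelated values hover around 0.5. A standalone illustration of that per-value measure (the bit_similarity helper is mine, for demonstration only):

```python
def bit_similarity(x, y):
    # fraction of the 32 bits that agree between two fingerprint values
    return (32 - bin(x ^ y).count("1")) / 32

print(bit_similarity(0b1010, 0b1010))   # identical values -> 1.0
print(bit_similarity(0, 0xFFFFFFFF))    # all 32 bits differ -> 0.0
```

The script below averages this quantity over an aligned window of fingerprint values, sliding the alignment offset to find the best match.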

# correlation.py
import subprocess
import sys

import numpy

# seconds to sample the audio file for
sample_time = 500
# number of points to scan cross correlation over
span = 150
# step size (in points) of cross correlation
step = 1
# minimum number of points that must overlap in cross correlation;
# an exception is raised if this cannot be met
min_overlap = 20
# report a match when cross correlation has a peak exceeding threshold
threshold = 0.5


# calculate fingerprint
def calculate_fingerprints(filename):
    # quote the filename so paths with spaces survive the shell
    fpcalc_out = subprocess.getoutput('fpcalc -raw -length %i "%s"' % (sample_time, filename))
    fingerprint_index = fpcalc_out.find('FINGERPRINT=') + 12
    # convert fingerprint to a list of integers
    fingerprints = list(map(int, fpcalc_out[fingerprint_index:].split(',')))
    return fingerprints


# returns correlation between lists
def correlation(listx, listy):
    if len(listx) == 0 or len(listy) == 0:
        # error checking in the main program should prevent us from
        # ever getting here
        raise Exception('Empty lists cannot be correlated.')
    if len(listx) > len(listy):
        listx = listx[:len(listy)]
    elif len(listx) < len(listy):
        listy = listy[:len(listx)]

    covariance = 0
    for i in range(len(listx)):
        covariance += 32 - bin(listx[i] ^ listy[i]).count("1")
    covariance = covariance / float(len(listx))
    return covariance / 32


# return cross correlation, with listy offset from listx
def cross_correlation(listx, listy, offset):
    if offset > 0:
        listx = listx[offset:]
        listy = listy[:len(listx)]
    elif offset < 0:
        offset = -offset
        listy = listy[offset:]
        listx = listx[:len(listy)]
    if min(len(listx), len(listy)) < min_overlap:
        # error checking in the main program should prevent us from
        # ever getting here
        raise Exception('Overlap too small: %i' % min(len(listx), len(listy)))
    return correlation(listx, listy)


# cross correlate listx and listy with offsets from -span to span
def compare(listx, listy, span, step):
    if span > min(len(listx), len(listy)):
        # error checking in the main program should prevent us from
        # ever getting here
        raise Exception('span >= sample size: %i >= %i\n' % (span, min(len(listx), len(listy))) + 'Reduce span, reduce crop or increase sample_time.')

    corr_xy = []
    for offset in numpy.arange(-span, span + 1, step):
        corr_xy.append(cross_correlation(listx, listy, offset))
    return corr_xy


# return the index of the maximum value in a list
def max_index(listx):
    max_index = 0
    max_value = listx[0]
    for i, value in enumerate(listx):
        if value > max_value:
            max_value = value
            max_index = i
    return max_index


def get_max_corr(corr, source, target):
    max_corr_index = max_index(corr)
    max_corr_offset = -span + max_corr_index * step
    print("max_corr_index =", max_corr_index, "max_corr_offset =", max_corr_offset)
    # report matches
    if corr[max_corr_index] > threshold:
        print('%s and %s match with correlation of %.4f at offset %i' % (source, target, corr[max_corr_index], max_corr_offset))


def correlate(source, target):
    fingerprint_source = calculate_fingerprints(source)
    fingerprint_target = calculate_fingerprints(target)
    corr = compare(fingerprint_source, fingerprint_target, span, step)
    get_max_corr(corr, source, target)


if __name__ == "__main__":
    # pass the two files to compare on the command line:
    # python correlation.py source.mp3 target.mp3
    correlate(sys.argv[1], sys.argv[2])

Code converted to Python 3 from: https://shivama205.medium.com/audio-signals-comparison-23e431ed2207
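To apply this script to the original task (duplicates scattered across folders), one approach is to collect the audio files, enumerate candidate pairs, and call correlate() on each pair. A sketch of the enumeration step in pure Python (the correlate call itself is left out so this runs without fpcalc; the filter in all_pairs is where a duration or name-word check would go):

```python
import itertools
import os

AUDIO_EXTS = {".mp3", ".m4a", ".wav"}

def collect_audio_files(root):
    # walk all folders under root and keep only audio files
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in AUDIO_EXTS:
                found.append(os.path.join(dirpath, name))
    return sorted(found)

def all_pairs(files):
    # every unordered pair of files; swap in a duration or
    # name-word filter here to cut down the N*N comparisons
    return list(itertools.combinations(files, 2))

print(all_pairs(["a.mp3", "b.wav", "c.m4a"]))
# each pair would then be passed to correlate(source, target)
```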

Alejandro Garcia