Big picture: I am trying to detect proxy fraud in video interviews.
I have video clips of interviews, and each person has two or more interviews. As a first step, I am extracting the audio from the interviews and trying to determine whether two audio tracks come from the same person.
I used the Python library librosa to load the audio files and compute MFCC and chroma_cqt features for them. I also built a similarity matrix between the files. I want to convert this similarity matrix into a score between 0 and 100, where 100 is a perfect match and 0 is totally different. After that, I can choose a threshold and assign labels to the audio files.
Code:
import librosa
hop_length = 1024
# librosa.load resamples to 22050 Hz by default, so sr1 and sr2 are equal
y_ref, sr1 = librosa.load(r"audio1.wav")
y_comp, sr2 = librosa.load(r"audio2.wav")
chroma_ref = librosa.feature.chroma_cqt(y=y_ref, sr=sr1, hop_length=hop_length)
chroma_comp = librosa.feature.chroma_cqt(y=y_comp, sr=sr2, hop_length=hop_length)
# y and sr must be passed as keywords in librosa 0.10 and later
mfcc1 = librosa.feature.mfcc(y=y_ref, sr=sr1, n_mfcc=13)
mfcc2 = librosa.feature.mfcc(y=y_comp, sr=sr2, n_mfcc=13)
# Use time-delay embedding to get a cleaner recurrence matrix
x_ref = librosa.feature.stack_memory(chroma_ref, n_steps=10, delay=3)
x_comp = librosa.feature.stack_memory(chroma_comp, n_steps=10, delay=3)
# Cross-similarity of the stacked chroma features; mfcc1 and mfcc2 are not used here
sim = librosa.segment.cross_similarity(x_comp, x_ref, metric='cosine')
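One possible way to get a single 0 to 100 number, shown only as a minimal sketch and not as a validated speaker-verification method: summarize each feature matrix over time and compare the summaries with cosine similarity. The helper names `summarize` and `similarity_score` are hypothetical, and the mean-plus-standard-deviation summary and the linear mapping from [-1, 1] to [0, 100] are assumptions made for illustration. This swaps the cross-similarity matrix for a simpler per-file comparison, because the two files generally have different numbers of frames.

```python
import numpy as np

def summarize(features):
    """Collapse a (n_features, n_frames) matrix into one vector
    holding the per-feature mean followed by the per-feature std."""
    features = np.asarray(features, dtype=float)
    return np.concatenate([features.mean(axis=1), features.std(axis=1)])

def similarity_score(feat_a, feat_b):
    """Cosine similarity of the two summary vectors, mapped linearly
    from [-1, 1] to [0, 100]. The mapping is an illustrative assumption."""
    a, b = summarize(feat_a), summarize(feat_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # assumption: an all-zero summary counts as "no match"
    cos = float(np.dot(a, b) / denom)
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point overshoot
    return 50.0 * (1.0 + cos)
```

With the variables from the code above, this would be called as `similarity_score(mfcc1, mfcc2)`. Scores from raw MFCC statistics may cluster near the top of the range for any two speech recordings, so a threshold would have to be chosen empirically from interview pairs whose labels are already known.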