Compare two audio files with persons speaking and compute the similarity score

Question

Big picture: I am trying to identify proxy fraud in video interviews.

I have video clips of interviews, and each person has two or more interviews. As a first step, I am extracting the audio from the interviews and trying to match the clips to determine whether the audio comes from the same person.

I used the Python library librosa to parse the audio files and generate MFCC and chroma_cqt features for them. I also created a cross-similarity matrix for the two files. I want to convert this similarity matrix into a score from 0 to 100, where 100 is a perfect match and 0 is totally different. After that, I can choose a threshold and assign labels to the audio files.

Code:

import librosa

hop_length = 1024
y_ref, sr1 = librosa.load(r"audio1.wav")
y_comp, sr2 = librosa.load(r"audio2.wav")
chroma_ref = librosa.feature.chroma_cqt(y=y_ref, sr=sr1, hop_length=hop_length)
chroma_comp = librosa.feature.chroma_cqt(y=y_comp, sr=sr2, hop_length=hop_length)

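# MFCC features; y and sr are passed by keyword because recent librosa versions reject positional arguments here
mfcc1 = librosa.feature.mfcc(y=y_ref, sr=sr1, n_mfcc=13)
mfcc2 = librosa.feature.mfcc(y=y_comp, sr=sr2, n_mfcc=13)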


# Use time-delay embedding to get a cleaner recurrence matrix
x_ref = librosa.feature.stack_memory(chroma_ref, n_steps=10, delay=3)
x_comp = librosa.feature.stack_memory(chroma_comp, n_steps=10, delay=3)

sim = librosa.segment.cross_similarity(x_comp, x_ref, metric='cosine')

I have never used it, but 'vosk' has a speaker recognition model. https://github.com/alphacep/vosk-api — ruff09, Sep 28 '22 at 08:31
What does proxy fraud mean exactly? Is it that a person is trying to pass for someone else? — Jon Nordby, Oct 14 '22 at 16:44
@JonNordby Someone else provides the voice-over in the interview while the actual candidate just lip-syncs to it. — The6thSense, Oct 17 '22 at 08:54
Then it sounds like what you should do is compare the audio with the video, i.e. detect the lip-syncing itself? This could be challenging under varying video/audio transmission conditions, but with a suitable dataset it is something deep learning should be able to solve (assuming that humans can solve it). — Jon Nordby, Oct 17 '22 at 11:06
@JonNordby I started with comparing audio because I thought detecting lip-syncing would be very complex and I might need to create a model from scratch. — The6thSense, Oct 17 '22 at 12:00
Yes it is a rather big undertaking. So starting somewhere else may make sense :) — Jon Nordby, Oct 17 '22 at 15:12

score 1 · Accepted Answer · answered Oct 14 '22 at 16:54

The task of identifying who is talking is called Speaker Identification. Checking whether two audio clips have the same speaker is called Speaker Verification. If there are multiple speakers in a dialog, then it may also be relevant to do Speaker Diarization, that is, finding out who talks when. That would let you focus on the interview subject and not the interviewer.

Speaker recognition tasks like these are best solved with a deep neural network, as it is quite a difficult task to separate the speaker from the words that are spoken. The models generally output a speaker embedding: a vector representation in which clips from the same person lie close together and clips from different people lie far apart. One can then apply a simple similarity metric to this representation, such as cosine distance.

There are pretrained models available for this, for example in pyannote-audio and in SpeechBrain.
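
To connect this to the 0 to 100 score asked for in the question, here is a minimal sketch of the final comparison step. It assumes you already have one fixed-length embedding per audio clip from some pretrained speaker model; the function names and the linear mapping from cosine similarity to a 0 to 100 scale are my own illustrative choices, not part of any library. The embedding extraction itself is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def similarity_score(a, b):
    """Map cosine similarity from [-1, 1] linearly onto [0, 100].

    The linear mapping is an assumption made for illustration only.
    """
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return 50.0 * (cos + 1.0)


# Hypothetical embeddings; real ones would come from a speaker model
# and typically have a few hundred dimensions.
emb_interview_1 = [0.12, -0.40, 0.88, 0.05]
emb_interview_2 = [0.10, -0.35, 0.90, 0.01]
print(similarity_score(emb_interview_1, emb_interview_2))
```

The threshold that separates "same speaker" from "different speaker" is not something this mapping gives you; it would have to be chosen on interview pairs where you already know the answer.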
