Audio analysis for voice, gender diarization/recognition

Question

Does anyone know of a library, program, project, etc. that tries to determine how many speakers are active in an audio file, label each speaker, label each speaker's gender, etc.?

So far I found the following:

Have you checked out `Project Oxford`, part of `Microsoft Cognitive Services`? They've produced an `Emotion` SDK, a `Speaker Recognition` SDK, etc. That might get you started. — brandall, Apr 20 '16 at 13:37
@Aley please tell us what worked for you. I tried pyAudioAnalysis but it fails miserably at separating female-female and male-female speakers. — DJ_Stuffy_K, Jan 31 '18 at 19:21

score 1 · Answer 1 · answered Apr 28 '18 at 06:28

The task of determining how many people speak in an audio file and assigning segments to each speaker is known as speaker diarization. Searching for this keyword turns up lots of research papers and some Python libraries. Most current research uses deep learning models, typically RNNs, to generate embeddings for short chunks of audio and then clusters those embeddings, so that each cluster ideally corresponds to a different speaker. It is a difficult task, especially if your files are noisy. I didn't find any library or tool that was very accurate. Even IBM's API is not that accurate.
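To make the "embed, then cluster" idea concrete, here is a rough sketch of the clustering half only. It is not how any particular library does it. The embeddings are made-up 3-dimensional toy vectors (a real system would get them from a neural speaker-embedding model), and the names `cluster_segments` and `threshold` are hypothetical. It merges the closest segments by cosine distance until the remaining groups are too far apart, and the number of groups left is the estimated speaker count.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a list of vectors (the cluster centroid)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_segments(embeddings, threshold=0.5):
    """Greedy agglomerative clustering on centroid cosine distance.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold` (an assumed, untuned value).
    Returns one integer speaker label per input embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        centroids = [mean_vector([embeddings[i] for i in c]) for c in clusters]
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cosine_distance(centroids[a], centroids[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are treated as different speakers
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


# Toy input: (start, end) times in seconds and one fake embedding per segment.
segments = [(0.0, 2.0), (2.0, 4.5), (4.5, 6.0), (6.0, 9.0)]
embeddings = [
    [0.90, 0.10, 0.00],
    [0.10, 0.95, 0.05],
    [0.85, 0.15, 0.05],
    [0.05, 0.90, 0.10],
]

labels = cluster_segments(embeddings)
print("estimated speakers:", len(set(labels)))
for (start, end), label in zip(segments, labels):
    print(f"{start:4.1f}s - {end:4.1f}s  speaker_{label}")
```

In practice the hard parts are the embedding model and picking the stopping threshold, which is where noisy audio and similar-sounding voices cause trouble.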

We have developed some deep learning models of our own for this task, which are exposed through APIs. You can take a look at https://developers.deepaffects.com/ for more info. We also have gender and emotion recognition APIs.

Disclosure: I work at DeepAffects.
