Detect multiple voices without speech recognition

Question

Is there a way to detect in real time whether multiple people are speaking? Do I need a voice recognition API for that?

I don't want to separate the audio, and I don't want to transcribe it either. My approach would be to record short clips frequently using one mic (-> mono) and then analyse those recordings. But how would I then detect and distinguish voices? I'd narrow it down by looking only at relevant frequencies, but then...

I do understand that this is no trivial undertaking. That's why I hope there's an API out there capable of doing this out of the box, preferably a mobile/web-friendly API.

Now this might sound like a Christmas shopping list, but as mentioned I do not need to know anything about the content. So my guess is that full-fledged speech recognition would take a heavy toll on performance.

score 2 · Answer 1 · answered Jul 11 '16 at 22:49

Most similar problems (adult/child classifier, speech/music classifier, single voice/voice mixture classifier) are standard machine learning problems. You can solve them with a classifier such as a GMM (Gaussian mixture model). You only need to construct training data for your task, so:

Take a set of clean single-speaker recordings; audiobooks are a convenient source to download
Prepare mixed data by mixing the clean recordings together
Train one GMM on the clean data and one on the mixed data
Score new audio with both the clean-speech GMM and the mixed-speech GMM, and decide whether a mixture is present from the ratio of the two probabilities
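
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions, not code from the linked repository: it uses scikit-learn's `GaussianMixture`, it uses simple log band energies as a stand-in for proper speech features such as MFCCs, and the function names (`mix`, `frame_features`, `train_models`, `log_likelihood_ratio`), the frame sizes, and the number of mixture components are all made up for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def mix(a, b):
    """Overlay two mono signals to simulate two people talking at once."""
    n = min(len(a), len(b))
    return 0.5 * (a[:n] + b[:n])


def frame_features(signal, frame_len=256, hop=128, n_bands=16):
    """Log band energies per frame (assumed stand-in for MFCC features)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Drop the last bin(s) so the spectrum splits evenly into n_bands.
    usable = (power.shape[1] // n_bands) * n_bands
    bands = power[:, :usable].reshape(n_frames, n_bands, -1).sum(axis=2)
    return np.log(bands + 1e-10)


def train_models(clean_feats, mixed_feats, n_components=8):
    """Fit one GMM on clean-speech frames and one on mixed-speech frames."""
    gmm_clean = GaussianMixture(n_components=n_components,
                                covariance_type="diag",
                                random_state=0).fit(clean_feats)
    gmm_mixed = GaussianMixture(n_components=n_components,
                                covariance_type="diag",
                                random_state=0).fit(mixed_feats)
    return gmm_clean, gmm_mixed


def log_likelihood_ratio(feats, gmm_clean, gmm_mixed):
    """Average per-frame log-likelihood under 'mixed' minus under 'clean'.

    Values above a chosen threshold suggest more than one voice.
    """
    return gmm_mixed.score(feats) - gmm_clean.score(feats)
```

In use, you would stack `frame_features(...)` of all clean recordings into one matrix and those of all `mix(...)` outputs into another, pass both to `train_models`, and then call `log_likelihood_ratio` on each newly recorded clip. A threshold of 0 is a natural starting point, but it would need tuning on held-out data.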

You can find some code samples here:

https://github.com/littleowen/Conceptor

For example, you can try:

https://github.com/littleowen/Conceptor/blob/master/Gender.ipynb

Thanks, this looks very good. I think I understand the idea behind it and will try to run it. Ideally the program would train the GMM classifier on the voices present when it starts. I haven't used Python yet, but I'll give it a try. — Tobias Philipp, Jul 13 '16 at 05:32
