2

Is there a way to detect, in real time, whether multiple people are speaking? Do I need a voice recognition API for that?

I don't want to separate the audio and I don't want to transcribe it either. My approach would be to record at regular intervals using one mic (so mono) and then analyse those recordings. But how would I then detect and distinguish voices? I'd narrow it down by looking only at relevant frequencies, but then...

I do understand that this is no trivial undertaking. That's why I hope there's an API out there capable of doing this out of the box, preferably a mobile/web-friendly API.

Now this might sound like a Christmas shopping list, but as mentioned, I do not need to know anything about the content. So my guess is that full-fledged speech recognition would take a heavy toll on performance.

Tobias Philipp

1 Answer

2

Most similar problems (adult/child classifier, speech/music classifier, single voice/voice mixture classifier) are standard machine learning problems. You can solve them with a classifier such as a GMM (Gaussian mixture model). You only need to construct training data for your task, so:

  1. Take a set of clean, single-speaker recordings; audiobooks are an easy source to download.
  2. Prepare mixed data by mixing the clean recordings.
  3. Train one GMM on the clean data and another on the mixed data.
  4. Compare the probabilities from the clean-speech GMM and the mixed-speech GMM, and decide whether a mixture is present from the ratio of the two.
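
The steps above can be sketched roughly as follows. This is only an illustration under assumptions that are not part of the answer: it uses scikit-learn's `GaussianMixture`, the feature frames are random placeholders standing in for real per-frame audio features (for example MFCCs), and the names `is_mixture`, `clean_gmm`, `mixed_gmm` and the threshold of 0 are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# ASSUMPTION: placeholder feature frames of shape (n_frames, n_features).
# In a real setup these would be per-frame features (e.g. MFCCs) computed
# from clean recordings (step 1) and from sums of clean recordings (step 2).
N_FEATURES = 13
clean_train = rng.normal(loc=0.0, scale=1.0, size=(2000, N_FEATURES))
mixed_train = rng.normal(loc=1.5, scale=1.5, size=(2000, N_FEATURES))

# Step 3: fit one GMM per class.
clean_gmm = GaussianMixture(
    n_components=4, covariance_type="diag", random_state=0
).fit(clean_train)
mixed_gmm = GaussianMixture(
    n_components=4, covariance_type="diag", random_state=0
).fit(mixed_train)


def is_mixture(frames, threshold=0.0):
    """Step 4: log-likelihood ratio test on a segment of feature frames.

    score() returns the average per-frame log-likelihood, so the difference
    below is the log of the likelihood ratio between the two models.
    The threshold is a tunable assumption, not a recommended value.
    """
    llr = mixed_gmm.score(frames) - clean_gmm.score(frames)
    return bool(llr > threshold)


# Held-out placeholder segments drawn from the same two distributions.
clean_segment = rng.normal(0.0, 1.0, size=(200, N_FEATURES))
mixed_segment = rng.normal(1.5, 1.5, size=(200, N_FEATURES))
print(is_mixture(clean_segment), is_mixture(mixed_segment))
```

For near-real-time use, `is_mixture` could be called on the frames of each short recording window; how well this separates single voices from mixtures depends entirely on the real features and training data.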

You can find some code samples here:

https://github.com/littleowen/Conceptor

For example, you can try:

https://github.com/littleowen/Conceptor/blob/master/Gender.ipynb

Nikolay Shmyrev
  • Thanks, this looks very good. I think I understand the idea behind it and will try to run it. Ideally, the program would train the GMM classifier on the voices present when it starts. I haven't used Python yet, but I'll give it a try. – Tobias Philipp Jul 13 '16 at 05:32