Google Cloud Speech: Distinguish Voices?

Question

I am interested in writing a voice recognition application that is aware of multiple speakers. For example if Bill, Joe, and Jane are talking then the application could not only recognize sounds as text but also classify the results by speaker (say 0, 1 and 2... because obviously/hopefully Google has no means of linking voices to people).

I am hunting for speech recognition APIs that might do this, and Google Cloud Speech comes up as a top ranked API. I have looked through the API docs to see if such functionality is available, and have not found it.

My question is: does/will this functionality exist?

Note: Google's support page says their engineers sometimes answer these questions on SO, so it seems plausible someone might have an answer to the "will" part of the question.

They recommend recording each speaker's voice on a separate channel. — mikalai, Apr 25 '18 at 08:50
I saw that the functionality is now available in beta in [Google Cloud Speech-to-Text](https://cloud.google.com/speech-to-text/). — OmarQ, May 06 '19 at 11:11

score 5 · Answer 1 · answered Oct 04 '17 at 21:40

IBM's Speech to Text service does this. If you use their REST service it's very simple: just add a URL parameter saying that you want different speakers identified. The documentation for it is here (https://console.bluemix.net/docs/services/speech-to-text/output.html#speaker_labels)

It works roughly like this:

 curl -X POST -u {username}:{password} \
   --header "Content-Type: audio/flac" \
   --data-binary @{path}audio-multi.flac \
   "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?model=en-US_NarrowbandModel&speaker_labels=true"
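For anyone calling the service from code rather than the shell, the same request could be prepared in Python roughly as follows. This is only a sketch using the standard library: the endpoint, model and `speaker_labels=true` parameter are copied from the curl command, while the function names and the placeholder credentials are made up for illustration.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint copied from the curl command above.
BASE_URL = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"


def recognize_url(model="en-US_NarrowbandModel", speaker_labels=True):
    """Build the recognize URL with the speaker_labels query parameter."""
    params = {
        "model": model,
        "speaker_labels": "true" if speaker_labels else "false",
    }
    return f"{BASE_URL}?{urlencode(params)}"


def build_request(audio_bytes, username, password):
    """Prepare (but do not send) the POST request that curl -u would make."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    headers = {
        "Content-Type": "audio/flac",
        "Authorization": f"Basic {token}",
    }
    return Request(recognize_url(), data=audio_bytes, headers=headers, method="POST")


url = recognize_url()
# Placeholder audio and credentials, for illustration only.
request = build_request(b"", "username", "password")
print(url)
```

Passing the prepared request to `urllib.request.urlopen` would actually send it; that step is left out here so the sketch runs without credentials.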

It then returns JSON with the results and speaker labels, like this:

{
 "results": [
    {
      "alternatives": [
        {
          "timestamps": [
            [
              "hello",
              0.68,
              1.19
            ],
            [
              "yeah",
              1.47,
              1.93
            ],
            [
              "yeah",
              1.96,
              2.12
            ],
            [
              "how's",
              2.12,
              2.59
            ],
            [
              "Billy",
              2.59,
              3.17
            ],
            . . .
          ],
          "confidence": 0.821,
          "transcript": "hello yeah yeah how's Billy "
        }
      ],
      "final": true
    }
  ],
  "result_index": 0,
  "speaker_labels": [
    {
      "from": 0.68,
      "to": 1.19,
      "speaker": 2,
      "confidence": 0.418,
      "final": false
    },
    {
      "from": 1.47,
      "to": 1.93,
      "speaker": 1,
      "confidence": 0.521,
      "final": false
    },
    {
      "from": 1.96,
      "to": 2.12,
      "speaker": 2,
      "confidence": 0.407,
      "final": false
    },
    {
      "from": 2.12,
      "to": 2.59,
      "speaker": 2,
      "confidence": 0.407,
      "final": false
    },
    {
      "from": 2.59,
      "to": 3.17,
      "speaker": 2,
      "confidence": 0.407,
      "final": false
    },
    . . .
  ]
}
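To get from that response to "who said what", the word timestamps have to be joined with the speaker labels. Here is a rough sketch in Python, assuming (as in the sample) that each speaker label carries the same start and end time as exactly one word; `words_by_speaker` is a made-up helper name, and the data is a trimmed copy of the response above with the elided entries left out.

```python
from itertools import groupby

# Trimmed copy of the sample response; the elided entries are left out.
response = {
    "results": [
        {
            "alternatives": [
                {
                    "timestamps": [
                        ["hello", 0.68, 1.19],
                        ["yeah", 1.47, 1.93],
                        ["yeah", 1.96, 2.12],
                        ["how's", 2.12, 2.59],
                        ["Billy", 2.59, 3.17],
                    ]
                }
            ]
        }
    ],
    "speaker_labels": [
        {"from": 0.68, "to": 1.19, "speaker": 2},
        {"from": 1.47, "to": 1.93, "speaker": 1},
        {"from": 1.96, "to": 2.12, "speaker": 2},
        {"from": 2.12, "to": 2.59, "speaker": 2},
        {"from": 2.59, "to": 3.17, "speaker": 2},
    ],
}


def words_by_speaker(response):
    """Return (speaker, text) turns by matching words to labels on start/end time."""
    # Index speaker labels by their (from, to) time span.
    label_for_span = {
        (label["from"], label["to"]): label["speaker"]
        for label in response.get("speaker_labels", [])
    }
    tagged = []
    for result in response.get("results", []):
        best = result["alternatives"][0]
        for word, start, end in best.get("timestamps", []):
            # None marks a word with no matching label (assumed fallback).
            tagged.append((label_for_span.get((start, end)), word))
    # Collapse consecutive words from the same speaker into one turn.
    return [
        (speaker, " ".join(word for _, word in group))
        for speaker, group in groupby(tagged, key=lambda pair: pair[0])
    ]


turns = words_by_speaker(response)
for speaker, text in turns:
    print(f"speaker {speaker}: {text}")
```

The speaker numbers are just anonymous labels, which is what the question asks for (0, 1, 2 rather than names).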

They also have WebSocket options and SDKs for different platforms that can access this, not just REST calls.

Good luck.

score 4 · Accepted Answer · answered Feb 02 '17 at 02:22

I know of no current provider that does this as an inbuilt part of their Speech Recognition API.

I've used the Microsoft Cognitive Services Speaker Recognition API for something similar, but the audio has to be sent to that API separately from their Speech Recognition API.

Being able to combine the two would be useful.

brandall


thanks for the links, I may be able to figure something out with those leads. I'm going to let the fallacy "you can't prove a negative" apply here and wait 2 days to see if anyone comes up with an "actual" solution. If after 2 days no "actual" solution is posted, I will mark this as the accepted answer. – Feb 02 '17 at 19:13
I lied. I was 5 days late. Pathological optimism... the bane of programmers :) – Feb 09 '17 at 23:28
@Paul Thank you :) If I stumble across anything else, I'll return and update my answer. – brandall Feb 09 '17 at 23:29
Google now offers this in beta mode: https://cloud.google.com/speech-to-text/docs/multiple-voices – Evan Knowles Jun 30 '20 at 10:11

score 3 · Answer 3 · answered Jul 24 '18 at 02:29

There is a big difference between Speaker Identification and Speaker Differentiation. Most cloud AI platforms only do Speaker Differentiation. Nuance is the only company that claims to provide Speaker Identification, but you need to purchase a license from them. https://www.nuance.com/en-nz/omni-channel-customer-engagement/security/multi-modal-biometrics.html

DSBLR


Azure now also offers this service ($10/1k requests) in addition to speaker verification ($5 per 1k requests) plus a nominal charge for storing profiles. Free tier offers 10k requests per month, but from what I can tell you're forced off free services after 30 days (I may be wrong about that though). Edit: Also looks like nuance doesn't anymore, link above is 404 and no mention in their list of services. – Synexis Aug 09 '23 at 01:16

Grokify · Answer 4 · 2019-10-08T19:28:55.213

Microsoft now does Speaker Identification as part of Conversation Transcription which combines real-time speech recognition, speaker identification, and diarization. This is an advanced feature of their Speech Services. This is described here:

https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/conversation-transcription-service

There are 3 steps:

1. Collect voice samples from users.
2. Generate user profiles using the user voice samples.
3. Use the Speech SDK to identify users (speakers) and transcribe speech.

The linked page illustrates this flow with a diagram.

This is currently limited to en-US and zh-CN in the following regions: centralus and eastasia.

score -5 · Answer 5 · answered Oct 31 '17 at 18:56

Google has recently released the ability to access user location, name, and a unique ID for the user in your apps.

The documentation can be found at: https://developers.google.com/actions/reference/nodejs/AssistantApp#getUser

Example to get user's name using getUserName:

// Assumes the v1 actions-on-google client library;
// req and res come from the enclosing HTTP handler.
const { DialogflowApp } = require('actions-on-google');

const app = new DialogflowApp({request: req, response: res});
const REQUEST_PERMISSION_ACTION = 'request_permission';
const SAY_NAME_ACTION = 'get_name';

// Ask the user for permission to read their name.
function requestPermission (app) {
  const permission = app.SupportedPermissions.NAME;
  app.askForPermission('To know who you are', permission);
}

// Say the user's name if permission was granted.
function sayName (app) {
  if (app.isPermissionGranted()) {
    app.tell('Your name is ' + app.getUserName().displayName);
  } else {
    // Response shows that user did not grant permission
    app.tell('Sorry, I could not get your name.');
  }
}

// Map each action name to its handler.
const actionMap = new Map();
actionMap.set(REQUEST_PERMISSION_ACTION, requestPermission);
actionMap.set(SAY_NAME_ACTION, sayName);
app.handleRequest(actionMap);

That's not what the OP is asking. – brendan Nov 14 '17 at 17:25
