
I want to do a speech-to-text analysis project covering 1) speaker recognition, 2) speaker diarization, and 3) speech-to-text. Right now I am testing the APIs provided by various companies such as Microsoft, Google, AWS, and IBM. I could find that Microsoft offers user enrollment and speaker recognition (https://cognitivewuppe.portal.azure-api.net/docs/services/563309b6778daf02acc0a508/operations/5645c3271984551c84ec6797). However, all the other platforms support speaker diarization but not speaker recognition. If I understand correctly, speaker diarization will "distinguish" between speakers, but how will it recognize them unless I enroll them first? I could find an enrollment option only in Azure.

But I want to be sure, so I just want to check here: maybe I am not looking at the correct documents, or maybe there is another way to achieve this in Google Cloud, Watson, or AWS Transcribe. If that is the case, can you folks please assist me with that?

1 Answer


Speaker Recognition is divided into two categories: speaker verification and speaker identification. https://learn.microsoft.com/en-us/azure/cognitive-services/speaker-recognition/home
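The difference between the two categories can be sketched in code: verification is a 1:1 check of a voice sample against one claimed, enrolled profile, while identification is a 1:N search over all enrolled profiles. Below is a minimal, self-contained Python illustration using toy voice "embeddings" and cosine similarity; the vectors, threshold, and helper names are all invented for illustration and are not the Azure API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical enrolled voiceprints; in a real system these would be
# produced from each user's enrollment audio.
enrolled = {
    "alice": [0.9, 0.1, 0.2],
    "bob":   [0.1, 0.8, 0.5],
}

def verify(sample, claimed_speaker, threshold=0.85):
    """Speaker verification (1:1): is this sample the claimed speaker?"""
    return cosine(sample, enrolled[claimed_speaker]) >= threshold

def identify(sample):
    """Speaker identification (1:N): which enrolled speaker matches best?"""
    return max(enrolled, key=lambda name: cosine(sample, enrolled[name]))

sample = [0.88, 0.12, 0.25]  # toy sample close to "alice"
print(verify(sample, "alice"))  # True for this toy data
print(identify(sample))         # alice
```

The key point for the question: both operations require enrollment, which is exactly what plain diarization does not need.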

Diarization is the process of separating speakers in a piece of audio. The batch pipeline supports diarization and can recognize two speakers on mono-channel recordings. When you use the batch transcription API with diarization enabled, every transcription result contains a SpeakerId. If diarization is not used, the JSON output shows "SpeakerId": null. Since diarization supports two voices, the speakers are identified as "1" or "2". https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/cognitive-services/Speech-Service/batch-transcription.md
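As a sketch of what consuming that output might look like, assuming a simplified version of the JSON shape described above (each recognized phrase carrying a SpeakerId of "1", "2", or null), you could split a transcript by speaker like this. The sample payload is made up for illustration:

```python
import json

# Invented sample mimicking a batch transcription result with diarization on.
payload = json.loads("""
{
  "recognizedPhrases": [
    {"SpeakerId": "1", "display": "Hello, thanks for calling."},
    {"SpeakerId": "2", "display": "Hi, I have a billing question."},
    {"SpeakerId": "1", "display": "Sure, let me pull up your account."}
  ]
}
""")

by_speaker = {}
for phrase in payload["recognizedPhrases"]:
    # With diarization disabled this field would be null (None after parsing).
    speaker = phrase["SpeakerId"] or "unknown"
    by_speaker.setdefault(speaker, []).append(phrase["display"])

for speaker, lines in sorted(by_speaker.items()):
    print(f"Speaker {speaker}: {' '.join(lines)}")
```

Note that the IDs are anonymous labels per recording; nothing links "1" here to "1" in another file, which is why enrollment-based recognition is a separate feature.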

Ex: In a call-center scenario, the customer does not need to identify who is speaking, and cannot train the model beforehand with speaker voices, since a new user calls in every time. Rather, they only need to distinguish different voices when converting voice to text.

or

You can use Video Indexer, which supports transcription, speaker diarization (enumeration), and emotion recognition from both the text and the tone of the voice. Additional insights are available as well, e.g. topic inference, language identification, brand detection, and translation. You can consume it via the video or audio-only APIs for COGS optimization. You can use VI for speaker diarization: when you get the insights JSON, you can find speaker IDs both under Insights.transcript[0].speakerId and under Insights.Speakers. When dealing with audio files where each speaker is recorded on a different channel, VI identifies that and applies the transcription and diarization accordingly.
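To make the two locations concrete: given an insights JSON shaped roughly as described (speaker IDs under Insights.transcript[*].speakerId plus a top-level speakers list), pulling out who said what could look like the sketch below. The payload is invented for illustration and is not a verbatim VI response:

```python
# Invented sample mirroring the two places mentioned above:
# transcript[*].speakerId and the top-level speakers list.
insights = {
    "transcript": [
        {"speakerId": 1, "text": "Welcome to the show."},
        {"speakerId": 2, "text": "Glad to be here."},
    ],
    "speakers": [
        {"id": 1, "name": "Speaker #1"},
        {"id": 2, "name": "Speaker #2"},
    ],
}

# Map speaker IDs to their display names, then label each transcript line.
names = {s["id"]: s["name"] for s in insights["speakers"]}
for line in insights["transcript"]:
    print(f'{names[line["speakerId"]]}: {line["text"]}')
```

As with batch transcription, these are anonymous per-video labels, not identities of enrolled users.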

Ram