We have a video library of 3,000+ files, mostly tech conference talks and town halls, with mono-channel audio and 1 to 10 speakers per recording. We would now like to run speaker diarization on them. We tried the Batch transcription REST API
https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/cognitive-services/Speech-Service/batch-transcription.md but it appears to be limited to 2 speakers. We also investigated the Conversation Transcription service https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/conversation-transcription but it expects a multi-channel audio stream as input.

Could you recommend which Cognitive Services tool, if any, we could use for this task?

Thanks!

  • This is an [off-topic question](https://stackoverflow.com/help/on-topic): "4. Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam." Maybe see https://superuser.com – Niklas E. Sep 21 '20 at 14:42

1 Answer

As you noticed, batch processing currently supports diarization for 2 speakers only. We expect that in November/December batch will use a new diarization provider that supports 10 speakers on a mono input audio stream.

I don't know of any Cognitive Services tool that would match your requirements right now.
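
For reference, here is a minimal sketch of the request body that turns on the existing two-speaker diarization in a batch transcription job. It assumes the v3.0 REST API shape: the endpoint path and the `diarizationEnabled` / `wordLevelTimestampsEnabled` property names should be checked against the current API reference, and `build_transcription_request`, the display name, and the example URL are made up for illustration.

```python
# Assumed endpoint shape for the v3.0 batch transcription REST API.
# The region is a placeholder to fill in for your own Speech resource.
ENDPOINT_TEMPLATE = (
    "https://{region}.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions"
)


def build_transcription_request(content_urls, locale="en-US",
                                display_name="Diarization batch"):
    """Build the JSON body for a batch transcription job with diarization on.

    Hypothetical helper: it only assembles the payload and sends nothing.
    """
    if not content_urls:
        raise ValueError("at least one audio URL is required")
    return {
        "contentUrls": list(content_urls),
        "locale": locale,
        "displayName": display_name,
        "properties": {
            # Diarization on mono audio; currently separates 2 speakers only.
            "diarizationEnabled": True,
            # Assumption: word-level timestamps must be enabled alongside
            # diarization in this API version.
            "wordLevelTimestampsEnabled": True,
        },
    }
```

The resulting dictionary would be sent as the JSON body of a POST to the endpoint, with the resource key in the `Ocp-Apim-Subscription-Key` header. With more than two speakers in a recording, the output will still label only two.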

thx Wolfgang

wolfma