We have a video library of 3,000+ files, mostly tech conferences and town halls, with mono-channel audio and 1-10 speakers per recording. We would now like to run speaker diarization on them.
We tried the Batch transcription REST API
https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/cognitive-services/Speech-Service/batch-transcription.md
but it appears to support only 2 speakers.
We also investigated the Conversation Transcription service https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/conversation-transcription
but it expects multi-channel audio stream input.
Could you please recommend which Cognitive Services tool, if any, we could use for this task?
Thanks!