1

I'm trying to use Azure Cognitive Services Speech to Text and I'm hitting a roadblock in .NET Core.

I have native support for WAV files using AudioConfig.FromWavFileInput(), which is great.

However, I also need to support MP3s.

I have found the documentation on compressed audio support: https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-use-codec-compressed-audio-input-streams?tabs=debian&pivots=programming-language-csharp

However, it refers to push audio streams.

This is where I'm getting lost.

I have found this example for streaming codec-compressed audio: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/cpp/linux/compressed-audio-input/compressed-audio-input.cpp

However, it is C++ rather than C# on .NET Core, and converting it is not really my strong suit.

So I'm at a bit of a loss.

Any assistance would be greatly appreciated.

TunedBy

2 Answers

1

This sample: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/csharp/sharedcontent/console/speech_recognition_samples.cs contains two methods specific to compressed audio input. The second one, which uses a pull stream, seems pretty straightforward: just plug in your key, region, and file path.

glenn
  • That is awesome, thanks Glenn! Not sure if you know, but what is the difference between pull and push? – TunedBy Jul 01 '21 at 01:10
  • Push/pull indicates how data flows from the producer to the consumer (in our case, from the audio source to the recognizer instance). For a push stream, as soon as the source (such as an audio file) has data available, it's written to the stream. The consumer can then choose what to do with the new data. For a pull stream, whenever the consumer wants to read data, it's pulled from the source (whether an audio file or a buffer holding audio data). – glenn Jul 02 '21 at 15:43
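
To make the push/pull distinction above concrete, here is a minimal, language-agnostic sketch in plain Python. It does not use the Speech SDK: `PushStream` and `PullStream` are hypothetical stand-ins written for illustration, and the byte string stands in for audio data.

```python
import io
from collections import deque


class PushStream:
    """Producer-driven: the source writes chunks as soon as it has them."""

    def __init__(self):
        self._chunks = deque()

    def write(self, chunk: bytes) -> None:
        # Called by the producer whenever new data is available.
        self._chunks.append(chunk)

    def drain(self) -> bytes:
        # The consumer decides when to take whatever has been pushed so far.
        data = b"".join(self._chunks)
        self._chunks.clear()
        return data


class PullStream:
    """Consumer-driven: data is read from the source only on request."""

    def __init__(self, source: io.BufferedIOBase):
        self._source = source

    def read(self, size: int) -> bytes:
        # Called by the consumer; returns b"" once the source is exhausted.
        return self._source.read(size)


audio = b"0123456789"  # placeholder for real audio bytes

# Push: the producer loops over its data and writes it into the stream.
push = PushStream()
for i in range(0, len(audio), 4):
    push.write(audio[i:i + 4])
pushed = push.drain()

# Pull: the consumer loops and asks the stream for the next chunk.
pull = PullStream(io.BytesIO(audio))
pulled = b""
while chunk := pull.read(4):
    pulled += chunk
```

Both loops move the same bytes. The difference is which side drives the loop: the producer calls `write` in the push case, while the consumer calls `read` in the pull case.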
0

If you have files, especially if you have multiple of them, you can benefit from using batch transcription. It natively supports files in WAV, MP3 and OGG format.

The documentation links to the API reference, which also covers model customization. There you can select the region you are interested in and export a Swagger file, which you can use to generate a client in the programming language of your choice.

For your scenario you will only need four APIs, and you could use the standard HttpClient to execute the requests. You would want to:

  • Create a batch transcription.
  • Get your transcription to check its state. If it is complete, you get the URL you will need next. If it has failed, you get a message describing the problem.
  • Get the results after the batch transcription has succeeded. The object with the kind TranscriptionReport contains a list of the files that were processed, whether each transcription was successful and, if not, why. The other objects contain the results of the successful transcriptions.
  • (Here you need to iterate over the contentUrls to download the files.)
  • Delete the transcription(s) after you have retrieved the results.
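
The steps above could be sketched roughly as follows. This is only an outline of the control flow, written in Python for brevity rather than C#. The `send` callable is a hypothetical transport (in .NET it would wrap HttpClient and add the subscription key header), and the field names (`self`, `status`, `links.files`, `values`, `kind`, `links.contentUrl`) and status strings are assumptions based on the v3.0 REST API, so check them against the Swagger file for your region.

```python
import time
from typing import Callable, Optional

# Hypothetical transport: send(method, url, json_body) -> parsed JSON dict, or None.
Send = Callable[[str, str, Optional[dict]], Optional[dict]]


def run_batch_transcription(send: Send, base_url: str, content_urls: list,
                            locale: str = "en-US",
                            poll_seconds: float = 5.0) -> dict:
    # 1. Create the batch transcription (body fields are assumed).
    created = send("POST", f"{base_url}/transcriptions", {
        "contentUrls": content_urls,
        "locale": locale,
        "displayName": "mp3 batch",
    })
    transcription_url = created["self"]

    try:
        # 2. Poll the transcription until it succeeds or fails.
        while True:
            current = send("GET", transcription_url, None)
            if current["status"] == "Succeeded":
                break
            if current["status"] == "Failed":
                raise RuntimeError(f"Transcription failed: {current}")
            time.sleep(poll_seconds)

        # 3. List the result files of the finished transcription.
        files = send("GET", current["links"]["files"], None)

        # 4. Download every file; the report is kept apart from the results.
        report, results = None, []
        for entry in files["values"]:
            content = send("GET", entry["links"]["contentUrl"], None)
            if entry["kind"] == "TranscriptionReport":
                report = content
            else:
                results.append(content)
        return {"report": report, "results": results}
    finally:
        # 5. Delete the transcription once the results have been fetched
        #    (this also runs if the transcription failed).
        send("DELETE", transcription_url, None)
```

Passing the transport in as a parameter keeps the sketch independent of any HTTP library and makes the polling logic easy to exercise with a fake `send` before pointing it at the real service.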
D. Siemer