
Instead of using the Azure Cognitive Services JS SDK directly on my web page, I need to send the recorded sound to my server through SignalR, apply some logic, and then pass the audio to the Translation SDK. To stream the audio from the client I'm using the JS MediaRecorder:

const blobToBase64 = function (blob, callback) {
    const reader = new FileReader();
    reader.onload = function () {
        //reader.result is a data URL ("data:<mime>;base64,<payload>");
        //keep only the base64 payload after the comma
        const dataUrl = reader.result;
        const base64 = dataUrl.split(',')[1];
        callback(base64);
    };
    reader.readAsDataURL(blob);
};

let mediaRecorder;

const startUploading = async () => {
    const subject = new signalR.Subject();
    const connection = new signalR.HubConnectionBuilder().withUrl("https://.../myHub").build();
    await connection.start();

    //stream to the hub function as chunks of bytes
    await connection.send("UploadStream", subject);

    navigator.mediaDevices.getUserMedia({
        audio: {
            //Azure Speech SDK supported format 
            channelCount: 1,
            sampleSize: 16,
            sampleRate: 16000
        }, video: false
    })
        .then(stream => {
            var audioTrack = stream.getAudioTracks()[0];
            //making sure the constraints are in place
            audioTrack.applyConstraints({
                channelCount: 1,
                sampleSize: 16,
                sampleRate: 16000
            })

            mediaRecorder = new MediaRecorder(stream, {
                mimeType: "audio/webm;codecs=pcm",
            });

            mediaRecorder.addEventListener("dataavailable", e => {
                //convert blob to base64 to send to the SignalR as string
                //Tried sending the blob directly using "subject.next(e.data)", but it didn't work
                blobToBase64(e.data, base64 => {
                    subject.next(base64);
                })
            })

            //timeslice = 1000, send every second
            mediaRecorder.start(1000)

            mediaRecorder.addEventListener("stop", () => {
                subject.complete()
            });
        });
}
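
To check what the browser is actually going to send (relevant to the first question below), the recorder and track can be inspected before streaming. This is a small sketch, not part of the original code; verifyRecordingFormat is an illustrative name:

const verifyRecordingFormat = (stream, recorder) => {
    //not every browser accepts pcm-in-webm; Chrome does, Firefox/Safari may not
    console.log("audio/webm;codecs=pcm supported:",
        MediaRecorder.isTypeSupported("audio/webm;codecs=pcm"));

    //the settings the browser actually applied (may differ from the requested constraints)
    const settings = stream.getAudioTracks()[0].getSettings();
    console.log("sampleRate:", settings.sampleRate, "channelCount:", settings.channelCount);

    //the container/codec the recorder will emit
    console.log("recorder mimeType:", recorder.mimeType);
};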

On my SignalR hub, after receiving each chunk of data and converting it to an array of bytes, I push the chunk to the Azure Translation SDK through an in-memory audioInputStream:

//SignalR Hub function
public async Task UploadStream(string sessionId, IAsyncEnumerable<string> stream)
{
    await foreach (var base64Str in stream)
    {
        var chunk = Convert.FromBase64String(base64Str);

        TranslationServiceInstance.PushAudio(chunk);
    }
}
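
To see what the chunks actually contain before pushing them into the SDK (relevant to the second question below), the first bytes can be inspected on the hub side. A minimal sketch with an illustrative helper name: WebM/Matroska data starts with the EBML magic bytes 0x1A 0x45 0xDF 0xA3, whereas raw PCM has no header at all.

//returns true if the chunk starts with the EBML magic bytes, i.e. the client is
//sending a WebM container rather than raw PCM samples
private static bool LooksLikeWebM(byte[] chunk)
{
    return chunk.Length >= 4
        && chunk[0] == 0x1A && chunk[1] == 0x45
        && chunk[2] == 0xDF && chunk[3] == 0xA3;
}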


//TranslationService Class
public class TranslationService
{
    private PushAudioInputStream audioInputStream { get; set; }
    private AudioConfig audioConfig { get; set; }
    private TranslationRecognizer recognizer { get; set; }

    public event EventHandler<byte[]> AudioReceived;
    public event EventHandler<string> TextReceived;

    public TranslationService()
    {
        var translationConfig = SpeechTranslationConfig.FromSubscription("###", "REGION");
        translationConfig.SetProperty(PropertyId.Speech_LogFilename, @$"PATH\log.txt");
        translationConfig.SpeechRecognitionLanguage = "en-US";

        translationConfig.AddTargetLanguage("fa");
        translationConfig.VoiceName = "fa-IR-FaridNeural";

        audioInputStream = AudioInputStream.CreatePushStream();
        //I've been trying to set the correct format, but I'm not sure what format the audio chunks arrive in
        //audioInputStream = AudioInputStream.CreatePushStream(AudioStreamFormat.GetWaveFormatPCM(16000, 32, 1));
        audioConfig = AudioConfig.FromStreamInput(audioInputStream);
        recognizer = new TranslationRecognizer(translationConfig, audioConfig);
    }

    public async Task Start()
    {
        recognizer.Recognized += Recognizer_Recognized;
        recognizer.Synthesizing += Recognizer_Synthesizing;
        await recognizer.StartContinuousRecognitionAsync();
    }

    public void PushAudio(byte[] audioChunk)
    {
        //To make sure the audio chunks are being sent correctly
        using (var fs = new FileStream(@$"PATH\sending.wav", FileMode.Append))
        {
            fs.Write(audioChunk, 0, audioChunk.Length);
        }

        audioInputStream.Write(audioChunk, audioChunk.Length);
    }

    private void Recognizer_Synthesizing(object sender, TranslationSynthesisEventArgs e)
    {
        var bytes = e.Result.GetAudio();
        AudioReceived?.Invoke(this, bytes);
    }

    private void Recognizer_Recognized(object sender, TranslationRecognitionEventArgs e)
    {
        TextReceived?.Invoke(this, $"Recognizer_Recognized{e.Result.Text}");

        File.AppendAllText(@"PATH\result.txt", e.Result.Text);
    }
}
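
One thing that may help pin down the format complaint is hooking the recognizer's Canceled event, which reports an error code and details when the audio cannot be processed. A minimal sketch of what could be added in Start() (not in the original code):

recognizer.Canceled += (s, e) =>
{
    //e.Reason, e.ErrorCode and e.ErrorDetails usually spell out format problems explicitly
    File.AppendAllText(@"PATH\result.txt",
        $"Canceled: {e.Reason} {e.ErrorCode} {e.ErrorDetails}{Environment.NewLine}");
};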

The issue is that the Recognizer doesn't detect the sound and complains about the audio format.

Questions:

  • How can I make sure that the browser is sending the right format to the SignalR hub?
  • How do I know what the metadata of the stream is, so I can either reject it or convert it to the format the Speech SDK expects?

My assumption was that when raw data is sent from the browser, the Recognizer automatically converts it to the desired format, but it seems I'm missing something.

  • The default input audio format for the Speech SDK TranslationRecognizer is 16 kHz sample rate, mono, 16-bit/sample (signed), little endian. We do support some other input PCM formats (in which case you will need to call audioInputStream = AudioInputStream.CreatePushStream(AudioStreamFormat.GetWaveFormatPCM(<samples per second>, <bits per sample>, <channels>))), but streaming the default format from the client is the preferred option. I suggest you dump the audio buffers to a file before sending them into the Speech SDK and manually check that the audio is in the right format and not corrupted. – Darren Cohen Feb 01 '22 at 21:37
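
For reference, a concrete instance of the call mentioned in the comment above, filled in with the default input format it describes (16 kHz sample rate, 16 bits per sample, mono); this mirrors the commented-out line in the TranslationService constructor:

audioInputStream = AudioInputStream.CreatePushStream(
    AudioStreamFormat.GetWaveFormatPCM(16000, 16, 1));
audioConfig = AudioConfig.FromStreamInput(audioInputStream);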
