
I want to perform real-time speech recognition on the HoloLens 2 with Unity 2021, using the Microsoft Azure Cognitive Services Speech SDK. Instead of the default HoloLens 2 microphone stream, I want to switch to the stream category "room capture", for which I must use the Windows Microphone Stream (see link). Initializing and starting the Windows Microphone Stream succeeds with this code:

    // Create the Windows microphone stream.
    micStream = new WindowsMicrophoneStream();
    if (micStream == null)
    {
        Debug.Log("Failed to create the Windows Microphone Stream object");
        return;
    }

    // Initialize the Windows microphone stream.
    WindowsMicrophoneStreamErrorCode result = micStream.Initialize(streamType);
    if (result != WindowsMicrophoneStreamErrorCode.Success)
    {
        Debug.Log($"Failed to initialize the microphone stream. {result}");
        return;
    }
    else Debug.Log($"Initialized the microphone stream. {result}");

    // Start the microphone stream.
    result = micStream.StartStream(true, false);
    if (result != WindowsMicrophoneStreamErrorCode.Success)
    {
        Debug.Log($"Failed to start the microphone stream. {result}");
    }
    else Debug.Log($"Started the microphone stream. {result}");

I don't have much knowledge of audio streams, but I assume that for the Speech SDK to get the room capture, I have to feed it this microphone stream. My problem is that I have not found a way to do that. I would probably have to implement my own PullAudioInputStreamCallback class (as e.g. here), but I don't know how Read() should be implemented for the Windows Microphone Stream. Alternatively, I considered using a PushStream like so:

        SpeechConfig speechConfig = SpeechConfig.FromSubscription(SpeechController.Instance.SpeechServiceAPIKey, SpeechController.Instance.SpeechServiceRegion);
        speechConfig.SpeechRecognitionLanguage = fromLanguage;
        using (var pushStream = AudioInputStream.CreatePushStream())
        {
            using (var audioInput = AudioConfig.FromStreamInput(pushStream))
            {
                using (var recognizer = new SpeechRecognizer(speechConfig, audioInput))
                {
                    recognizer.Recognizing += RecognizingHandler;
                    ...

                    await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);

                    // The "MicStreamReader" is not implemented! 
                    using (MicStreamReader reader = new MicStreamReader(MicStream))
                    {
                        byte[] buffer = new byte[1000];
                        while (true)
                        {
                            var readSamples = reader.Read(buffer, (uint)buffer.Length);
                            if (readSamples == 0)
                            {
                                break;
                            }
                            pushStream.Write(buffer, readSamples);
                        }
                    }
                    pushStream.Close();
                }
            }
        }

But I would need something like a "MicStreamReader" in this code. Could you help me with this approach or do you know a better one?

Leado

1 Answer


I would suggest the following steps:

  1. Use https://github.com/microsoft/MixedRealityToolkit-Unity/blob/htk_release/Assets/HoloToolkit-Examples/Input/Scripts/MicStreamDemo.cs as a base: create the MicStream with the desired stream category and then read the audio frames using MicStream.MicGetFrame in the OnAudioFilterRead callback method.

  2. Modify the sample from (1) to also create the Speech SDK's SpeechRecognizer with a push-stream audio configuration. Then, for each audio frame read in the OnAudioFilterRead callback, write the frame to the Speech SDK's push stream. Since MicStream.MicGetFrame returns audio as floats, you need to convert it to 16-bit PCM before writing to the SDK. For a float-to-PCM conversion example, see the following sample, which uses the Unity microphone to capture audio and writes it to the Speech SDK via a push stream: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/csharp/unity/from-unitymicrophone/Assets/Scripts/HelloWorld.cs.
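The two steps above might be sketched roughly as follows. This is an untested sketch, not the exact solution: it assumes the MRTK MicStream API (MicInitializeCustomRate, MicStartStream, MicGetFrame) and the Speech SDK's PushAudioInputStream; the class name, the `<key>`/`<region>` placeholders, and the error handling are illustrative only.

```csharp
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using UnityEngine;

public class RoomCaptureRecognizer : MonoBehaviour
{
    private PushAudioInputStream pushStream;
    private SpeechRecognizer recognizer;

    private async void Start()
    {
        // Step 1: initialize MicStream with the room-capture category,
        // as in the MicStreamDemo sample.
        MicStream.MicInitializeCustomRate(
            (int)MicStream.StreamCategory.ROOM_CAPTURE, AudioSettings.outputSampleRate);
        MicStream.MicStartStream(keepData: false, previewOnDevice: false);

        // Step 2: create a recognizer fed from a push stream.
        var speechConfig = SpeechConfig.FromSubscription("<key>", "<region>"); // placeholders
        pushStream = AudioInputStream.CreatePushStream();
        var audioInput = AudioConfig.FromStreamInput(pushStream);
        recognizer = new SpeechRecognizer(speechConfig, audioInput);
        recognizer.Recognizing += (s, e) => Debug.Log($"RECOGNIZING: {e.Result.Text}");
        await recognizer.StartContinuousRecognitionAsync().ConfigureAwait(false);
    }

    private void OnAudioFilterRead(float[] buffer, int numChannels)
    {
        // Pull the next frame from the Windows microphone stream into the buffer.
        MicStream.MicGetFrame(buffer, buffer.Length, numChannels);

        // Convert the float samples to 16-bit PCM and push them to the Speech SDK.
        var pcm = new byte[buffer.Length * sizeof(short)];
        for (int i = 0; i < buffer.Length; i++)
        {
            short sample = (short)(Mathf.Clamp(buffer[i], -1f, 1f) * short.MaxValue);
            System.BitConverter.GetBytes(sample).CopyTo(pcm, i * sizeof(short));
        }
        pushStream.Write(pcm);
    }
}
```

Note that OnAudioFilterRead runs on Unity's audio thread, so keep the per-frame work minimal.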

  • Hi, thank you so much! That was already very helpful, though I'm still stuck at one point: currently, in OnAudioFilterRead I call MicStream.MicGetFrame(buffer, ...), directly convert the buffer using ConvertAudioClipDataToInt16ByteArray, and then write the resulting bytes with pushStream.Write(). However, no speech is recognized by the Speech SDK (no errors). Do you have an idea why? – Leado Jul 13 '22 at 15:54
  • My buffer is filled with zeros even after MicStream.MicGetFrame(), so I guess I have a problem similar to [here](https://githubmemory.com/repo/microsoft/MixedRealityToolkit-Unity/issues/9717) and that it has nothing to do with your solution :) – Leado Jul 13 '22 at 16:45
  • Please first verify that you can record audio correctly before writing it into the Speech SDK. Could you e.g. dump the data to a file (e.g. raw PCM or WAV) and verify that it contains proper audio? – Jarno Hakulinen Jul 13 '22 at 21:57
  • Yes, I verified the audio using the MicStreamDemo's audio recording methods. The .wavs are flawless. – Leado Jul 14 '22 at 10:06
  • Could you please verify the format of the audio? The Speech SDK expects audio by default in 16 kHz, 16-bit, mono format. For example, if the sample rate you get is 48 kHz and mono, you then need to construct the PushAudioInputStream with an AudioStreamFormat of 48 kHz, 16-bit, 1 channel, see https://learn.microsoft.com/en-us/dotnet/api/microsoft.cognitiveservices.speech.audio.pushaudioinputstream.-ctor?view=azure-dotnet#microsoft-cognitiveservices-speech-audio-pushaudioinputstream-ctor(microsoft-cognitiveservices-speech-audio-audiostreamformat) – Jarno Hakulinen Jul 14 '22 at 14:11
  • The push audio input stream has the default format (I'm using GetDefaultOutputFormat()) and I set the sample rate of the micStream to 16k in MicInitializeCustomRate(). However, I found that the number of channels in OnAudioFilterRead is 2 (also set to two in the [stream selector's](https://github.com/microsoft/MixedRealityToolkit/blob/main/Input/MicStreamSelector/Source/MicStreamSelector.cpp) MicInitializeCustomRateWithGraph at line 155). I wrote a split method to select only one of the channels, but that didn't help with the main problem, as MicGetFrame always returns a buffer of zeros. – Leado Jul 14 '22 at 20:10
  • Update: I'm not sure how that was a problem, but the "Input Gain" was initially set to 1. I set it to 2 (or anything else) in the Unity Editor and it worked; I reset it to 1 and it still worked. I don't remember if I modified something else, but either way I think it is very strange behaviour. Anyway, thank you for helping me! – Leado Aug 04 '22 at 14:10
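For reference, matching the push stream's format to the actual capture format, as discussed in the comments above, might look like the following sketch. The 48 kHz sample rate and stereo input are example values, not the asker's confirmed settings, and DownmixToMono is a hypothetical helper name.

```csharp
using Microsoft.CognitiveServices.Speech.Audio;

// If the captured audio is not 16 kHz / 16-bit / mono, declare the real format
// explicitly when creating the push stream (example: 48 kHz, 16-bit, 1 channel
// after downmixing).
var format = AudioStreamFormat.GetWaveFormatPCM(48000, 16, 1);
var pushStream = AudioInputStream.CreatePushStream(format);

// Downmix an interleaved stereo float buffer (as delivered by OnAudioFilterRead
// with numChannels == 2) to mono by averaging each left/right sample pair.
static float[] DownmixToMono(float[] interleaved)
{
    var mono = new float[interleaved.Length / 2];
    for (int i = 0; i < mono.Length; i++)
        mono[i] = 0.5f * (interleaved[2 * i] + interleaved[2 * i + 1]);
    return mono;
}
```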