
I am trying to hook up my real-time audio endpoint, which produces a continuous audio stream, with a Direct Line Speech (DLS) endpoint, which in turn interacts with my Azure bot API.

I have a websocket API that continuously receives an audio stream in binary format, and this is what I intend to forward to the DLS endpoint for continuous speech-to-text with my bot.

Based on the feedback and answer here, I have been able to hook up my Direct Line Speech endpoint with a real-time stream.

I've tried a sample WAV file, which DLS transcribes correctly, and my bot is able to retrieve the text and operate on it.

I have used the ListenOnce() API and a PushAudioInputStream to push the audio stream to the DLS speech endpoint.

The code below shows the internals of my ListenOnce() method:

// Create a push stream
using (var pushStream = AudioInputStream.CreatePushStream())
{
    using (var audioInput = AudioConfig.FromStreamInput(pushStream))
    {
        // Create a new Dialog Service Connector
        this.connector = new DialogServiceConnector(dialogServiceConfig, audioInput);
        // ... also subscribe to events for this.connector

        // Open a connection to Direct Line Speech channel
        this.connector.ConnectAsync();
        Debug.WriteLine("Connecting to DLS");

        pushStream.Write(dataBuffer, dataBuffer.Length);

        try
        {
            this.connector.ListenOnceAsync();
            System.Diagnostics.Debug.WriteLine("Started ListenOnceAsync");
        }
        catch (Exception ex)
        {
            System.Diagnostics.Debug.WriteLine($"ListenOnceAsync failed: {ex.Message}");
        }
    }
}

dataBuffer in the above code is the 'chunk' of binary data I've received on my websocket:

const int maxMessageSize = 1024 * 4; // 4 KB
var dataBuffer = new byte[maxMessageSize];

while (webSocket.State == WebSocketState.Open)
{
    var result = await webSocket.ReceiveAsync(new ArraySegment<byte>(dataBuffer), CancellationToken.None);
    if (result.MessageType == WebSocketMessageType.Close)
    {
        Trace.WriteLine($"Received websocket close message: {result.CloseStatus.Value}, {result.CloseStatusDescription}");
        await webSocket.CloseAsync(result.CloseStatus.Value, result.CloseStatusDescription, CancellationToken.None);
    }
    else if (result.MessageType == WebSocketMessageType.Text)
    {
        var message = Encoding.UTF8.GetString(dataBuffer, 0, result.Count);
        Trace.WriteLine($"Received websocket text message: {message}");
    }
    else // binary
    {
        Trace.WriteLine("Received websocket binary message");
        ListenOnce(dataBuffer); // calls the ListenOnce() method shown above
    }
}

But the above code doesn't work. I believe I have a couple of issues/questions with this approach:

  1. I believe I am not correctly chunking the data to Direct Line Speech to ensure that it receives full audio for correct S2T conversion.
  2. I know the DLS API supports ListenOnceAsync(), but I am not sure whether it supports ASR (i.e., knowing when the speaker on the other side has stopped talking).
  3. Can I just get the websocket url for the Direct Line Speech endpoint and assume DLS correctly consumes the direct websocket stream?
bedtym
  • I'm a little confused because you're saying you're using a "PullAudioInputStreamCallback" method to "push" the audio stream and then in the code I see you creating a push stream. Can you show where you're using this pull stream? – Kyle Delaney Oct 10 '19 at 15:49
  • My bad, I am using PushAudioInputStream. I was trying Pull sometime ago which didn't work as well. – bedtym Oct 11 '19 at 02:44
  • I'm noticing a few potential problems here. You're calling `ConnectAsync` without awaiting it, so that task might not complete by the time you call `ListenOnceAsync` (also without awaiting). Then you're disposing both the input stream and audio config before giving either of those asynchronous methods time to complete. Could that be your problem? You just say the code doesn't work, so I don't know if you're seeing an error message or what. – Kyle Delaney Oct 23 '19 at 18:22

1 Answer


I believe I am not correctly chunking the data to Direct Line Speech to ensure that it receives full audio for correct S2T conversion.

DialogServiceConnector.ListenOnceAsync will listen until the stream is closed (or enough silence is detected). You are not closing your stream except when you dispose of it at the end of your using block. You could await ListenOnceAsync, but then you'd have to make sure you close the stream first. If you don't await ListenOnceAsync, you can close the stream whenever you want, but you should probably do it as soon as you finish writing to the stream. Either way, make sure you don't dispose of the stream (or the config) before ListenOnceAsync has had a chance to complete.
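As a rough sketch of that ordering (dialogServiceConfig and dataBuffer are the names from your snippet; the wrapper method and everything else here are my assumptions, so treat this as illustration rather than a drop-in fix):

// Requires the Microsoft.CognitiveServices.Speech and Microsoft.CognitiveServices.Speech.Audio namespaces
async Task ListenOnceSketchAsync(DialogServiceConfig dialogServiceConfig, byte[] dataBuffer)
{
    using (var pushStream = AudioInputStream.CreatePushStream())
    using (var audioInput = AudioConfig.FromStreamInput(pushStream))
    using (var connector = new DialogServiceConnector(dialogServiceConfig, audioInput))
    {
        // ... subscribe to connector events here ...
        await connector.ConnectAsync();

        pushStream.Write(dataBuffer, dataBuffer.Length);
        // ... keep writing until the full utterance has been pushed ...
        pushStream.Close(); // end of audio; this is what lets ListenOnceAsync finish

        var result = await connector.ListenOnceAsync();
        System.Diagnostics.Debug.WriteLine($"Recognized: {result.Text}");
    } // the stream, config, and connector are only disposed after ListenOnceAsync completes
}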

You also want to make sure ListenOnceAsync gets the full utterance. If you're only receiving 4 KB of audio at a time, that's almost certainly not a full utterance. If you want to keep your chunks that small, it would be a good idea to keep a single ListenOnceAsync call running across multiple iterations of that loop rather than calling it over and over for every chunk you receive.
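Here's one way that could look, as a sketch only: a single connector and push stream that outlive the websocket loop, with one ListenOnceAsync in flight while the loop keeps feeding it chunks. The webSocket parameter is your existing socket; everything else is assumed.

// Requires System.Net.WebSockets, System.Threading, and the Speech SDK namespaces
async Task StreamToDlsSketchAsync(WebSocket webSocket, DialogServiceConfig dialogServiceConfig)
{
    using (var pushStream = AudioInputStream.CreatePushStream())
    using (var audioInput = AudioConfig.FromStreamInput(pushStream))
    using (var connector = new DialogServiceConnector(dialogServiceConfig, audioInput))
    {
        await connector.ConnectAsync();

        // Start listening once; it completes when the stream is closed or enough silence is detected.
        var listenTask = connector.ListenOnceAsync();

        var dataBuffer = new byte[1024 * 4]; // 4 KB
        while (webSocket.State == WebSocketState.Open)
        {
            var result = await webSocket.ReceiveAsync(new ArraySegment<byte>(dataBuffer), CancellationToken.None);
            if (result.MessageType == WebSocketMessageType.Binary)
            {
                pushStream.Write(dataBuffer, result.Count); // push only the bytes actually received
            }
            else if (result.MessageType == WebSocketMessageType.Close)
            {
                pushStream.Close(); // end of audio
                break;
            }
        }

        var speechResult = await listenTask;
        System.Diagnostics.Trace.WriteLine($"Recognized: {speechResult.Text}");
    }
}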

I know the DLS API supports ListenOnceAsync(), but I am not sure whether it supports ASR (i.e., knowing when the speaker on the other side has stopped talking).

I think you will have to determine when the speaker stops talking on the client side and then receive a message from your WebSocket indicating that you should close the audio stream for ListenOnceAsync.
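For example, something like this sketch, where "END_OF_UTTERANCE" is a made-up control message for your own client/server protocol, not anything DLS defines:

// Sketch: close the push stream when your client signals the end of an utterance.
void HandleControlMessage(string message, PushAudioInputStream pushStream)
{
    if (message == "END_OF_UTTERANCE") // hypothetical client-defined signal
    {
        pushStream.Close(); // lets the in-flight ListenOnceAsync complete
    }
}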

Update: it looks like ListenOnceAsync does support ASR after all (see the comments below).

Can I just get the websocket url for the Direct Line Speech endpoint and assume DLS correctly consumes the direct websocket stream?

You could try it, but I would not assume that myself. Direct Line Speech is still in preview and I don't expect compatibility to come easy.

Kyle Delaney
  • how can I determine if the speaker stops talking? – bedtym Oct 24 '19 at 05:05
  • @bedtym - If you're in charge of the client code too then you can do whatever you want. Detect a certain amount of silence, detect keywords like "over," have the user press a button, have every listening session last a specific amount of time, etc. It's your app. – Kyle Delaney Oct 24 '19 at 17:52
  • I understand that I can do this from the client side; I was asking whether anything can be configured on the DLS side. If I am going to do ASR myself, then it's as good as rebuilding DLS. – bedtym Oct 25 '19 at 01:42
  • @bedtym - That's a good point. I did some more digging and I've discovered that the DialogServiceConnector does indeed fire the SessionStopped event on its own when the speaker stops speaking. So I guess it's just up to you to know when to call ListenOnceAsync (or StartKeywordRecognitionAsync). Is this answer acceptable? – Kyle Delaney Oct 25 '19 at 23:07
  • That was my observed behavior too, and I tried using it, but the problem is that the ASR on Direct Line Speech is too sensitive. Let's say I am saying something and pause for just a second to think before continuing: the ASR on the DLS side has already stopped the session, and the bot has moved on to the next workflow. I am not sure how to work around that. Makes sense? – bedtym Oct 26 '19 at 00:22
  • For the purpose of the original question, the answer you provided is complete, but I am still stuck getting DLS to work. – bedtym Oct 26 '19 at 00:22
  • Based on [this comment](https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/323#issuecomment-517808050), it looks like there's currently no way to adjust the amount of silence needed to end an utterance. Hopefully that will change in the future. If this answer is acceptable then please accept it. – Kyle Delaney Oct 28 '19 at 19:27