
I have developed an application for streaming speech recognition in C++, once with another vendor's API and once with the IBM Watson Speech to Text service API.

In both programs I am using the same file, which contains this audio:

several tornadoes touch down as a line of severe thunderstorms swept through Colorado on Sunday

This file is 641,680 bytes in size, and I am sending it to the Speech to Text servers in chunks of at most 100,000 bytes (six full chunks plus a final 41,680-byte chunk).

Now, with the other API everything is recognized as a whole; with the IBM Watson API it is not. Here is what I have done:

  • Connect to IBM Watson web server (Speech to text API)
  • Send start frame {"action":"start","content-type":"audio/mulaw;rate=8000"}
  • Send binary 100,000 bytes
  • Send stop frame {"action":"stop"}
  • ...Repeat the binary and stop frames until the last byte is sent.

The IBM Watson Speech API only recognized the chunks individually, e.g.:

several tornadoes touch down
a line of severe thunder
swept through Colorado
Sunday

This appears to be the output of the individual chunks: words that fall on a chunk boundary (for example "thunderstorms" here, which is partially at the end of one chunk and partially at the start of the next) are incorrectly recognized or dropped.

What am I doing wrong?

EDIT (I am using C++ with the Boost.Beast library for the WebSocket interface):

//Do the websocket handshake 
void IbmWebsocketSession::on_ssl_handshake(beast::error_code ec) {

    auto mToken = mSttServiceObject->GetToken(); // Get the authentication token

    //Complete the websocket handshake and call back the "send_start" function
    mWebSocket.async_handshake_ex(mHost, mUrlEndpoint, [mToken](request_type& reqHead) {reqHead.insert(http::field::authorization,mToken);},
            bind(&IbmWebsocketSession::send_start, shared_from_this(), placeholders::_1));
}

//Send the start frame
void IbmWebsocketSession::send_start(beast::error_code ec) {

    //Send the START_FRAME and call back the "read_resp" function to receive the "state: listening" message
    mWebSocket.async_write(net::buffer(START_FRAME),
            bind(&IbmWebsocketSession::read_resp, shared_from_this(), placeholders::_1, placeholders::_2));
}

//Send the binary data
void IbmWebsocketSession::send_binary(beast::error_code ec) {

    streamsize bytes_read = mFilestream.rdbuf()->sgetn(&chunk[0], chunk.size()); //Gets the binary data chunks from a file (which is being written at run time)

    // Send binary data
    if (bytes_read > mcMinsize) {  //Minimum size defined by IBM is 100 bytes.
                                   //If the chunk is larger than 100 bytes, send the data and then call back the "send_stop" function
        mWebSocket.binary(true);

        /**********************************************************************
         *  Wait a second before writing the next chunk.
         **********************************************************************/
        this_thread::sleep_for(chrono::seconds(1));

        mWebSocket.async_write(net::buffer(&chunk[0], bytes_read),
                bind(&IbmWebsocketSession::send_stop, shared_from_this(), placeholders::_1));
    } else {                     //If the chunk is smaller than 100 bytes, DO NOT send the data; only call the "send_stop" function
        shared_from_this()->send_stop(ec);
    }

}

void IbmWebsocketSession::send_stop(beast::error_code ec) {

    mWebSocket.binary(false);
    /*****************************************************************
     * Send the Stop message
     *****************************************************************/
    mWebSocket.async_write(net::buffer(mTextStop),
            bind(&IbmWebsocketSession::read_resp, shared_from_this(), placeholders::_1, placeholders::_2));
}

void IbmWebsocketSession::read_resp(beast::error_code ec, size_t bytes_transferred) {
    boost::ignore_unused(bytes_transferred);
    if (mWebSocket.is_open()) {
        // Read the websocket response and call back the "display_buffer" function
        mWebSocket.async_read(mBuffer, bind(&IbmWebsocketSession::display_buffer, shared_from_this(), placeholders::_1));
    } else {
        cerr << "Error: " << ec.message() << endl;
    }

}

void IbmWebsocketSession::display_buffer(beast::error_code ec) {

    /*****************************************************************
     * Get the buffer into stringstream
     *****************************************************************/
    msWebsocketResponse << beast::buffers(mBuffer.data());

    mResponseTranscriptIBM = ParseTranscript(); //Parse the response transcript

    mBuffer.consume(mBuffer.size()); //Clear the websocket buffer

    if ("Listening" == mResponseTranscriptIBM && true != mSttServiceObject->IsGstFileWriteDone()) { // IsGstFileWriteDone -> checks if the user has stopped speaking
        shared_from_this()->send_binary(ec);
    } else {
        shared_from_this()->close_websocket(ec, 0);
    }
}
RC0993
  • show the code you have used - it's better than just words – data_henrik May 27 '19 at 06:48
  • @data_henrik Well sure! I can share the code, but I don't think this is a coding issue. My guess is that either this is the functionality of IBM's API, or I am *logically* doing something wrong. Although I will have to make some changes to follow my organization's policies, so it may take some time to do the editing. – RC0993 May 27 '19 at 06:53

2 Answers


IBM Watson Speech to Text has several APIs to transmit audio and receive transcribed text. Based on your description, you seem to be using the WebSocket interface.

For the WebSocket Interface, you would open the connection (start), then send individual chunks of data, and - once everything has been transmitted - stop the recognition request.

You have not shared code, but it seems you are starting and stopping a request for each chunk. Only stop after the last chunk.

I would recommend taking a look at the API doc, which contains samples in different languages. The Node.js sample shows how to register for events. There are also examples on GitHub like this WebSocket API with Python. And here is another one that shows the chunking.
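
To make that concrete, here is a minimal synchronous sketch of that frame sequence with Boost.Beast. It assumes an already-connected, already-upgraded `websocket::stream`; the `stream_file` helper name and the hard-coded chunk size are illustrative, not part of any Watson SDK:

#include <boost/asio/ip/tcp.hpp>
#include <boost/beast/core.hpp>
#include <boost/beast/ssl.hpp>
#include <boost/beast/websocket.hpp>
#include <boost/beast/websocket/ssl.hpp>
#include <fstream>
#include <string>
#include <vector>

namespace beast = boost::beast;
namespace net = boost::asio;
namespace websocket = beast::websocket;
using tcp = net::ip::tcp;

//Hypothetical helper: one start frame, all audio chunks, then one stop frame
void stream_file(websocket::stream<beast::ssl_stream<tcp::socket>>& ws,
                 const std::string& path)
{
    //1. Send the start frame exactly once per recognition request
    ws.text(true);
    ws.write(net::buffer(std::string(
        R"({"action":"start","content-type":"audio/mulaw;rate=8000"})")));

    //2. Send every audio chunk as a binary frame, back to back
    std::ifstream file(path, std::ios::binary);
    std::vector<char> chunk(100000);
    ws.binary(true);
    while (file.read(chunk.data(), chunk.size()) || file.gcount() > 0)
        ws.write(net::buffer(chunk.data(), static_cast<std::size_t>(file.gcount())));

    //3. Send the stop frame exactly once, after the last chunk
    ws.text(true);
    ws.write(net::buffer(std::string(R"({"action":"stop"})")));
}

Responses should be read concurrently with these writes; the final transcript for the request arrives only after the stop frame.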

data_henrik
  • Well then, how can I use that for real-time transcribing? I am following [the second example exchange](https://cloud.ibm.com/docs/services/speech-to-text?topic=speech-to-text-websockets#secondExample), sending continuous chunks. – RC0993 May 27 '19 at 07:06
  • Hey @data_henrik, I have added the code. It is in C++. I have checked the Python code with the nursery rhyme example. I am not sure I understood it completely, but it seems like the data is sent in chunks (true), yet all of it is sent at once and only at the end is a "stop" frame sent. _PS: I am not a Python pro_ – RC0993 May 27 '19 at 09:16
  • The code isn't showing the driver or flow, only some function definitions. – data_henrik May 27 '19 at 09:42
  • The flow is how I mention in the original question. `...START FRAME >> binary data >> STOP FRAME >> binary data >> STOP FRAME >> binary data >> ... >> STOP FRAME` only after the start and each stop frame I am reading the websocket response – RC0993 May 27 '19 at 09:45
  • I am not sure what your remaining question is. The flow is wrong, but you are not showing it. – data_henrik May 27 '19 at 09:54
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/193991/discussion-between-rc0993-and-data-henrik). – RC0993 May 27 '19 at 09:55
  • @data_henrik is correct, the flow is wrong, it should be: `...START FRAME >> binary data >> binary data >> binary data >> ... >> STOP FRAME` – Daniel Bolanos May 28 '19 at 01:51
  • The flow that you mentioned **certainly works**, but it doesn't give the real-time conversion that the question is about (the comparison between the _other API_, Google, and IBM Watson). I am looking for each chunk to be recognized as early as possible, not after the whole binary stream ends, Sir. – RC0993 May 28 '19 at 04:06
  • Have you tried it? See the examples. Watson STT immediately sends back recognized text and in some cases corrects it when more context is present. – data_henrik May 28 '19 at 05:19 (See the start-frame sketch below for the option that enables this behaviour.)
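
The behaviour described in the last comment is controlled by the `interim_results` parameter of the WebSocket start message (documented in the Watson STT WebSocket interface). A possible variant of the question's START_FRAME, shown purely as an illustration:

//Start frame asking for interim hypotheses: the service streams partial
//results while audio is still arriving and refines them as context grows
static const string START_FRAME =
    R"({"action":"start","content-type":"audio/mulaw;rate=8000","interim_results":true})";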

@data_henrik is correct, the flow is wrong, it should be: ...START FRAME >> binary data >> binary data >> binary data >> ... >> STOP FRAME

You only need to send the {"action":"stop"} message when there are no more audio chunks to send.
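
As a hedged sketch of what that change could look like in the question's own code (names are taken from the question; for simplicity this assumes the audio file is already complete on disk, whereas the original also polls a file still being written):

//Corrected callback chain: "send_binary" calls itself back until the file
//is exhausted, and only then hands off to "send_stop"
void IbmWebsocketSession::send_binary(beast::error_code ec) {

    streamsize bytes_read = mFilestream.rdbuf()->sgetn(&chunk[0], chunk.size());

    if (bytes_read > 0) {   //Chunks below Watson's documented 100-byte minimum
                            //would need buffering rather than a frame of their own
        mWebSocket.binary(true);

        //Call back "send_binary" itself, NOT "send_stop", to keep streaming
        mWebSocket.async_write(net::buffer(&chunk[0], bytes_read),
                bind(&IbmWebsocketSession::send_binary, shared_from_this(), placeholders::_1));
    } else {
        //No more audio: send the {"action":"stop"} frame exactly once
        shared_from_this()->send_stop(ec);
    }
}

A separate `async_read` loop then consumes the interim and final transcripts in parallel with the writes, so each chunk is no longer gated on a reply and the one-second `sleep_for` between chunks becomes unnecessary.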

Daniel Bolanos