
I noticed that when a Windows 10 CU device (from PC to Raspberry Pi) is not connected to any network, speech recognition with a local grammar file (for example the ContinuousRecognitionSession scenario) is remarkably fast: less than 10 milliseconds. But when the device is connected to a Wi-Fi network, it becomes about 10 times slower! And even if the Wi-Fi network has no Internet access, it is still more than 2 times slower!

More precisely, the extra time elapses between the moment the speechRecognizer enters the "SoundEnded" state and the moment the
speechRecognizer.ContinuousRecognitionSession.ResultGenerated
event is invoked.
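
For reference, this is roughly how the recognizer is set up and how the delay can be measured (a simplified C# sketch modeled on the official SRGS sample; the grammar file path and the Stopwatch logging are only illustrative):

using System.Diagnostics;
using System.Threading.Tasks;
using Windows.ApplicationModel;
using Windows.Media.SpeechRecognition;
using Windows.Storage;

public sealed class RecoTimingSketch
{
    private SpeechRecognizer speechRecognizer;
    private readonly Stopwatch stopwatch = new Stopwatch();

    public async Task StartAsync()
    {
        speechRecognizer = new SpeechRecognizer();

        // Local SRGS grammar file packaged with the app (path is illustrative).
        StorageFile grammarFile =
            await Package.Current.InstalledLocation.GetFileAsync("Colors.grxml");
        speechRecognizer.Constraints.Add(
            new SpeechRecognitionGrammarFileConstraint(grammarFile));

        SpeechRecognitionCompilationResult compilation =
            await speechRecognizer.CompileConstraintsAsync();
        if (compilation.Status != SpeechRecognitionResultStatus.Success)
            return;

        // Start timing when the recognizer reports the SoundEnded state...
        speechRecognizer.StateChanged += (sender, args) =>
        {
            if (args.State == SpeechRecognizerState.SoundEnded)
                stopwatch.Restart();
        };

        // ...and read the elapsed time when ResultGenerated is finally raised.
        speechRecognizer.ContinuousRecognitionSession.ResultGenerated += (session, args) =>
        {
            Debug.WriteLine($"'{args.Result.Text}' after {stopwatch.ElapsedMilliseconds} ms");
        };

        await speechRecognizer.ContinuousRecognitionSession.StartAsync();
    }
}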

Is there a way to get full-speed speech recognition even when the device is connected to a network?

Even the official Microsoft UWP sample exhibits the same problem.

Thanks for any ideas :)


UPDATE1:

I have seen that when the Internet connection is active, the app opens a connection to this URL: websockets.platform.bing.com:443 ... If the server does not respond, it retries another 19 times for each speech recognition stage; if the server does respond, the app sends no further connection requests, because it keeps that first HTTPS session open the whole time.
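
A quick way to verify whether that endpoint is reachable from the device (a diagnostic sketch only, not part of the speech API; the host and port are the ones seen in the captured traffic):

using System;
using System.Threading.Tasks;
using Windows.Networking;
using Windows.Networking.Sockets;

// Returns true if a TLS connection to the speech endpoint can be established.
private static async Task<bool> CanReachSpeechEndpointAsync()
{
    try
    {
        using (var socket = new StreamSocket())
        {
            await socket.ConnectAsync(
                new HostName("websockets.platform.bing.com"), "443",
                SocketProtectionLevel.Tls12);
            return true;   // reachable: the recognizer keeps an HTTPS session open
        }
    }
    catch (Exception)
    {
        return false;      // unreachable: recognition stays on the fast, offline-only path
    }
}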


UPDATE2:

I found that when the connection to the server is established, the app sends every detail of the voice recognition session to the server (over HTTPS): everything that is pronounced, the semantics, the confidence scores... but also the device type and model, the geolocation and much, much more! The schema is JSON.
And that's not all: the device also sends the binary audio stream to the server, the server responds with intermediate JSON hypotheses until the stream is complete and, at the end, the server sends the final response.

In detail...

1. at the beginning of the session, the device identifies itself with a packet like this:

X-CU-ApplicationId: FB31CC89-63D2-4296-A806-33DBA8DA56F2 X-CU-ClientVersion: 3.0.150531 X-CU-ConversationId: 01aef68ebfd84a478d70cf5155ba7e82 X-CU-Locale: en-US X-CU-LogLevel: 1 X-CU-RequestId: 4885e9c5-a2be-4167-b27c-5ad55024dcb4 X-LOBBY-MESSAGE-TYPE: connection.context X-Search-IG: 37226844f90542e6b96b5c27891eea97 X-WebSocketMessageId: C#116

{
  "Groups": {
    "ConversationContext": {
      "Id": "ConversationContext",
      "Info": {
        "PreferClientReco": "false",
        "TurnId": "0"
      },
      "Items": []
    },
    "LocalProperties": {
      "Id": "LocalProperties",
      "Info": {
        "AudioSourceType": "None",
        "CurrentTime": "2017-06-07T10:01:47+02:00",
        "DrivingModeActive": "false",
        "GeoLocation": "{\"Uri\":\"entity:\/\/GeoCoordinates\",\"Version\":\"1.0\",\"Latitude\":45.48179547076163,\"Longitude\":9.18281614780426,\"Accuracy\":64}",
        "InCall": "false",
        "IsActiveDisplayHMD": "false",
        "LockState": "Invalid",
        "MicrophoneInfo": "audio stream (xxx)",
        "ModeOfTravel": "Undefined",
        "ProximitySensorState": "Invalid",
        "SpeechAppInitiatedRequest": "false",
        "SystemInfo": "{\"DeviceMake\":\"Microsoft Corporation\",\"DeviceModel\":\"Surface Pro 2\",\"DeviceFamily\":\"Windows.Desktop\",\"OsVersion\":\"6.3\",\"Qfe\":3145728,\"Branch\":\"rs2_release\",\"LanguageCode\":1033,\"Protocol\":\"1.0\",\"OsName\":\"Windows 10 Enterprise\",\"TimeZone\":\"W. Europe Standard Time\",\"RegionalFormatCode\":\"it-IT\",\"Mkt\":\"en-US\",\"CortanaEnabled\":false,\"NonNativeSpeech\":false,\"TestHook\":false}",
        "TetheredDeviceMake": "",
        "TetheredDeviceModel": "",
        "UserAgeClass": "Adult"
      },
      "Items": [{
        "DisplayText": "Adult",
        "Id": "UserAgeClass",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "UserAgeClass"
      }, {
        "DisplayText": "false",
        "Id": "SpeechAppInitiatedRequest",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "SpeechAppInitiatedRequest"
      }, {
        "DisplayText": "Undefined",
        "Id": "ModeOfTravel",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "ModeOfTravel"
      }, {
        "DisplayText": "audio stream (xxx)",
        "Id": "MicrophoneInfo",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "MicrophoneInfo"
      }, {
        "DisplayText": "Invalid",
        "Id": "LockState",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "LockState"
      }, {
        "DisplayText": "Invalid",
        "Id": "ProximitySensorState",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "ProximitySensorState"
      }, {
        "DisplayText": "{\"DeviceMake\":\"Microsoft Corporation\",\"DeviceModel\":\"Surface Pro 2\",\"DeviceFamily\":\"Windows.Desktop\",\"OsVersion\":\"6.3\",\"Qfe\":3145728,\"Branch\":\"rs2_release\",\"LanguageCode\":1033,\"Protocol\":\"1.0\",\"OsName\":\"Windows 10 Enterprise\",\"TimeZone\":\"W. Europe Standard Time\",\"RegionalFormatCode\":\"it-IT\",\"Mkt\":\"en-US\",\"CortanaEnabled\":false,\"NonNativeSpeech\":false,\"TestHook\":false}",
        "Id": "SystemInfo",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "SystemInfo"
      }, {
        "DisplayText": "false",
        "Id": "InCall",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "InCall"
      }, {
        "DisplayText": "",
        "Id": "TetheredDeviceMake",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "TetheredDeviceMake"
      }, {
        "DisplayText": "{\"Uri\":\"entity:\/\/GeoCoordinates\",\"Version\":\"1.0\",\"Latitude\":45.48179547076163,\"Longitude\":9.18281614780426,\"Accuracy\":64}",
        "Id": "GeoLocation",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "GeoLocation"
      }, {
        "DisplayText": "",
        "Id": "TetheredDeviceModel",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "TetheredDeviceModel"
      }, {
        "DisplayText": "false",
        "Id": "DrivingModeActive",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "DrivingModeActive"
      }, {
        "DisplayText": "None",
        "Id": "AudioSourceType",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "AudioSourceType"
      }, {
        "DisplayText": "false",
        "Id": "IsActiveDisplayHMD",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "IsActiveDisplayHMD"
      }, {
        "DisplayText": "2017-06-07T10:01:47+02:00",
        "Id": "CurrentTime",
        "Info": {},
        "Items": [],
        "KnownAs": [],
        "Name": "CurrentTime"
      }]
    },
    "RecoProperties": {
      "Id": "RecoProperties",
      "Info": {
        "ClientType": "3P",
        "KeywordStreamed": "false",
        "ModelRevision": "1",
        "OptIn": "true",
        "Scenario": "WP_GSE",
        "UserAgeClass": "Adult"
      },
      "Items": []
    }
  },
  "OnScreenItems": [],
  "SrScenario": "WP_GSE"
}

2. the device sends the binary packets containing the audio stream

EncodingFormat: 654 Start: True X-CU-ApplicationId: FB31CC89-63D2-4296-A806-33DBA8DA56F2 X-CU-ClientVersion: 3.0.150531 X-CU-ConversationId: 01aef68e-bfd8-4a47-8d70-cf5155ba7e82 X-CU-Locale: en-US X-CU-LogLevel: 1 X-CU-RequestId: 0976c5ec-dd20-4dad-aac6-a8d2d8c0e467 X-CU-UtteranceId: bcd631cd-a3d6-406d-8254-07b469d50258 X-LOBBY-MESSAGE-TYPE: audio.stream.start X-Search-IG: 37226844f90542e6b96b5c27891eea97 X-WebSocketMessageId: C#117
...

3. the client sends its full local recognition result to the server as JSON, while no public event containing that result has been raised on the local side yet

X-CU-ApplicationId: FB31CC89-63D2-4296-A806-33DBA8DA56F2 X-CU-ClientVersion: 3.0.150531 X-CU-ConversationId: 01aef68ebfd84a478d70cf5155ba7e82 X-CU-Locale: en-US X-CU-LogLevel: 1 X-CU-RequestId: c0131af6-c3ef-44b1-84f3-fd22a51cce08 X-CU-UtteranceId: 1a6ec674-32cb-4e3a-9232-99604805291a X-LOBBY-MESSAGE-TYPE: audio.stream.hypothesis X-Search-IG: 37226844f90542e6b96b5c27891eea97 X-WebSocketMessageId: C#128

{
  "AudioSizeTime": 0,
  "Confidence": -2,
  "DisplayText": "yellow background",
  "Grammar": {
    "GrammarContent": "",
    "GrammarUri": "grammar:dynamic",
    "SharingUri": "",
    "Weight": 0.000000
  },
  "InverseTextNormalizationResult": "",
  "LexicalForm": "yellow background",
  "Locale": "",
  "MaskedInverseTextNormalizationResult": "",
  "PhraseElements": [{
    "AudioSizeTime": 0,
    "AudioTimeOffset": 0,
    "Confidence": 1,
    "DisplayAttributes": 2,
    "DisplayText": "yellow",
    "LexicalForm": "yellow",
    "Pronunciation": "jɛlo",
    "SREngineAcousticModelScore": 0.737334,
    "SREngineConfidence": 0.737334,
    "SREngineLanguageModelScore": 0.737334
  }, {
    "AudioSizeTime": 0,
    "AudioTimeOffset": 0,
    "Confidence": -1,
    "DisplayAttributes": 2,
    "DisplayText": "background",
    "LexicalForm": "background",
    "Pronunciation": "bækgɹa͡ʊnd",
    "SREngineAcousticModelScore": 0.057647,
    "SREngineConfidence": 0.057647,
    "SREngineLanguageModelScore": 0.057647
  }],
  "PhrasePredictorSet": {
    "ClassId": "{A6299882-0D05-4B57-8D51-ED4EF5FC43FF}",
    "Id": "{68CA9B14-08C5-48E2-99E8-F5A9209C9D97}",
    "ResultId": "{45988E21-796E-4542-8C06-E0B9074E3EC0}",
    "SequentialFloatPredictorValues": [-99.755013, 2.954085, 6.766057, 0.284852, 0.249412, -4.989283, 12.099265, 0.000000, 0.986047, -2.501928, 0.000000, 0.000000, 0.323272, 136.000000, -1.004268, 0.922584, 207.844193, 12.597917, 9.543732, 0.435455, 30.866814, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 1.966529, 2.098140, 2.000000, -0.420938],
    "Version": 1
  },
  "Properties": [{
    "Children": [{
      "Children": [],
      "Confidence": 1,
      "CountOfElements": 2,
      "FirstElement": 0,
      "Name": "KEY_BACKGROUND",
      "SREngineConfidence": 0.998984,
      "Value": "COLOR_YELLOW"
    }],
    "Confidence": 1,
    "CountOfElements": 2,
    "FirstElement": 0,
    "Name": "_value",
    "SREngineConfidence": 0.998984,
    "Value": ""
  }],
  "Rule": {
    "Children": [{
      "Children": [{
        "Children": [],
        "Confidence": -2,
        "CountOfElements": 1,
        "FirstElement": 0,
        "Name": "color",
        "SREngineAcousticModelScore": 0.000000,
        "SREngineConfidence": 0.737334,
        "SREngineLanguageModelScore": 0.000000
      }],
      "Confidence": -2,
      "CountOfElements": 2,
      "FirstElement": 0,
      "Name": "background_Color",
      "SREngineAcousticModelScore": 0.000000,
      "SREngineConfidence": 0.367504,
      "SREngineLanguageModelScore": 0.000000
    }],
    "Confidence": -2,
    "CountOfElements": 2,
    "FirstElement": 0,
    "Name": "colorChooser",
    "SREngineAcousticModelScore": 0.000000,
    "SREngineConfidence": 0.367504,
    "SREngineLanguageModelScore": 0.000000
  },
  "SREngineConfidence": 0.367504,
  "StartTime": 0
}
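
The semantics in this hypothesis ("KEY_BACKGROUND" = "COLOR_YELLOW", rules "colorChooser" and "background_Color") appear to come from the local SRGS grammar (the rule and key names match the Colors sample grammar), and the same values are what the app eventually reads in the ResultGenerated handler. A minimal sketch of the client side, continuing the setup sketch from the top of the question (the key name is the one from the sample grammar):

// Reading the same semantic interpretation locally once ResultGenerated fires.
speechRecognizer.ContinuousRecognitionSession.ResultGenerated += (session, args) =>
{
    var semantics = args.Result.SemanticInterpretation.Properties;
    if (semantics.ContainsKey("KEY_BACKGROUND"))
    {
        // e.g. "COLOR_YELLOW" - the same value already present in the
        // hypothesis JSON that the device sent to the server.
        string color = semantics["KEY_BACKGROUND"][0];
        Debug.WriteLine($"Background color: {color} ({args.Result.Confidence})");
    }
};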

4. the server produces intermediate XML response packets like this

Content-Type:text/xml X-CU-RequestId:0976c5ec-dd20-4dad-aac6-a8d2d8c0e467 X-FD-ImpressionGUID:37226844-f905-42e6-b96b-5c27891eea97 X-CU-ResultType:IntermediateResult X-Lobby-ServiceResponseStatusCode:200 X-Lobby-ServiceResponseStatusDesc: X-Lobby-ServiceResponseType:IntermediateResponse X-LOBBY-MESSAGE-TYPE:audio.stream.response

<?xml version="1.0" encoding="utf-8"?>
<CUResponse xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" type="IntermediateResponse">
  <Entry type="DebugInfo">
    <Content type="text/xml">
      <DebugInfo xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <TraceID>84610b813bbd43fe809657418b53e954</TraceID>
        <DateTime>2017-06-07T01:01:57.1094987-07:00</DateTime>
        <MachineName>DB5SCH101052018</MachineName>
        <ConversationID>01aef68e-bfd8-4a47-8d70-cf5155ba7e82</ConversationID>
        <PropertyBag />
        <ImpressionGUID>37226844f90542e6b96b5c27891eea97</ImpressionGUID>
        <ServiceVersion />
      </DebugInfo>
    </Content>
  </Entry>
  <Entry type="DisplayText">
    <Content type="text/plain">yellow</Content>
  </Entry>
</CUResponse>

5. the client device sends a binary log data packet

X-LOBBY-MESSAGE-TYPE: log.data
...

6. the server sends the final response (XML wrapping a JSON recognition result)

Content-Type:text/xml X-CU-ResultType:PhraseResult X-CU-ConversationId:01aef68ebfd84a478d70cf5155ba7e82 X-CU-RequestId:0976c5ec-dd20-4dad-aac6-a8d2d8c0e467 X-FD-ImpressionGUID:37226844f90542e6b96b5c27891eea97 X-CU-ServiceVersion:3.0.150531 X-Lobby-ServiceResponseStatusCode:200 X-Lobby-ServiceResponseStatusDesc: X-Lobby-ServiceResponseType:ConversationResponse X-LOBBY-MESSAGE-TYPE:audio.stream.response

<?xml version="1.0" encoding="utf-8"?>
<CUResponse xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" type="ConversationResponse">
  <Entry type="SpeechRecognitionResult">
    <Content type="application/json">
{
  "RecognitionStatus": "200",
  "RecognizedPhrase": {
    "InverseTextNormalizationResults": "yellow background",
    "LexicalForm": "yellow background",
    "DisplayText": "Yellow background.",
    "SREngineConfidence": "0.5927509",
    "PhraseElements": ["jɛlo", "bækgɹa͡ʊnd"],
    "MaskedInverseTextNormalizationResults": "yellow background",
    "DictationPhrases": null,
    "MediaTime": 3000000,
    "MediaDuration": 13000000
  },
  "Alternates": [{
    "InverseTextNormalizationResults": "yellow background",
    "LexicalForm": "yellow background",
    "DisplayText": "Yellow background.",
    "SREngineConfidence": "0.3805378",
    "PhraseElements": ["jɛlo", "bækgɹa͡ʊnd"],
    "MaskedInverseTextNormalizationResults": "yellow background",
    "DictationPhrases": null,
    "MediaTime": 3000000,
    "MediaDuration": 13000000
  }, {
    "InverseTextNormalizationResults": "yellow background",
    "LexicalForm": "yellow background",
    "DisplayText": "Yellow background.",
    "SREngineConfidence": "0.3805378",
    "PhraseElements": ["jɛlo", "bækgɹa͡ʊnd"],
    "MaskedInverseTextNormalizationResults": "yellow background",
    "DictationPhrases": null,
    "MediaTime": 3000000,
    "MediaDuration": 13000000
  }],
  "RecognitionArbitrationResult": "2"
}
</Content>
</Entry>
<Entry type="ConfusionNetworkResult">
  <Content type="text/xml">
    <ConfusionNetworkResult xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <Version>1</Version>
      <WordTable>
        <Word>yellow</Word>
        <Word>background</Word>
      </WordTable>
      <Nodes>
        <Node>
          <AudioTimeOffset>3000000</AudioTimeOffset>
          <FirstFollowingArc>0</FirstFollowingArc>
        </Node>
        <Node>
          <AudioTimeOffset>9600000</AudioTimeOffset>
          <FirstFollowingArc>1</FirstFollowingArc>
        </Node>
        <Node>
          <AudioTimeOffset>15800000</AudioTimeOffset>
          <FirstFollowingArc>65535</FirstFollowingArc>
        </Node>
      </Nodes>
      <Arcs>
        <Arc>
          <PreviousNodeIndex>0</PreviousNodeIndex>
          <NextNodeIndex>1</NextNodeIndex>
          <WordStartIndex>0</WordStartIndex>
          <Score>1.000003</Score>
          <IsLastArc>true</IsLastArc>
        </Arc>
        <Arc>
          <PreviousNodeIndex>1</PreviousNodeIndex>
          <NextNodeIndex>2</NextNodeIndex>
          <WordStartIndex>1</WordStartIndex>
          <Score>1.000003</Score>
          <IsLastArc>true</IsLastArc>
        </Arc>
      </Arcs>
      <BestArcsIndexes>
        <ArcIndex>0</ArcIndex>
        <ArcIndex>1</ArcIndex>
      </BestArcsIndexes>
    </ConfusionNetworkResult>
  </Content>
</Entry>
<Entry type="DebugInfo">
  <Content type="text/xml">
    <DebugInfo xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <TraceID>84610b81-3bbd-43fe-8096-57418b53e954</TraceID>
      <DateTime>2017-06-07T08:01:57.3438758Z</DateTime>
      <MachineName>DB5SCH101052018</MachineName>
      <ConversationID>01aef68e-bfd8-4a47-8d70-cf5155ba7e82</ConversationID>
      <PropertyBag>
        <Property Key="SpeechRecognitionExecution" Value="Complete" />
        <Property Key="LanguageUnderstandingExecution" Value="Error" />
        <Property Key="DialogEngineExecution" Value="Error" />
        <Property Key="CASISessionContextLoadExecution" Value="Skip" />
      </PropertyBag>
      <ImpressionGUID>37226844-f905-42e6-b96b-5c27891eea97</ImpressionGUID>
      <ServiceVersion>3.0.150531</ServiceVersion>
      <Components />
    </DebugInfo>
  </Content>
</Entry>
<Entry type="ErrorEntries">
  <Content type="text/xml">
    <ErrorEntries xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" />
  </Content>
</Entry>
</CUResponse>

7. the client sends a final binary log packet and, finally, the ResultGenerated event is raised

X-LOBBY-MESSAGE-TYPE: log.data è Y7²ô& YÚ©O½È\Ûî C éYÚ©O½È\Ûî, a u d i o . s t r e a m . r e s p o n s e úYÚ©O½È\Ûî ûYÚ©O½È\Ûî ÷YÚ©O½È\ÛîC ìYÚ©O½È\Ûî, a u d i o . s t r e a m . r e s p o n s e C éYÚ©O½È\Ûî, a u d i o . s t r e a m . r e s p o n s e C éYÚ©O½È\Ûî, a u d i o . s t r e a m . r e s p o n s e ùYÚ©O½È\Ûî
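
Note that everything the server echoes back in the final response (display text, lexical form, confidence scores, alternates) is also available locally on the SpeechRecognitionResult once ResultGenerated is raised; as far as I can tell the event carries the result produced by the local engine. A sketch of reading the corresponding fields on the client, again continuing the setup sketch above (illustrative only):

// The fields echoed by the server are also available on the local result.
speechRecognizer.ContinuousRecognitionSession.ResultGenerated += (session, args) =>
{
    SpeechRecognitionResult result = args.Result;
    Debug.WriteLine($"Text: {result.Text}, RawConfidence: {result.RawConfidence}");

    // Alternates, comparable to the "Alternates" array in the final JSON response.
    foreach (SpeechRecognitionResult alternate in result.GetAlternates(3))
    {
        Debug.WriteLine($"  alternate: {alternate.Text} ({alternate.RawConfidence})");
    }
};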


All of this happens before the ContinuousRecognitionSession.ResultGenerated event is invoked, and it happens with a "local" speech recognizer using a local custom SRGS grammar. In the same scenario, if the device is not connected to a network, none of this traffic takes place and the speech recognition result arrives 10-20 times faster!

  • Can you please add any kind of question to your post? For now, your post only contains some facts and it's not really clear what you want. How to speed things up? How to prevent network access and thus increase performance? One can guess, but still. Imho, it looks like the API first tries network-assisted speech recognition and only if it fails moves to the local grammar. Maybe you can force the latter behavior somehow. – Sergey.quixoticaxis.Ivanov Jun 05 '17 at 15:03
  • @Sergey.quixoticaxis.Ivanov I added the phrase: "Is there a way to have the full speed in speech-recognition even when the device is connected to a network?" - thank you – Claudio Cayo Castagnetti Jun 05 '17 at 16:08
  • Btw I've tried to look something up in the UWP documentation but no luck =( – Sergey.quixoticaxis.Ivanov Jun 05 '17 at 17:57
  • I have seen that if the internet connection is active, the app sends a request to this URL: http://websockets.platform.bing.com:443 ... if it does not respond it retries another 19 times for each speech recognition stage, however if the server responds, the application will no longer send requests to the remote server – Claudio Cayo Castagnetti Jun 06 '17 at 09:03
  • Were you able to track what's inside the response? Is it large? I'm not much into speech recognition, but could the API get some SRGSs from Bing to use offline (it's my only idea because it's strange to connect to somewhere only to stay offline afterwards)? – Sergey.quixoticaxis.Ivanov Jun 06 '17 at 09:56
  • And what's inside the request, btw. – Sergey.quixoticaxis.Ivanov Jun 06 '17 at 09:57
  • I found that if the connection to the server happens, the app sends every detail of the voice recognition session to the server (by https): everything that is pronounced, semantics, confidence scores ...but also: the device type and model, geolocation and much much more! The schema is JSON. – Claudio Cayo Castagnetti Jun 06 '17 at 10:17
  • @Sergey.quixoticaxis.Ivanov: yes, in my app the SpeechRecognizer is LOCAL and the grammar is LOCAL and custom! =O Exactly as in this scenario: https://github.com/Microsoft/Windows-universal-samples/blob/master/Samples/SpeechRecognitionAndSynthesis/cs/Scenario_ContinuousRecognitionSRGSGrammar.xaml.cs – Claudio Cayo Castagnetti Jun 06 '17 at 10:22
  • If it's sending lots of data to the server including device model, geolocation and everything, then it may be one more way for Microsoft to gather telemetry for their own recognition projects (which would be in the Bing division, I believe) **lol** It looks like it's by design, but you may still want to post a ticket on [Connect](http://connect.microsoft.com/VisualStudio), if you haven't done it already. – Sergey.quixoticaxis.Ivanov Jun 06 '17 at 12:28
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/145997/discussion-between-claudio-cayo-castagnetti-and-sergey-quixoticaxis-ivanov). – Claudio Cayo Castagnetti Jun 06 '17 at 17:20
  • It's not only telemetry. The app also sends the full audio stream and waits for the Bing server response, but it will not use that response, because it uses what the local SR engine and the local grammar initially produced (and which was itself already sent to the server). – Claudio Cayo Castagnetti Jun 07 '17 at 15:36
  • Related https://stackoverflow.com/questions/32996908/windows-universal-app-continuous-dictation-without-network – Nikolay Shmyrev Jun 18 '17 at 11:22
  • @NikolayShmyrev thank you for your contribution, but in the scenario that I described **the problem is only about the speed** of getting the speech recognizer result: **very different speeds depending on networking, even with the local SRGS grammar!** – Claudio Cayo Castagnetti Jun 18 '17 at 13:26

0 Answers