I noticed that when a Windows 10 Creators Update (CU) device (anything from a PC to a Raspberry Pi) is not connected to any network, speech recognition with a local grammar file (for example, the ContinuousRecognitionSession scenario) performs remarkably well: results arrive in less than 10 milliseconds. When the device is connected to a Wi-Fi network, however, recognition becomes about 10 times slower! And if the Wi-Fi network has no Internet access, it is still more than 2 times slower!
More precisely, the time that grows is the interval between the moment the speechRecognizer enters the “SoundEnded” state and the moment the speechRecognizer.ContinuousRecognitionSession.ResultGenerated event is raised.
Is there a way to get full-speed speech recognition even when the device is connected to a network?
Even the official Microsoft UWP sample shows the same problem.
Thanks for any ideas :)
UPDATE1:
I have seen that if the Internet connection is active, the app sends a request to this URL: websockets.platform.bing.com:443
... if the server does not respond, the app retries another 19 times for each speech recognition stage. However, if the server responds, the app sends no further connection requests, because it keeps the first HTTPS session open the whole time.
UPDATE2:
I found that once the connection to the server is established, the app sends every detail of the voice recognition session to the server over HTTPS: everything that is spoken, the semantics, the confidence scores... but also the device type and model, the geolocation, and much more! The payload is JSON.
But that's not all: the device also sends the binary audio stream to the server, and the server responds with intermediate hypotheses until the stream is complete and then, at the end, sends the final response.
In detail...
1. at the beginning of the session, the device identifies itself with a packet like this:
X-CU-ApplicationId: FB31CC89-63D2-4296-A806-33DBA8DA56F2
X-CU-ClientVersion: 3.0.150531
X-CU-ConversationId: 01aef68ebfd84a478d70cf5155ba7e82
X-CU-Locale: en-US
X-CU-LogLevel: 1
X-CU-RequestId: 4885e9c5-a2be-4167-b27c-5ad55024dcb4
X-LOBBY-MESSAGE-TYPE: connection.context
X-Search-IG: 37226844f90542e6b96b5c27891eea97
X-WebSocketMessageId: C#116
{
"Groups": {
"ConversationContext": {
"Id": "ConversationContext",
"Info": {
"PreferClientReco": "false",
"TurnId": "0"
},
"Items": []
},
"LocalProperties": {
"Id": "LocalProperties",
"Info": {
"AudioSourceType": "None",
"CurrentTime": "2017-06-07T10:01:47+02:00",
"DrivingModeActive": "false",
"GeoLocation": "{\"Uri\":\"entity:\/\/GeoCoordinates\",\"Version\":\"1.0\",\"Latitude\":45.48179547076163,\"Longitude\":9.18281614780426,\"Accuracy\":64}",
"InCall": "false",
"IsActiveDisplayHMD": "false",
"LockState": "Invalid",
"MicrophoneInfo": "audio stream (xxx)",
"ModeOfTravel": "Undefined",
"ProximitySensorState": "Invalid",
"SpeechAppInitiatedRequest": "false",
"SystemInfo": "{\"DeviceMake\":\"Microsoft Corporation\",\"DeviceModel\":\"Surface Pro 2\",\"DeviceFamily\":\"Windows.Desktop\",\"OsVersion\":\"6.3\",\"Qfe\":3145728,\"Branch\":\"rs2_release\",\"LanguageCode\":1033,\"Protocol\":\"1.0\",\"OsName\":\"Windows 10 Enterprise\",\"TimeZone\":\"W. Europe Standard Time\",\"RegionalFormatCode\":\"it-IT\",\"Mkt\":\"en-US\",\"CortanaEnabled\":false,\"NonNativeSpeech\":false,\"TestHook\":false}",
"TetheredDeviceMake": "",
"TetheredDeviceModel": "",
"UserAgeClass": "Adult"
},
"Items": [{
"DisplayText": "Adult",
"Id": "UserAgeClass",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "UserAgeClass"
}, {
"DisplayText": "false",
"Id": "SpeechAppInitiatedRequest",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "SpeechAppInitiatedRequest"
}, {
"DisplayText": "Undefined",
"Id": "ModeOfTravel",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "ModeOfTravel"
}, {
"DisplayText": "audio stream (xxx)",
"Id": "MicrophoneInfo",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "MicrophoneInfo"
}, {
"DisplayText": "Invalid",
"Id": "LockState",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "LockState"
}, {
"DisplayText": "Invalid",
"Id": "ProximitySensorState",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "ProximitySensorState"
}, {
"DisplayText": "{\"DeviceMake\":\"Microsoft Corporation\",\"DeviceModel\":\"Surface Pro 2\",\"DeviceFamily\":\"Windows.Desktop\",\"OsVersion\":\"6.3\",\"Qfe\":3145728,\"Branch\":\"rs2_release\",\"LanguageCode\":1033,\"Protocol\":\"1.0\",\"OsName\":\"Windows 10 Enterprise\",\"TimeZone\":\"W. Europe Standard Time\",\"RegionalFormatCode\":\"it-IT\",\"Mkt\":\"en-US\",\"CortanaEnabled\":false,\"NonNativeSpeech\":false,\"TestHook\":false}",
"Id": "SystemInfo",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "SystemInfo"
}, {
"DisplayText": "false",
"Id": "InCall",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "InCall"
}, {
"DisplayText": "",
"Id": "TetheredDeviceMake",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "TetheredDeviceMake"
}, {
"DisplayText": "{\"Uri\":\"entity:\/\/GeoCoordinates\",\"Version\":\"1.0\",\"Latitude\":45.48179547076163,\"Longitude\":9.18281614780426,\"Accuracy\":64}",
"Id": "GeoLocation",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "GeoLocation"
}, {
"DisplayText": "",
"Id": "TetheredDeviceModel",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "TetheredDeviceModel"
}, {
"DisplayText": "false",
"Id": "DrivingModeActive",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "DrivingModeActive"
}, {
"DisplayText": "None",
"Id": "AudioSourceType",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "AudioSourceType"
}, {
"DisplayText": "false",
"Id": "IsActiveDisplayHMD",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "IsActiveDisplayHMD"
}, {
"DisplayText": "2017-06-07T10:01:47+02:00",
"Id": "CurrentTime",
"Info": {},
"Items": [],
"KnownAs": [],
"Name": "CurrentTime"
}]
},
"RecoProperties": {
"Id": "RecoProperties",
"Info": {
"ClientType": "3P",
"KeywordStreamed": "false",
"ModelRevision": "1",
"OptIn": "true",
"Scenario": "WP_GSE",
"UserAgeClass": "Adult"
},
"Items": []
}
},
"OnScreenItems": [],
"SrScenario": "WP_GSE"
}
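For anyone who wants to inspect a capture like this, here is a minimal Python sketch that splits one text frame into its header lines and its body, then decodes one of the JSON-inside-a-string fields. The function name `parse_message` and the trimmed `sample` are my own illustration, and the "headers first, then a body starting with `{` or `<`" layout is only an assumption based on the dumps in this post, not a documented format.

```python
import json

def parse_message(raw):
    """Split one captured text frame into (headers dict, body string).

    Assumes 'Name: value' header lines followed by a body that starts
    with '{' (JSON) or '<' (XML), as in the dumps in this post.
    """
    headers = {}
    lines = raw.strip().splitlines()
    body_start = len(lines)  # stays here if the frame has no body
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith(("{", "<")):
            body_start = i
            break
        name, sep, value = stripped.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers, "\n".join(lines[body_start:])

# Heavily trimmed sample based on the connection.context packet above.
sample = r'''X-CU-Locale: en-US
X-LOBBY-MESSAGE-TYPE: connection.context
{"Groups": {"LocalProperties": {"Info": {
  "GeoLocation": "{\"Uri\":\"entity:\/\/GeoCoordinates\",\"Accuracy\":64}"
}}}}'''

headers, body = parse_message(sample)
info = json.loads(body)["Groups"]["LocalProperties"]["Info"]
# GeoLocation (like SystemInfo) is itself a JSON document stored as a string.
geo = json.loads(info["GeoLocation"])
print(headers["X-LOBBY-MESSAGE-TYPE"], geo["Uri"], geo["Accuracy"])
```

The second `json.loads` is the important part: fields such as GeoLocation and SystemInfo have to be decoded twice before their contents are readable.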
2. the device sends all the binary packets containing the audio stream:
EncodingFormat: 654
Start: True
X-CU-ApplicationId: FB31CC89-63D2-4296-A806-33DBA8DA56F2
X-CU-ClientVersion: 3.0.150531
X-CU-ConversationId: 01aef68e-bfd8-4a47-8d70-cf5155ba7e82
X-CU-Locale: en-US
X-CU-LogLevel: 1
X-CU-RequestId: 0976c5ec-dd20-4dad-aac6-a8d2d8c0e467
X-CU-UtteranceId: bcd631cd-a3d6-406d-8254-07b469d50258
X-LOBBY-MESSAGE-TYPE: audio.stream.start
X-Search-IG: 37226844f90542e6b96b5c27891eea97
X-WebSocketMessageId: C#117
...
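Note that the same X-CU-ConversationId appears without hyphens in the connection.context packet and with hyphens here. When correlating frames from a capture, a small Python sketch like the following can normalize both spellings; `same_id` is a hypothetical helper name, and it relies only on the standard `uuid` module, which accepts either form.

```python
import uuid

def same_id(a, b):
    """Compare two GUID strings regardless of hyphens or surrounding braces."""
    return uuid.UUID(a) == uuid.UUID(b)

context_id = "01aef68ebfd84a478d70cf5155ba7e82"      # from connection.context
audio_id = "01aef68e-bfd8-4a47-8d70-cf5155ba7e82"    # from audio.stream.start
print(same_id(context_id, audio_id))
```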
3. the client already sends all its results to the server as JSON, even though no public event carrying those results has been raised on the local side yet:
X-CU-ApplicationId: FB31CC89-63D2-4296-A806-33DBA8DA56F2
X-CU-ClientVersion: 3.0.150531
X-CU-ConversationId: 01aef68ebfd84a478d70cf5155ba7e82
X-CU-Locale: en-US
X-CU-LogLevel: 1
X-CU-RequestId: c0131af6-c3ef-44b1-84f3-fd22a51cce08
X-CU-UtteranceId: 1a6ec674-32cb-4e3a-9232-99604805291a
X-LOBBY-MESSAGE-TYPE: audio.stream.hypothesis
X-Search-IG: 37226844f90542e6b96b5c27891eea97
X-WebSocketMessageId: C#128
{
"AudioSizeTime": 0,
"Confidence": -2,
"DisplayText": "yellow background",
"Grammar": {
"GrammarContent": "",
"GrammarUri": "grammar:dynamic",
"SharingUri": "",
"Weight": 0.000000
},
"InverseTextNormalizationResult": "",
"LexicalForm": "yellow background",
"Locale": "",
"MaskedInverseTextNormalizationResult": "",
"PhraseElements": [{
"AudioSizeTime": 0,
"AudioTimeOffset": 0,
"Confidence": 1,
"DisplayAttributes": 2,
"DisplayText": "yellow",
"LexicalForm": "yellow",
"Pronunciation": "jɛlo",
"SREngineAcousticModelScore": 0.737334,
"SREngineConfidence": 0.737334,
"SREngineLanguageModelScore": 0.737334
}, {
"AudioSizeTime": 0,
"AudioTimeOffset": 0,
"Confidence": -1,
"DisplayAttributes": 2,
"DisplayText": "background",
"LexicalForm": "background",
"Pronunciation": "bækgɹa͡ʊnd",
"SREngineAcousticModelScore": 0.057647,
"SREngineConfidence": 0.057647,
"SREngineLanguageModelScore": 0.057647
}],
"PhrasePredictorSet": {
"ClassId": "{A6299882-0D05-4B57-8D51-ED4EF5FC43FF}",
"Id": "{68CA9B14-08C5-48E2-99E8-F5A9209C9D97}",
"ResultId": "{45988E21-796E-4542-8C06-E0B9074E3EC0}",
"SequentialFloatPredictorValues": [-99.755013, 2.954085, 6.766057, 0.284852, 0.249412, -4.989283, 12.099265, 0.000000, 0.986047, -2.501928, 0.000000, 0.000000, 0.323272, 136.000000, -1.004268, 0.922584, 207.844193, 12.597917, 9.543732, 0.435455, 30.866814, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 1.966529, 2.098140, 2.000000, -0.420938],
"Version": 1
},
"Properties": [{
"Children": [{
"Children": [],
"Confidence": 1,
"CountOfElements": 2,
"FirstElement": 0,
"Name": "KEY_BACKGROUND",
"SREngineConfidence": 0.998984,
"Value": "COLOR_YELLOW"
}],
"Confidence": 1,
"CountOfElements": 2,
"FirstElement": 0,
"Name": "_value",
"SREngineConfidence": 0.998984,
"Value": ""
}],
"Rule": {
"Children": [{
"Children": [{
"Children": [],
"Confidence": -2,
"CountOfElements": 1,
"FirstElement": 0,
"Name": "color",
"SREngineAcousticModelScore": 0.000000,
"SREngineConfidence": 0.737334,
"SREngineLanguageModelScore": 0.000000
}],
"Confidence": -2,
"CountOfElements": 2,
"FirstElement": 0,
"Name": "background_Color",
"SREngineAcousticModelScore": 0.000000,
"SREngineConfidence": 0.367504,
"SREngineLanguageModelScore": 0.000000
}],
"Confidence": -2,
"CountOfElements": 2,
"FirstElement": 0,
"Name": "colorChooser",
"SREngineAcousticModelScore": 0.000000,
"SREngineConfidence": 0.367504,
"SREngineLanguageModelScore": 0.000000
},
"SREngineConfidence": 0.367504,
"StartTime": 0
}
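To show how much the client has already worked out locally at this point, here is a short Python sketch that pulls the per-word confidences and the semantic key/value pair out of a hypothesis like the one above. The JSON is trimmed to the fields used, and the helper names `word_confidences` and `semantic_values` are hypothetical, not part of any SDK.

```python
import json

# Trimmed copy of the audio.stream.hypothesis body shown above.
hypothesis = json.loads("""
{
  "DisplayText": "yellow background",
  "PhraseElements": [
    {"DisplayText": "yellow", "SREngineConfidence": 0.737334},
    {"DisplayText": "background", "SREngineConfidence": 0.057647}
  ],
  "Properties": [{
    "Name": "_value", "Value": "",
    "Children": [
      {"Name": "KEY_BACKGROUND", "Value": "COLOR_YELLOW", "Children": []}
    ]
  }]
}
""")

def word_confidences(h):
    """Return (word, engine confidence) pairs in spoken order."""
    return [(e["DisplayText"], e["SREngineConfidence"])
            for e in h["PhraseElements"]]

def semantic_values(props):
    """Flatten the nested Properties tree into {name: value}, skipping empty values."""
    found = {}
    for prop in props:
        if prop.get("Value"):
            found[prop["Name"]] = prop["Value"]
        found.update(semantic_values(prop.get("Children", [])))
    return found

print(word_confidences(hypothesis))
print(semantic_values(hypothesis["Properties"]))
```

In other words, the local recognizer has already resolved the SRGS semantics (KEY_BACKGROUND = COLOR_YELLOW) before the server has answered.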
4. the server produces XML intermediate response packets like this:
Content-Type:text/xml
X-CU-RequestId:0976c5ec-dd20-4dad-aac6-a8d2d8c0e467
X-FD-ImpressionGUID:37226844-f905-42e6-b96b-5c27891eea97
X-CU-ResultType:IntermediateResult
X-Lobby-ServiceResponseStatusCode:200
X-Lobby-ServiceResponseStatusDesc:
X-Lobby-ServiceResponseType:IntermediateResponse
X-LOBBY-MESSAGE-TYPE:audio.stream.response
<?xml version="1.0" encoding="utf-8"?>
<CUResponse xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" type="IntermediateResponse">
<Entry type="DebugInfo">
<Content type="text/xml">
<DebugInfo xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<TraceID>84610b813bbd43fe809657418b53e954</TraceID>
<DateTime>2017-06-07T01:01:57.1094987-07:00</DateTime>
<MachineName>DB5SCH101052018</MachineName>
<ConversationID>01aef68e-bfd8-4a47-8d70-cf5155ba7e82</ConversationID>
<PropertyBag />
<ImpressionGUID>37226844f90542e6b96b5c27891eea97</ImpressionGUID>
<ServiceVersion />
</DebugInfo>
</Content>
</Entry>
<Entry type="DisplayText">
<Content type="text/plain">yellow</Content>
</Entry>
</CUResponse>
5. the client device sends a binary log data packet:
X-LOBBY-MESSAGE-TYPE: log.data
...
6. the server sends the final response, an XML envelope whose SpeechRecognitionResult entry contains JSON:
Content-Type:text/xml
X-CU-ResultType:PhraseResult
X-CU-ConversationId:01aef68ebfd84a478d70cf5155ba7e82
X-CU-RequestId:0976c5ec-dd20-4dad-aac6-a8d2d8c0e467
X-FD-ImpressionGUID:37226844f90542e6b96b5c27891eea97
X-CU-ServiceVersion:3.0.150531
X-Lobby-ServiceResponseStatusCode:200
X-Lobby-ServiceResponseStatusDesc:
X-Lobby-ServiceResponseType:ConversationResponse
X-LOBBY-MESSAGE-TYPE:audio.stream.response
<?xml version="1.0" encoding="utf-8"?>
<CUResponse xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" type="ConversationResponse">
<Entry type="SpeechRecognitionResult">
<Content type="application/json">
{
"RecognitionStatus": "200",
"RecognizedPhrase": {
"InverseTextNormalizationResults": "yellow background",
"LexicalForm": "yellow background",
"DisplayText": "Yellow background.",
"SREngineConfidence": "0.5927509",
"PhraseElements": ["jɛlo", "bækgɹa͡ʊnd"],
"MaskedInverseTextNormalizationResults": "yellow background",
"DictationPhrases": null,
"MediaTime": 3000000,
"MediaDuration": 13000000
},
"Alternates": [{
"InverseTextNormalizationResults": "yellow background",
"LexicalForm": "yellow background",
"DisplayText": "Yellow background.",
"SREngineConfidence": "0.3805378",
"PhraseElements": ["jɛlo", "bækgɹa͡ʊnd"],
"MaskedInverseTextNormalizationResults": "yellow background",
"DictationPhrases": null,
"MediaTime": 3000000,
"MediaDuration": 13000000
}, {
"InverseTextNormalizationResults": "yellow background",
"LexicalForm": "yellow background",
"DisplayText": "Yellow background.",
"SREngineConfidence": "0.3805378",
"PhraseElements": ["jɛlo", "bækgɹa͡ʊnd"],
"MaskedInverseTextNormalizationResults": "yellow background",
"DictationPhrases": null,
"MediaTime": 3000000,
"MediaDuration": 13000000
}],
"RecognitionArbitrationResult": "2"
}
</Content>
</Entry>
<Entry type="ConfusionNetworkResult">
<Content type="text/xml">
<ConfusionNetworkResult xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Version>1</Version>
<WordTable>
<Word>yellow</Word>
<Word>background</Word>
</WordTable>
<Nodes>
<Node>
<AudioTimeOffset>3000000</AudioTimeOffset>
<FirstFollowingArc>0</FirstFollowingArc>
</Node>
<Node>
<AudioTimeOffset>9600000</AudioTimeOffset>
<FirstFollowingArc>1</FirstFollowingArc>
</Node>
<Node>
<AudioTimeOffset>15800000</AudioTimeOffset>
<FirstFollowingArc>65535</FirstFollowingArc>
</Node>
</Nodes>
<Arcs>
<Arc>
<PreviousNodeIndex>0</PreviousNodeIndex>
<NextNodeIndex>1</NextNodeIndex>
<WordStartIndex>0</WordStartIndex>
<Score>1.000003</Score>
<IsLastArc>true</IsLastArc>
</Arc>
<Arc>
<PreviousNodeIndex>1</PreviousNodeIndex>
<NextNodeIndex>2</NextNodeIndex>
<WordStartIndex>1</WordStartIndex>
<Score>1.000003</Score>
<IsLastArc>true</IsLastArc>
</Arc>
</Arcs>
<BestArcsIndexes>
<ArcIndex>0</ArcIndex>
<ArcIndex>1</ArcIndex>
</BestArcsIndexes>
</ConfusionNetworkResult>
</Content>
</Entry>
<Entry type="DebugInfo">
<Content type="text/xml">
<DebugInfo xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<TraceID>84610b81-3bbd-43fe-8096-57418b53e954</TraceID>
<DateTime>2017-06-07T08:01:57.3438758Z</DateTime>
<MachineName>DB5SCH101052018</MachineName>
<ConversationID>01aef68e-bfd8-4a47-8d70-cf5155ba7e82</ConversationID>
<PropertyBag>
<Property Key="SpeechRecognitionExecution" Value="Complete" />
<Property Key="LanguageUnderstandingExecution" Value="Error" />
<Property Key="DialogEngineExecution" Value="Error" />
<Property Key="CASISessionContextLoadExecution" Value="Skip" />
</PropertyBag>
<ImpressionGUID>37226844-f905-42e6-b96b-5c27891eea97</ImpressionGUID>
<ServiceVersion>3.0.150531</ServiceVersion>
<Components />
</DebugInfo>
</Content>
</Entry>
<Entry type="ErrorEntries">
<Content type="text/xml">
<ErrorEntries xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" />
</Content>
</Entry>
</CUResponse>
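Since the final response is JSON wrapped in XML, reading it takes two parsing steps. The Python sketch below shows one way to do that with the standard library only; the XML is a trimmed copy of the response above, and `recognition_result` is a hypothetical helper name.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed copy of the ConversationResponse shown above.
final_xml = """<?xml version="1.0" encoding="utf-8"?>
<CUResponse type="ConversationResponse">
  <Entry type="SpeechRecognitionResult">
    <Content type="application/json">
      {"RecognitionStatus": "200",
       "RecognizedPhrase": {"DisplayText": "Yellow background.",
                            "SREngineConfidence": "0.5927509"}}
    </Content>
  </Entry>
  <Entry type="ErrorEntries">
    <Content type="text/xml" />
  </Entry>
</CUResponse>"""

def recognition_result(xml_text):
    """Return the decoded JSON of the SpeechRecognitionResult entry, or None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    for entry in root.findall("Entry"):
        if entry.get("type") == "SpeechRecognitionResult":
            return json.loads(entry.find("Content").text)
    return None

result = recognition_result(final_xml)
phrase = result["RecognizedPhrase"]
# The server reports the confidence as a string, so convert it explicitly.
print(phrase["DisplayText"], float(phrase["SREngineConfidence"]))
```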
7. the client sends a final binary log data packet, and only then is the ResultGenerated event raised:
X-LOBBY-MESSAGE-TYPE: log.data
è Y7²ô& YÚ©O½È\Ûî C éYÚ©O½È\Ûî, a u d i o . s t r e a m . r e s p o n s e úYÚ©O½È\Ûî ûYÚ©O½È\Ûî ÷YÚ©O½È\ÛîC ìYÚ©O½È\Ûî, a u d i o . s t r e a m . r e s p o n s e C éYÚ©O½È\Ûî, a u d i o . s t r e a m . r e s p o n s e C éYÚ©O½È\Ûî, a u d i o . s t r e a m . r e s p o n s e ùYÚ©O½È\Ûî
All of this happens before the ContinuousRecognitionSession.ResultGenerated event is raised, and it happens with a "local" speech recognizer that uses a local custom SRGS grammar. In the same scenario, if the device is not connected to a network, none of this exchange takes place and the speech recognition result arrives 10-20 times sooner!
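The DebugInfo entries in the two server responses carry timestamps, so part of the delay can be read straight from the capture. Here is a rough Python sketch that compares them; `parse_ts` is a hypothetical helper, and the result only measures the gap between the server's intermediate and final responses in this one capture, not the whole round trip.

```python
import re
from datetime import datetime, timezone

def parse_ts(text):
    """Parse the ISO 8601 timestamps seen in the capture and return UTC."""
    text = text.replace("Z", "+00:00")
    # datetime stores microseconds only, so drop digits beyond the sixth.
    text = re.sub(r"(\.\d{6})\d+", r"\1", text)
    return datetime.fromisoformat(text).astimezone(timezone.utc)

intermediate = parse_ts("2017-06-07T01:01:57.1094987-07:00")  # step 4
final = parse_ts("2017-06-07T08:01:57.3438758Z")               # step 6

gap = final - intermediate
print(f"intermediate -> final response: {gap.total_seconds() * 1000:.0f} ms")
```

Even this single server-side step is far longer than the under-10-millisecond result I get offline.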