I am interested in querying Amazon Alexa with MP3 files that contain voice commands. I know that Amazon has endpoints (the AVS SpeechRecognizer 2.3 interface) that take in MP3 files, but I am not sure whether this will actually query the Alexa service -- or, more importantly, interact with my skill. Any help would be appreciated!
I did this like 4+ years ago. I had to create a virtual AVS device and then a bespoke client using the AVS API. It's a lot of work.
Is there something special about these files that means you need to use MP3? You can batch-test audio files for speech recognition (is the speech-to-text output what you expect?) with the ASR tool in the developer console.
With the NLU evaluation tool in the console, you can batch-test utterances (formatted in JSON) to see which intents they trigger and what slot values they return.
And if you're working on unit tests for multi-utterance exchanges, you can use the ASK CLI or the ASK SMAPI API for automation.
The only one of these that uses MP3s is the ASR tool. The rest work with text.
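For the SMAPI route, kicking off a scripted text exchange looks roughly like this. This is a minimal sketch in Python, assuming the `ask-smapi-sdk` package and the LWA credentials that `ask configure` stores for an ASK CLI profile; the method and model names mirror the simulateSkillV2 SMAPI operation and should be verified against the generated SDK:

```python
# pip install ask-smapi-sdk ask-smapi-model
from ask_smapi_sdk import StandardSmapiClientBuilder
# Model import paths follow the generated ask-smapi-model layout and may differ.
from ask_smapi_model.v2.skill.simulations.simulations_api_request import SimulationsApiRequest
from ask_smapi_model.v2.skill.simulations.input import Input
from ask_smapi_model.v2.skill.simulations.device import Device

# Placeholders: use the client id/secret and refresh token from your ASK CLI profile.
smapi_client = StandardSmapiClientBuilder(
    client_id="<lwa-client-id>",
    client_secret="<lwa-client-secret>",
    refresh_token="<lwa-refresh-token>",
).client()

# Simulate a text utterance against the development stage of the skill.
response = smapi_client.simulate_skill_v2(
    skill_id="<skill-id>",
    stage="development",
    simulations_api_request=SimulationsApiRequest(
        input=Input(content="open my example skill"),
        device=Device(locale="en-US"),
    ),
)
# Simulations run asynchronously; poll the getSkillSimulationV2 operation
# with this id until the status is no longer in progress.
print(response.id, response.status)
```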

– LetMyPeopleCode
- Hey @letmypeoplecode, thank you for the response. I am a little discouraged that there is so much work required to get this working. The reason behind the MP3s is that I want to test how Alexa's language-understanding module behaves when given a voice command (though I can change the format from MP3 to another audio format). For this reason, I want to pass thousands of voice files to Alexa and see how they interact with my skill. AFAIK, this doesn't seem possible in the developer console. – indispinablenorm Dec 14 '20 at 00:11
- The NLU evaluation tool can't be used, since I want to see how different voice commands are processed and then evaluated against my skill. You said you were able to get this working with a "virtual AVS device and then a bespoke client using the AVS API" -- could you provide any hints for getting that working? Do you have another suggestion for how I could do what I want? Could https://echosim.io/welcome or something similar be used? Do you know how the simulator connects to Alexa and passes in commands? – indispinablenorm Dec 14 '20 at 00:14
- There is no one complete solution. ASR tests MP3s (or OGGs, among other formats) to see if the words Alexa gets from them are what you expect. NLU lets you test whether the words you got triggered the right intents with the right slot values. Then the ASK CLI or ASK SMAPI lets you script exchanges using them to see if the results are what you expect. AVS won't give you text from either the incoming or outgoing speech. I used my tool to repeat utterances while coding a skill in an office where people would have tried to kill me if I kept repeating the same phrase over and over (and I wore headphones). – LetMyPeopleCode Dec 14 '20 at 02:24
- The ASK CLI, ASK SMAPI SDK, and ASK Toolkit for VS Code are all open-sourced on GitHub if you want to poke around in their code. The testing APIs are here: https://developer.amazon.com/en-US/docs/alexa/smapi/skill-testing-operations.html. Basically, ASR evaluation lets you know what Alexa would "hear" from your sound files. All the tools that test your skill's dialog model or back end require you to submit a text string. And once you know the speech-to-text results for a sound file, why do you need to run it over and over instead of using the text? – LetMyPeopleCode Dec 14 '20 at 02:40
- The goal of this was to understand how Alexa processes MP3s internally and triggers intents. If we were to use ASR to extract speech-to-text and then use NLU to see whether the text triggered the right intents, would this be the same process as passing a voice command to Alexa? We believe that certain phrases that invoke our skill are being processed incorrectly by Alexa and want to test them through a barrage of invocations we are creating. – indispinablenorm Dec 14 '20 at 21:21
- Yes, that is the sequence of steps Alexa goes through: Automatic Speech Recognition (ASR) to get words from the sounds, then Natural Language Understanding (NLU) to understand the intent of the words and reconcile them with your dialog model. So: first test whether the words are right (ASR), then use those words to test whether the phrases are understood correctly against your model (NLU), so the right intents get called with the right slot values. (A sketch of this two-step check appears after the comments.) – LetMyPeopleCode Dec 15 '20 at 20:17
- I assume that process would be entirely automatable with the AVS API client? – indispinablenorm Dec 16 '20 at 22:35
- Nope. AVS is for making devices: you send sound, you get sound back. For ASR and NLU, you'd automate with the ASK SMAPI SDK. https://developer.amazon.com/en-US/blogs/alexa/alexa-skills-kit/2020/05/three-tips-for-coding-with-the-alexa-smapi-sdk – LetMyPeopleCode Dec 18 '20 at 00:46
- Hey @LetMyPeopleCode, this is probably not the best way to get your attention, but I asked another Alexa developer question regarding ASR. Please let me know if you can help! https://stackoverflow.com/questions/65854896/bulk-edit-for-annotation-set-in-alexa-asr-not-working – indispinablenorm Jan 23 '21 at 01:35