I am trying to call the Azure Pronunciation Assessment service through its REST API from my PHP Laravel application. It has been frustrating to figure out because of the lack of API documentation from Azure, but I have managed to get a 200 response with data.

Unfortunately, the data the API returns always has an "AccuracyScore" of 0.0. When I test with Azure's Speech Studio, my accuracy is in the high 90s, close to 100%.

My only guess is that the audio file I send through the API is not being processed correctly for some reason. I'm hoping someone here has experience with this and can help me figure out what I'm doing wrong.

First, I record the audio with JavaScript in my Vue application like this:

methods: {
    recordAudio() {
      navigator.mediaDevices.getUserMedia({audio: true, video: false})
        .then(stream => {
          this.mediaRecorder = new MediaRecorder(stream);
          this.mediaRecorder.addEventListener('start', this.onRecordingStart);
          this.mediaRecorder.addEventListener('stop', this.onRecordingStop);
          this.mediaRecorder.addEventListener('dataavailable', this.onRecordingDataAvailable);
          this.mediaRecorder.start();
        })
        .catch(error => {
          console.log(error);
        });
    },
    stopRecording() {
      this.mediaRecorder.stop();
    },
    onRecordingStart() {
      this.isRecording = true;
    },
    onRecordingDataAvailable(event) {
      this.audioChunks.push(event.data);
    },
    onRecordingStop() {
      this.isRecording = false;
      const audioBlob = new Blob(this.audioChunks, {'type': 'audio/wav'});
      this.assessPronunciation(audioBlob);
      const audioUrl = URL.createObjectURL(audioBlob);
      this.audio = new Audio(audioUrl);
    },
    assessPronunciation(audioBlob) {
      const formData = new FormData();
      formData.append('audio', audioBlob, 'recording.wav');
      formData.append('text', this.text);
      axios.post('/api/pronunciation-assessment', formData)
        .then(res => {
        })
        .catch(err => {
          console.log(err);
        });
    },

You can see in the assessPronunciation method that I send the resulting WAV blob to my backend.
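
One thing worth checking at this step: the `type` option passed to `new Blob(...)` is only a MIME label and does not transcode the recorded data, and `MediaRecorder` in most browsers emits a compressed container such as WebM or Ogg rather than WAV. Below is a minimal sketch, assuming the raw samples are already available as a mono `Float32Array` (for example after decoding the recording with the Web Audio API), of packing them into a 16-bit PCM WAV buffer. The name `encodeWav` and the 16000 Hz default are illustrative assumptions, not part of the code above, and the sample rate passed in must match the rate of the samples.

```javascript
// Hypothetical helper: packs mono Float32 samples (range -1..1) into a
// 16-bit PCM WAV file. Returns an ArrayBuffer: 44 header bytes + sample data.
function encodeWav(samples, sampleRate = 16000) {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + dataSize, true);   // total file size minus 8 bytes
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true);             // fmt chunk size for PCM
  view.setUint16(20, 1, true);              // audio format 1 = PCM
  view.setUint16(22, 1, true);              // channel count: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * bytesPerSample, true); // byte rate
  view.setUint16(32, bytesPerSample, true); // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeAscii(36, 'data');
  view.setUint32(40, dataSize, true);

  // Clamp each sample and scale it to the signed 16-bit range.
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

The result could then be wrapped as `new Blob([encodeWav(samples)], {type: 'audio/wav'})` before being appended to the form data, so that the label and the actual bytes agree.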

On the backend my controller that receives the request looks like this:

public function apiPostPronunciationAssessment(
        Request $request,
        AzureSpeechServicesApiClient $speechClient
    ): string {
        $audio = $request->file('audio');
        $text = $request->get('text');

        return $speechClient->assessPronunciation($text, $audio->getContent());
    }

The Azure API client that the controller uses looks like this:

<?php

namespace App\Services\Speech;

use GuzzleHttp\Client;
use GuzzleHttp\RequestOptions;
use Illuminate\Config\Repository;

class AzureSpeechServicesApiClient
{
    private string $key;
    private string $region;
    private string $pronunciationEndpoint;
    private Client $client;

    public function __construct(Repository $config)
    {
        $this->key = $config->get('services.azureSpeech.key');
        $this->region = $config->get('services.azureSpeech.location');
        $this->pronunciationEndpoint =
            "https://{$this->region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=:lang";
    }

    public function assessPronunciation(string $text, string $audio): string
    {
        $response = $this->client()->post(
            $this->pronunciationEndpoint(),
            [
                RequestOptions::HEADERS => $this->pronunciationHeaders($text),
                RequestOptions::BODY => $audio
            ]
        );

        return $response->getBody()->getContents();
    }

    public function region(): string
    {
        return $this->region;
    }

    private function client(): Client
    {
        if (!isset($this->client)) {
            $this->client = new Client();
        }

        return $this->client;
    }

    private function pronunciationHeaders(string $text): array
    {
        return [
            'Ocp-Apim-Subscription-Key' => $this->key,
            'Content-Type' => 'audio/wav',
            'Accept' => 'application/json;text/xml',
            'Pronunciation-Assessment' => base64_encode(json_encode([
                'ReferenceText' => $text,
                'GradingSystem' => 'HundredMark',
                'PhonemeAlphabet' => 'IPA',
            ])),

        ];
    }

    private function pronunciationEndpoint(): string
    {
        $language = targetLang() === "en" ? "en-US" : "es-ES";

        return str_replace(':lang', $language, $this->pronunciationEndpoint);
    }
}

The result I get back from the Azure API is something like this:

{
  "RecognitionStatus": "Success",
  "Offset": 5700000,
  "Duration": 1100000,
  "NBest": [
    {
      "Confidence": 0.84944737,
      "Lexical": "crook",
      "ITN": "crook",
      "MaskedITN": "crook",
      "Display": "Crook.",
      "AccuracyScore": 0.0,
      "Words": [
        {
          "Word": "crook",
          "Offset": 5700000,
          "Duration": 1100000,
          "Confidence": 0.0,
          "AccuracyScore": 0.0,
          "Syllables": [...],
          "Phonemes": [...]
        }
      ]
    }
  ],
  "DisplayText": "Crook."
}
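
To make a response like this easier to compare between runs, here is a minimal sketch that pulls the per-word scores out of the parsed JSON. It assumes the response shape shown above and that `Offset` and `Duration` are expressed in 100-nanosecond units. The function name `summarizeAssessment` and the output field names are hypothetical.

```javascript
// Hypothetical helper: flattens the first NBest entry of a parsed response
// into one record per word, with timings converted to seconds.
const TICKS_PER_SECOND = 10000000; // assumes 100-nanosecond units

function summarizeAssessment(response) {
  const best = (response.NBest || [])[0];
  if (!best) return [];
  return (best.Words || []).map((w) => ({
    word: w.Word,
    accuracy: w.AccuracyScore,
    startSec: w.Offset / TICKS_PER_SECOND,
    durationSec: w.Duration / TICKS_PER_SECOND,
  }));
}
```

On the backend response this would be called as `summarizeAssessment(res.data)` inside the `axios` success handler, assuming the controller's string is returned as JSON.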

I cannot figure out for the life of me why the service cannot produce a good accuracy assessment from this. I have tested saving the audio from the request into a local .wav file, and it plays the word I say without any problem. So while the problem might be the audio I am sending, I have no idea what exactly would be wrong with it.
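
Successful playback does not by itself confirm the container, since most players detect the format from the content rather than from the `.wav` extension. A quick sanity check, sketched here under the assumption that the uploaded bytes are available as a `Uint8Array`, is to look for the `RIFF` and `WAVE` markers at the start of the file. `looksLikeWav` is a hypothetical name. WebM data would instead start with the EBML bytes `1A 45 DF A3`.

```javascript
// Hypothetical helper: returns true when the bytes start with a RIFF/WAVE
// header. It checks only the container markers, not the audio encoding.
function looksLikeWav(bytes) {
  if (bytes.length < 12) return false;
  const ascii = (start, end) =>
    String.fromCharCode(...bytes.slice(start, end));
  return ascii(0, 4) === 'RIFF' && ascii(8, 12) === 'WAVE';
}
```

In the browser the input could be obtained with `new Uint8Array(await audioBlob.arrayBuffer())`. A `false` result for the recorded blob would suggest the file is not actually WAV despite its label.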

Can anyone see what the problem could be here?

Chris
  • Hi, I am Yulin Li from the Azure Speech team. I am looking into the issue you ran into and will post an update when I have something. – Yulin Li May 29 '23 at 15:26
  • Hi @Yulin Li. I should update you on this. I found out that the audio was actually in an incorrect format after all, and that was what caused the issue. I fixed the audio format and started getting results. Unfortunately, because of inaccurate phoneme results that occur even in your own Speech Studio, I wasn't able to use your service. I opened a thread about it here: https://learn.microsoft.com/en-us/answers/questions/1284533/azure-pronunciation-assessment-returning-the-same but so far no one has helped me address the problem. Should I open a Stack Overflow question about it? – Chris May 29 '23 at 17:26
  • Hi, I will ask my colleague to check the Q&A issue you posted. In the meantime, you can open a [GitHub issue](https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues), which is monitored frequently. – Yulin Li May 30 '23 at 15:17
