
I'm using Google's nodejs-speech package to call the longRunningRecognize endpoint/function in Google's Speech API.

I've used both v1 and v1p1beta1, and run into an error with longer files: 48 minutes is the longest I've tried, 15 minutes causes the same problem, and 3 minutes does not. I've tried both chaining promises and splitting the request into two parts: one to start the longRunningRecognize process, and another to check on the results after waiting. The error is shown below the code samples for both.

Example promise version of request:

import speech from '@google-cloud/speech';

const client = new speech.v1p1beta1.SpeechClient();

const audio = {
  uri: 'gs://my-bucket/file.m4a'
};

const config = {
  encoding: 'AMR_WB',
  sampleRateHertz: 16000,
  languageCode: 'en-US',
  enableWordTimeOffsets: true,
  enableSpeakerDiarization: true
};

const request = {
  audio,
  config
};

client.longRunningRecognize(request)
  .then(data => {
    const operation = data[0];
    return operation.promise();
  })
  .then(data => {
    const response = data[0];
    const results = response.results;
    const transcription = results
      .filter(result => result.alternatives)
      .map(result => result.alternatives[0].transcript)
      .join('\n');
    console.log(transcription);
  })
  .catch(error => {
    console.error(error);
  });

(I've since closed the tab with the results, but I think this returned an error object that just said { error: { code: 13 } }, which matches the more descriptive error below.)

Separately, I've tried a version where instead of chaining promises to get the final transcription result, I collect the name from the operation, and make a separate request to get the result.

Here's that request code:

... // Skipping setup
client.longRunningRecognize(request)
  .then(data => {
    const operation = data[0];
    console.log(operation.latestResponse.name);
  })
  .catch(error => {
    console.error(error);
  });

When I hit the relevant endpoint (https://speech.googleapis.com/v1p1beta1/operations/81703347042341321989?key=ABCD12345) before it's had time to process, I get this:

{
    "name": "81703347042341321989",
    "metadata": {
        "@type": "type.googleapis.com/google.cloud.speech.v1p1beta1.LongRunningRecognizeMetadata",
        "startTime": "2018-08-16T19:33:26.166942Z",
        "lastUpdateTime": "2018-08-16T19:41:31.456861Z"
    }
}

Once it's fully processed, though, I've been running into this:

{
    "name": "81703347042341321989",
    "metadata": {
        "@type": "type.googleapis.com/google.cloud.speech.v1p1beta1.LongRunningRecognizeMetadata",
        "progressPercent": 100,
        "startTime": "2018-08-16T17:20:28.772208Z",
        "lastUpdateTime": "2018-08-16T17:44:40.868144Z"
    },
    "done": true,
    "error": {
        "code": 13,
        "message": "Server unavailable, please try again later."
    }
}
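To make the two responses above easier to tell apart in code, here is a minimal sketch of a helper that classifies the JSON returned by the operations endpoint. The function name describeOperation and the small code table are hypothetical, not part of @google-cloud/speech; the sketch assumes a finished operation carries either an error object or a response object with a results array, as in the standard long-running operation shape.

```javascript
// A few status codes from google.rpc.Code. Note that 13 is INTERNAL,
// while 14 is UNAVAILABLE, despite the "Server unavailable" message above.
const RPC_CODE_NAMES = { 3: 'INVALID_ARGUMENT', 13: 'INTERNAL', 14: 'UNAVAILABLE' };

// Hypothetical helper: classify an operation object as pending, failed, or succeeded.
function describeOperation(op) {
  // While the job is still running, "done" is absent or false.
  if (!op.done) {
    const percent = (op.metadata && op.metadata.progressPercent) || 0;
    return { state: 'pending', progressPercent: percent };
  }
  // A finished operation with an "error" field has failed.
  if (op.error) {
    const { code, message } = op.error;
    const codeName = RPC_CODE_NAMES[code] || 'NOT_IN_THIS_SKETCH';
    return { state: 'failed', code, codeName, message };
  }
  // Otherwise, join the top alternative of each result, as in the question.
  const results = (op.response && op.response.results) || [];
  const transcript = results
    .filter(result => result.alternatives && result.alternatives.length > 0)
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  return { state: 'succeeded', transcript };
}

// Usage with the failed response shown above.
const failedOperation = {
  name: '81703347042341321989',
  done: true,
  error: { code: 13, message: 'Server unavailable, please try again later.' }
};
console.log(describeOperation(failedOperation));
```

Checking done before error keeps a still-running job from being reported as a success with an empty transcript.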

I've tried shorter audio files (3 minutes, same format and encoding), and both approaches above worked.

Any idea what's going on?

Sasha
  • This kind of error can happen if there are long periods of silence at the beginning of the audio. Is that the case for your audio files? In those cases, changing the encoding of your files to FLAC sometimes helps. – Héctor Neri Aug 16 '18 at 20:47
  • Most of these don't have long silences, but I can try Flac and report back. – Sasha Aug 16 '18 at 20:53
  • Flac worked! Thanks! And weird. Feel like the errors could be a lot clearer here. – Sasha Aug 16 '18 at 22:25

1 Answer

A possible workaround is converting the audio to FLAC, which is the recommended encoding for the Cloud Speech-to-Text API because its compression is lossless.

For reference, this can be done using sox, through the following command:

sox file.m4a --rate 16k --bits 16 --channels 1 file.flac
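After converting, the request config also needs to describe the new file. The following is a minimal sketch that assumes the converted file has been uploaded to the same bucket as gs://my-bucket/file.flac (a placeholder path); buildFlacRequest is a hypothetical helper, and the remaining options are copied from the question.

```javascript
// Hypothetical helper: build a longRunningRecognize request for a FLAC file
// produced by the sox command above (16 kHz, 16-bit, mono).
function buildFlacRequest(gcsUri) {
  return {
    audio: { uri: gcsUri },
    config: {
      encoding: 'FLAC',        // replaces 'AMR_WB' from the question
      sampleRateHertz: 16000,  // matches --rate 16k
      languageCode: 'en-US',
      enableWordTimeOffsets: true,
      enableSpeakerDiarization: true
    }
  };
}

// Assumed upload location; substitute your own bucket and object name.
const request = buildFlacRequest('gs://my-bucket/file.flac');
console.log(JSON.stringify(request, null, 2));
```

The resulting object can be passed to client.longRunningRecognize(request) exactly as in the question.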

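Additionally, this error may also happen when there is a long period of silence at the beginning. In this case, the audio can be trimmed by adding the trim effect after the output file. Its first argument is the number of seconds to skip at the beginning. A second argument prefixed with a minus sign is measured back from the end of the file; without the minus sign, it is the length of audio to keep. For example, to skip the first 20 seconds and drop the last 5:

sox input.m4a --rate 16k --bits 16 --channels 1 output.flac trim 20 -5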
Héctor Neri