0

I want to change the audio encoding from mulaw to linear so that I can use a linear speech recognition model from Google. I'm using a telephony channel, so the audio is encoded as mulaw, 8 bits, 8000 Hz. When I use the Google mulaw model, there are issues with recognizing some short single words: they are not recognized at all and the API returns None. Is it good practice to change the encoding to Linear or FLAC? I have already done it, but I cannot really measure the degree of improvement.

ylvi-bux
  • 37
  • 6
  • Hi @ylvi-bux, if my answer addressed your question, please consider accepting and upvoting it. If not, let me know so that I can improve my answer. – Shipra Sarkar Jan 06 '22 at 04:38

2 Answers

2

It is always best practice to use either LINEAR16 for headerless audio data or FLAC for headered audio data; both are lossless codecs. It is good practice to set the sampling rate to 16000 Hz; otherwise, set sample_rate_hertz to match the native sample rate of the audio source instead of re-sampling. The Google Speech-to-Text API provides various ways to improve recognition, and you can use word-level confidence to measure the accuracy of a response.
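As a rough way to put a number on the improvement, the sketch below builds a REST-style request body with word-level confidence enabled and averages the per-word scores from a response. It only constructs and reads plain dictionaries and does not call the API; the language code `en-US`, the helper names, and the 8000 Hz default are assumptions made for illustration.

```python
def build_request(audio_b64, sample_rate_hertz=8000):
    """Build a recognize request body (hypothetical helper, REST JSON field names)."""
    return {
        "config": {
            "encoding": "LINEAR16",
            # Match the native rate of the source instead of re-sampling.
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": "en-US",  # assumption: adjust to your audio
            "enableWordConfidence": True,
        },
        "audio": {"content": audio_b64},
    }


def mean_word_confidence(response):
    """Average the word confidences of the top alternative of each result.

    Returns None when the response carries no word confidences,
    for example when nothing was recognized.
    """
    scores = [
        word["confidence"]
        for result in response.get("results", [])
        for alternative in result.get("alternatives", [])[:1]
        for word in alternative.get("words", [])
        if "confidence" in word
    ]
    return sum(scores) / len(scores) if scores else None
```

Running the same recordings through two configurations and comparing the averages (and the number of empty responses) gives a simple, if coarse, comparison.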

Shipra Sarkar
  • 1,385
  • 3
  • 10
0

Ideally the audio would be recorded from the start with a lossless codec like LINEAR16 or FLAC. But once you have it in a format like mulaw, transcoding it before sending it to Google Speech-to-Text is not helpful.
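To illustrate why transcoding adds nothing, here is a minimal sketch of the standard G.711 mu-law expansion in pure Python. Each 8-bit mulaw byte maps to one fixed 16-bit value, so the decoded stream can never contain more than 256 distinct sample values; the function names are my own.

```python
import struct

BIAS = 0x84  # bias used by the G.711 mu-law companding scheme


def mulaw_byte_to_linear(code):
    """Expand one 8-bit mu-law code to a signed 16-bit PCM sample."""
    code = ~code & 0xFF          # mu-law bytes are stored bit-inverted
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if code & 0x80 else magnitude


def mulaw_to_linear16(data):
    """Decode mu-law bytes into little-endian 16-bit PCM bytes."""
    samples = [mulaw_byte_to_linear(b) for b in data]
    return struct.pack("<%dh" % len(samples), *samples)
```

The output is twice as large as the input but carries exactly the same information, which is consistent with the point that the conversion does not help recognition.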

Consider using model=phone_call and use_enhanced=true for better telephony quality. For quick experimentation you can use the STT UI: https://cloud.google.com/speech-to-text/docs/ui-overview.
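A minimal sketch of what that configuration might look like as a REST-style request config, keeping the original mulaw audio as-is. The helper name and the `en-US` language code are assumptions; enhanced models may not be available for every language.

```python
def build_phone_call_config(sample_rate_hertz=8000):
    """Build a recognition config for telephony audio (hypothetical helper)."""
    return {
        "encoding": "MULAW",               # send the telephony audio untranscoded
        "sampleRateHertz": sample_rate_hertz,
        "languageCode": "en-US",           # assumption: adjust to your audio
        "model": "phone_call",
        "useEnhanced": True,
    }
```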

cherba
  • 8,681
  • 3
  • 27
  • 34