
I am currently developing a speech recognition service using the Google Speech API (Python).

The sample I am using right now is a Korean listening-comprehension test mp3 file; it contains nothing but the voice actor's speech, with no other sound.

I currently convert the mp3 file to FLAC, upload it to Google Cloud Storage, and call long_running_recognize, but the accuracy is only about 60% for a 2-minute file.
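Roughly, this is the call I am making (a simplified sketch assuming the google-cloud-speech Python client; the bucket and file names are placeholders):

```python
from google.cloud import speech

client = speech.SpeechClient()

# FLAC file converted from the mp3 and uploaded to Cloud Storage (placeholder URI).
audio = speech.RecognitionAudio(uri="gs://my-bucket/sample.flac")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=44100,  # should match the actual sample rate of the FLAC file
    language_code="ko-KR",
)

# Asynchronous recognition for audio longer than about one minute.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=600)

for result in response.results:
    print(result.alternatives[0].transcript)
```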

I think I chose the most straightforward data possible as a sample, so I would like to know whether the length of the file affects the recognition rate and whether there is anything I can do to improve performance.

  • Ideally you would have the raw source of the audio (if you recorded it yourself), so you could use FLAC directly. Converting mp3 to FLAC just makes a compressed file larger; it doesn't add information. – cgnorthcutt Aug 27 '18 at 22:43

1 Answer


You may not have gotten a response (I see it's been 11 months since you posted) because the confidence score isn't something that is up to you; it is simply Google's way of telling you how confident their model is in its transcript prediction, given your input file. If you want higher confidence, provide "easier to understand" audio files (clear recording, slow, articulated speech, no accent, etc.).
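For reference, the confidence is reported on each result's top alternative in the response. A minimal sketch of reading it (assuming the google-cloud-speech Python client, where `response` is what operation.result() returned from your long_running_recognize call):

```python
def print_confidences(response):
    """Print Google's reported confidence for each transcribed segment.

    `response` is the LongRunningRecognizeResponse returned by
    operation.result() after a long_running_recognize call.
    """
    for result in response.results:
        best = result.alternatives[0]  # highest-probability hypothesis
        print(f"{best.confidence:.2f}  {best.transcript}")
```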

However, there are some things you can do. You should use lossless audio (.flac or .wav) with at least 16 bits per sample and a high sample rate (44100 Hz is a common choice). Importantly, do not perform any background-noise removal on your audio before transcribing. The Google Speech API analyzes the noise and uses it to clean up your file in its own pipeline; by removing the noise yourself, you compromise that pipeline.
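If you do not have the original lossless recording, re-encoding the mp3 will not add information back (as the comment above notes), but a conversion like this at least delivers audio in the recommended format. A rough sketch, assuming ffmpeg is installed and using placeholder file names:

```python
import subprocess

# Re-encode to 16-bit FLAC at 44.1 kHz; no filtering or noise removal is applied.
subprocess.run(
    [
        "ffmpeg",
        "-i", "lesson.mp3",    # placeholder input file
        "-sample_fmt", "s16",  # at least 16 bits per sample
        "-ar", "44100",        # 44.1 kHz sample rate
        "lesson.flac",         # placeholder output file
    ],
    check=True,
)
```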

You can learn more about best practices to improve transcription (and likely the confidence score) here: https://cloud.google.com/speech-to-text/docs/best-practices

cgnorthcutt