
I am using a microphone that records sound in the browser, converts it into a file and sends the file to a Java server. My Java server then sends the file to the Cloud Speech API and returns the transcription. The problem is that the transcription takes very long (around 3.7 s for 2 s of dialog).

So I would like to speed up the transcription. The first thing to do would be to stream the data, so that the transcription starts at the beginning of the recording. The problem is that I don't really understand the API. For instance, to transcribe my audio stream at the source (browser/microphone) I would need some kind of JS API, but I can't find anything I can use in a browser (we can't use Node like this, can we?).

Otherwise I need to stream my data from my JS to my Java server (not sure how to do it without breaking the data...) and then push it through streamingRecognizeFile from here: https://github.com/GoogleCloudPlatform/java-docs-samples/blob/master/speech/cloud-client/src/main/java/com/example/speech/Recognize.java
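For context, on the Java side I picture receiving the bytes with something like this (an untested sketch; the servlet path and buffer size are arbitrary, and the forwarding step is exactly the part I don't know how to do):

```java
import java.io.IOException;
import java.io.InputStream;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Receives raw audio bytes posted by the browser and reads them chunk by chunk.
@WebServlet("/audio")
public class AudioUploadServlet extends HttpServlet {
  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    byte[] buffer = new byte[4096];
    try (InputStream in = req.getInputStream()) {
      int read;
      while ((read = in.read(buffer)) != -1) {
        // Each chunk would need to be forwarded to the Speech API from here.
      }
    }
    resp.setStatus(HttpServletResponse.SC_OK);
  }
}
```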

But streamingRecognizeFile takes a file as input, so how am I supposed to use it? I cannot really tell the system whether the recording has finished or not... How will it understand that it has reached the end of the audio?

I would like to create something in my web browser just like the Google demo here: https://cloud.google.com/speech/

I think there is something fundamental I do not understand about how to use the streaming API. If someone could explain a bit how I should proceed with this, it would be awesome.

Thank you.

DeepProblems
  • Have a look at this [example](https://github.com/googleapis/nodejs-speech) Node.js client for Google Cloud Speech. – Yurci Jul 16 '18 at 10:45

1 Answer


Google "Speech-to-Text typically processes audio faster than real-time, processing 30 seconds of audio in 15 seconds on average" [1]. You can use Google APIs Explorer to test exactly how long your each request would take [2].

To speed up transcription, you may try adding recognition metadata to your request [3]. You can provide phrase hints if you know the context of the speech [4], or use enhanced models, which rely on a specialized set of machine-learning models [5]. All of these suggestions should improve accuracy and may also affect transcription speed.
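As an illustration, here is a minimal sketch of such a config using the Java client library (the same one as in the Recognize.java sample you linked); the encoding, sample rate, metadata values, phrase hints and model choice are placeholder assumptions you would adapt to your audio:

```java
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
import com.google.cloud.speech.v1.RecognitionMetadata;
import com.google.cloud.speech.v1.RecognitionMetadata.InteractionType;
import com.google.cloud.speech.v1.RecognitionMetadata.MicrophoneDistance;
import com.google.cloud.speech.v1.RecognitionMetadata.RecordingDeviceType;
import com.google.cloud.speech.v1.SpeechContext;

public class ConfigExample {

  // Builds a recognition config combining metadata, phrase hints,
  // and an enhanced model.
  static RecognitionConfig buildConfig() {
    return RecognitionConfig.newBuilder()
        .setEncoding(AudioEncoding.LINEAR16)  // raw 16-bit PCM from the browser
        .setSampleRateHertz(16000)
        .setLanguageCode("en-US")
        // [3] recognition metadata describing how the audio was captured
        .setMetadata(
            RecognitionMetadata.newBuilder()
                .setInteractionType(InteractionType.VOICE_COMMAND)
                .setMicrophoneDistance(MicrophoneDistance.NEARFIELD)
                .setRecordingDeviceType(RecordingDeviceType.PC)
                .build())
        // [4] phrase hints for vocabulary you expect in the dialog
        .addSpeechContexts(
            SpeechContext.newBuilder().addPhrases("transcription").build())
        // [5] enhanced model (here the enhanced phone-call model)
        .setUseEnhanced(true)
        .setModel("phone_call")
        .build();
  }
}
```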

When using streaming recognition, you can set the singleUtterance option to true in the config. This detects when the user pauses speaking and ends the recognition. Otherwise the streaming request will continue until it reaches the content limit, which is 1 minute of audio for a streaming request [6].
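For example, here is a sketch of a streaming call with singleUtterance enabled, again assuming the Java client library; the audio source and chunk size are placeholders. Note that half-closing the request stream (onCompleted()) is also how you tell the API that the recording has ended, which answers your end-of-stream question:

```java
import com.google.api.gax.rpc.ApiStreamObserver;
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
import com.google.cloud.speech.v1.SpeechClient;
import com.google.cloud.speech.v1.StreamingRecognitionConfig;
import com.google.cloud.speech.v1.StreamingRecognizeRequest;
import com.google.cloud.speech.v1.StreamingRecognizeResponse;
import com.google.protobuf.ByteString;
import java.io.InputStream;

public class StreamingExample {

  public static void streamAudio(InputStream audio) throws Exception {
    try (SpeechClient speech = SpeechClient.create()) {
      ApiStreamObserver<StreamingRecognizeResponse> responseObserver =
          new ApiStreamObserver<StreamingRecognizeResponse>() {
            @Override
            public void onNext(StreamingRecognizeResponse response) {
              // Interim and final results arrive here as audio is processed.
              System.out.println(response);
            }

            @Override
            public void onError(Throwable t) {
              t.printStackTrace();
            }

            @Override
            public void onCompleted() {
              System.out.println("stream finished");
            }
          };

      ApiStreamObserver<StreamingRecognizeRequest> requestObserver =
          speech.streamingRecognizeCallable().bidiStreamingCall(responseObserver);

      StreamingRecognitionConfig config =
          StreamingRecognitionConfig.newBuilder()
              .setConfig(
                  RecognitionConfig.newBuilder()
                      .setEncoding(AudioEncoding.LINEAR16)
                      .setSampleRateHertz(16000)
                      .setLanguageCode("en-US")
                      .build())
              .setSingleUtterance(true)  // [6] stop when the speaker pauses
              .setInterimResults(true)   // receive partial transcripts early
              .build();

      // The first request must carry only the config; audio follows after it.
      requestObserver.onNext(
          StreamingRecognizeRequest.newBuilder().setStreamingConfig(config).build());

      byte[] buffer = new byte[3200];  // ~100 ms of 16 kHz 16-bit mono audio
      int read;
      while ((read = audio.read(buffer)) != -1) {
        requestObserver.onNext(
            StreamingRecognizeRequest.newBuilder()
                .setAudioContent(ByteString.copyFrom(buffer, 0, read))
                .build());
      }

      // Half-closing the request stream signals the end of the audio.
      // In real code, wait for the response observer to finish before
      // closing the client.
      requestObserver.onCompleted();
    }
  }
}
```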

Yurci