25s Latency in Google Speech to Text

Question

This is a problem I ran into using the Google Speech to Text engine. I am currently streaming 16-bit, 16 kHz audio in real time in 32 kB chunks, but there is an average 25-second latency between sending audio and receiving transcripts, which defeats the purpose of real-time transcription.

Why is there such high latency?

There's a 24 hour wait time before you can accept your own answer. — Jacob Stern, Jul 27 '18 at 14:25

score 7 · Accepted Answer · answered Jul 26 '18 at 18:41

The Google Speech to Text documentation recommends using a 100 ms frame size to minimize latency.

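At 16 bits per sample and 16 kHz, a 32 kB chunk holds a full second of audio, ten times the recommended frame size:

32 kB * (8 bits / 1 byte) * (1 sample / 16 bits) * (1 sec / 16000 samples) = 1 sec.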

So try sending 3.2 kB chunks (100 ms of audio) instead. That dropped average latency from 25 s to ~4 s.
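As a rough check of the arithmetic above, here is a minimal Python sketch that computes how much audio a chunk holds and slices a raw PCM buffer into smaller chunks. The function names are made up for illustration and are not part of any Google client library. It assumes mono audio and kB = 1000 bytes, which is what the formula above implies.

```python
def chunk_duration_s(chunk_bytes, sample_rate_hz=16000, bits_per_sample=16, channels=1):
    """Return the seconds of raw PCM audio contained in chunk_bytes bytes."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return chunk_bytes / bytes_per_second


def split_pcm(pcm, chunk_bytes):
    """Yield consecutive slices of at most chunk_bytes bytes from a PCM buffer."""
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# Assumption: kB means 1000 bytes, mono, 16-bit, 16 kHz (the setup in the question).
print(chunk_duration_s(32000))  # 1.0 second per chunk
print(chunk_duration_s(3200))   # 0.1 second per chunk

# One second of silence, re-sliced into 100 ms pieces before sending.
one_second = bytes(32000)
pieces = list(split_pcm(one_second, 3200))
print(len(pieces))  # 10
```

Each slice from `split_pcm` would then be sent as its own streaming request, in place of the original 32 kB chunk.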

Jacob Stern

Can you elaborate a bit more on how the formula works? I am facing the same issue but for a 44100 sampling rate. – Jash Shah Oct 23 '20 at 13:07
Use the same formula, but change the last term to 1/44100. Then do the algebra to solve for how many kB would give you 100 ms on the right side. – Jacob Stern Oct 23 '20 at 14:57
Thanks! Are these values in kilobytes or kilobits? – Jash Shah Oct 23 '20 at 16:43
kB = kilobytes. Is that what you are asking? – Jacob Stern Oct 23 '20 at 16:45
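To make the 44100 Hz case from the comments concrete, here is a small sketch that solves the formula for the chunk size in bytes. The helper name is hypothetical, and it assumes 16-bit mono audio, since the comment does not state the bit depth or channel count.

```python
def bytes_for_duration(frame_ms, sample_rate_hz, bits_per_sample=16, channels=1):
    """Return the number of bytes of raw PCM that span frame_ms milliseconds."""
    bytes_per_frame = (bits_per_sample // 8) * channels
    frames = sample_rate_hz * frame_ms // 1000  # whole sample frames only
    return frames * bytes_per_frame


# Assumption: 16-bit mono. For stereo, pass channels=2.
print(bytes_for_duration(100, 44100))  # 8820 bytes, about 8.8 kB
print(bytes_for_duration(100, 16000))  # 3200 bytes, the 3.2 kB from the answer
```

Rounding down to whole sample frames keeps every chunk aligned on a sample boundary.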
