
I am synthesising speech using the Google Cloud APIs. I have the following information about the speech synthesis response.

  • Sample rate: 8000 Hz
  • Audio format: MP3
  • Length of the byte array

The response from the API is a byte array. Given this information, how could I approximate or accurately compute the length of the synthesised audio?

Vino
  • I'm not an expert in this field, but if the sample rate is 8000 Hz, I believe that means 8000 samples of the audio have been taken per second. Therefore, you should be able to divide the length of the byte array by 8000 to calculate the length of the audio clip **in seconds**. – Jacob G. Apr 12 '19 at 13:38
  • @JacobG. That would be incorrect for MP3. – Brad Apr 12 '19 at 23:36

1 Answer


You don't have enough information to compute the duration of the audio.

MP3 is a lossy codec, and can operate at a number of different bitrates. In fact, that bitrate can change throughout the file. Making things worse, MP3 doesn't have any inherent timestamping in its usual format. The only real way to accurately know its length is to decode it.

Alternatively, if you know the bitrate, you can divide the file size (in bits) by the bitrate (in bits per second) to get an approximate length. If you can assume a constant bitrate throughout the file, you can get the bitrate by reading the header of the first frame. See also: http://mpgedit.org/mpgedit/mpeg_format/mpeghdr.htm
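As a rough illustration of the constant-bitrate estimate, here is a minimal Python sketch. The function name `approx_duration_seconds` and the example numbers are made up for illustration; the result is only meaningful if the stream really is constant bitrate and carries no large tags or other non-audio data.

```python
def approx_duration_seconds(num_bytes: int, bitrate_bps: int) -> float:
    """Estimate the duration of a constant-bitrate stream from its size.

    num_bytes   -- length of the MP3 byte array
    bitrate_bps -- bitrate in bits per second (e.g. 32000 for 32 kbit/s)
    """
    if bitrate_bps <= 0:
        raise ValueError("bitrate must be positive")
    # Size in bits divided by bits per second gives seconds.
    return num_bytes * 8 / bitrate_bps


# Hypothetical example: 40,000 bytes at 32 kbit/s.
print(approx_duration_seconds(40_000, 32_000))  # 10.0
```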

Brad
  • I already have the decoded MP3 audio as the byte string. How do I go about computing the length from there? I am using Google Cloud API in gRPC. This already returns my audio clip in decoded format according to the doc. https://cloud.google.com/text-to-speech/docs/base64-decoding Thank you. – Vino Apr 13 '19 at 03:58
  • @Vino Is it in MP3 or isn't it? You said in your question that it's in MP3, which must be decoded (at least via frame header inspection) to get the duration. What you just linked to is base64 which has nothing to do with MP3... base64 is just a method for shoving binary data into a text context. You need to decode the MP3 in the next step. – Brad Apr 13 '19 at 04:06
  • No, what I am saying is that my data is already in binary form because I am using gRPC to get the synthesised audio. I am using this API, and in the `AudioConfig` I can set the format as MP3. The output I get is a byte array. So according to the MP3 spec, if I read the first 4 bytes, they will hold the MP3 header, which may contain the info I need – Vino Apr 13 '19 at 04:15
  • @Vino Great, now decode that byte stream, which is in MP3 format, to PCM. Or at least synchronize to the frame header (11 bits of `11111111111`), count the number of frames, and multiply by `1152` samples, then divide by the sample rate. And no, as I said in my answer, the MP3 header *doesn't* have what you need. There is no header which has the full duration of the file. It's just MPEG frames. That's why you have to parse through the whole file. – Brad Apr 13 '19 at 04:18
  • Thanks for your response. From what I understand, you want me to search the array for 11 consecutive 1 bits. This is my header. From this, how do I calculate the number of frames? In the MP3 spec, they advise me to use this equation `144 * BitRate / SampleRate + Padding` but I am not aware of the bit rate in this. Please advise how to do the *count the frames part*. Thank you very much. – Vino Apr 13 '19 at 04:43
  • @Vino Each MP3 frame stands alone (with the bit reservoir being an exception). They don't inherently link to each other. You have to write code to figure out where all the frame headers are, to determine how many frames there are, to determine how many PCM frames there are, to determine a duration based on the known sample rate. – Brad Apr 13 '19 at 07:23
  • I have tried searching the array for the MP3 headers as you advised, but to no avail. I have created a new question with some code. Would you mind having a look? Thank you. – Vino Apr 13 '19 at 10:10
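The frame-counting approach discussed in the comments can be sketched roughly as follows in Python. This is not code from the thread: the names `parse_header` and `mp3_duration_seconds` are hypothetical, and the sketch assumes Layer III only, no free-format frames, and a stream with no tag data that happens to look like a frame header. The bitrate for each frame comes from the 4-bit index in that frame's own header, which is what makes the frame-length formula usable. Note that the samples-per-frame value depends on the MPEG version: 1152 for MPEG-1 Layer III, but 576 for MPEG-2 and MPEG-2.5 Layer III, and an 8000 Hz sample rate only exists in MPEG-2.5.

```python
from typing import Optional, Tuple

# Layer III bitrates in kbit/s, indexed by the header's 4-bit bitrate index.
# Index 0 (free format) and 15 (invalid) are rejected by this sketch.
BITRATES_V1_L3 = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
BITRATES_V2_L3 = [0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160]

# Sample rates in Hz, keyed by the header's 2-bit version ID.
SAMPLE_RATES = {
    0b11: [44100, 48000, 32000],  # MPEG-1
    0b10: [22050, 24000, 16000],  # MPEG-2
    0b00: [11025, 12000, 8000],   # MPEG-2.5
}


def parse_header(data: bytes, pos: int) -> Optional[Tuple[int, int, int]]:
    """Return (frame_length_bytes, samples_per_frame, sample_rate), or None
    if the 4 bytes at `pos` are not a valid Layer III frame header."""
    if pos + 4 > len(data):
        return None
    b0, b1, b2 = data[pos], data[pos + 1], data[pos + 2]
    if b0 != 0xFF or (b1 & 0xE0) != 0xE0:  # 11 sync bits, all set
        return None
    version = (b1 >> 3) & 0b11
    layer = (b1 >> 1) & 0b11
    if version not in SAMPLE_RATES or layer != 0b01:  # 0b01 means Layer III
        return None
    bitrate_index = b2 >> 4
    rate_index = (b2 >> 2) & 0b11
    padding = (b2 >> 1) & 0b1
    if bitrate_index in (0, 15) or rate_index == 3:
        return None
    is_v1 = version == 0b11
    bitrate = (BITRATES_V1_L3 if is_v1 else BITRATES_V2_L3)[bitrate_index] * 1000
    sample_rate = SAMPLE_RATES[version][rate_index]
    samples = 1152 if is_v1 else 576
    # 144 * bitrate / sample_rate + padding for MPEG-1; the factor is 72
    # for MPEG-2 and MPEG-2.5.
    frame_length = (samples // 8) * bitrate // sample_rate + padding
    return frame_length, samples, sample_rate


def mp3_duration_seconds(data: bytes) -> float:
    """Walk the byte array frame by frame and sum each frame's duration."""
    pos = 0
    total = 0.0
    while pos + 4 <= len(data):
        header = parse_header(data, pos)
        if header is None:
            pos += 1  # not a frame start; resynchronise byte by byte
            continue
        frame_length, samples, sample_rate = header
        total += samples / sample_rate
        pos += frame_length  # jump straight to the next expected header
    return total
```

Calling `mp3_duration_seconds(audio_bytes)` on the byte array from the API would give an estimate in seconds. A leading Xing/Info frame, if present, is counted like any other frame, and a truncated final frame is counted in full, so the result can be off by a frame or so. Fully decoding to PCM remains the more reliable route.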