Time offsets in speech to text - Google Speech-to-Text API

Question

I'm trying to get the Google Speech-to-Text API to give me the precise onset and offset of voice in an audio file, but the offsets it returns are off.

I'm using the Python script below for this.

These are the time offsets I get: start_time: 0.4, end_time: 1.4

The start time should be more like 0.6 seconds.

The entire file is 3 seconds long. I want precision at the millisecond level. Is there a way to fix this, or is Google WaveNet just not precise enough to find the onset down to the millisecond?

```python
from google.cloud import speech


def transcribe_file(speech_file):
    """Transcribe the given audio file synchronously and print word time offsets."""
    client = speech.SpeechClient()

    with open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    # Note that synchronous transcription is limited to 60 seconds of audio.
    # Use a GCS file for audio longer than 1 minute.
    audio = speech.RecognitionAudio(content=content)

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=44100,
        language_code="de-DE",
        enable_word_time_offsets=True,
        model="latest_short",
    )

    response = client.recognize(config=config, audio=audio)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        alternative = result.alternatives[0]
        print("Transcript: {}".format(alternative.transcript))
        print("Confidence: {}".format(alternative.confidence))

        for word_info in alternative.words:
            # Assumption: google-cloud-speech 2.x, where the offsets are
            # datetime.timedelta objects (no .nanos attribute), so
            # total_seconds() is used to get seconds as a float.
            start_s = word_info.start_time.total_seconds()
            end_s = word_info.end_time.total_seconds()
            print("Word: {}, start_time: {}, end_time: {}".format(
                word_info.word, start_s, end_s))


transcribe_file("file.wav")
```


I followed the Google Cloud documentation, hoping to get a very precise start time. I used the model that is trained on short audio that's a few seconds long (model="latest_short"), but I still get a very imprecise onset time.
 
The start time is off by 0.2 seconds in an audio file that is 3 seconds long, which is not precise or accurate enough. 

How can I get more precision?
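
One way to cross-check the reported offsets independently of the API is to estimate the onset directly from the waveform with a simple energy threshold. The following is only a minimal sketch and not part of the script above: it assumes a 16-bit mono PCM WAV file, and the function name `estimate_onset_seconds`, the 10 ms frame size, and the 10% threshold are all hypothetical choices for illustration.

```python
import array
import math
import sys
import wave


def estimate_onset_seconds(path, frame_ms=10, threshold_ratio=0.1):
    """Return the start time (in seconds) of the first frame whose RMS energy
    reaches threshold_ratio times the loudest frame's RMS, or None if silent.

    Hypothetical helper: assumes 16-bit mono PCM WAV input.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch assumes 16-bit mono PCM audio")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))

    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        samples.byteswap()

    # Split the signal into non-overlapping frames and compute RMS per frame.
    frame_len = max(1, rate * frame_ms // 1000)
    rms = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(s * s for s in frame) / frame_len))

    if not rms or max(rms) == 0:
        return None

    # The onset is taken as the first frame at or above the relative threshold.
    threshold = threshold_ratio * max(rms)
    for index, value in enumerate(rms):
        if value >= threshold:
            return index * frame_len / rate
    return None
```

Calling `estimate_onset_seconds("file.wav")` would give a reference value to compare with the `start_time` from the API. Its resolution is limited by the frame size, and the threshold would likely need tuning for noisy recordings.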

Can you provide the audio file which you are using? – Prajna Rai T, Sep 12 '22 at 11:00
