I used Google's Speech-to-Text to transcribe an audio file with word timestamps enabled. Up to that point there is no issue.

My problem is that the timestamps are continuous: according to the timestamps (in seconds and nanoseconds) returned by Google, there is no gap between words, but the audio I provided has observable gaps between words. Is there any way to make the timestamps reflect the exact time at which each word is spoken in the audio?

I'm trying to calculate the gaps between words, that is, the silences in between them, and I need to extract the silences based on this. Any help is appreciated.

Example:
Language : Japanese

Word: ここ|ココ
Start time: 59 seconds 900000000 nanos
End time: 60 seconds 100000000 nanos
Word: で|デ
Start time: 60 seconds 100000000 nanos
End time: 60 seconds 200000000 nanos
Word: は|ワ
Start time: 60 seconds 200000000 nanos
End time: 60 seconds 300000000 nanos
Word: アニメーション|アニメーション
Start time: 60 seconds 300000000 nanos
End time: 60 seconds 800000000 nanos
Word: キー|キー
Start time: 60 seconds 800000000 nanos
End time: 61 seconds 300000000 nanos
Word: の|ノ
Start time: 61 seconds 300000000 nanos
End time: 61 seconds 400000000 nanos
Word: 削除|サクジョ
Start time: 61 seconds 400000000 nanos
End time: 61 seconds 800000000 nanos
Word: 切り取り|キリトリ
Start time: 61 seconds 800000000 nanos
End time: 62 seconds 900000000 nanos
Word: 貼り付け|ハリツケ
Start time: 62 seconds 900000000 nanos
End time: 64 seconds 0 nanos
Word: そして|ソシテ
Start time: 64 seconds 0 nanos
End time: 64 seconds 900000000 nanos
Word: その|ソノ
Start time: 64 seconds 900000000 nanos
End time: 65 seconds 100000000 nanos
Word: 他|タ,ホカ
Start time: 65 seconds 100000000 nanos
End time: 65 seconds 300000000 nanos
Word: の|ノ
Start time: 65 seconds 300000000 nanos
End time: 65 seconds 400000000 nanos
Word: コマンド|コマンド
Start time: 65 seconds 400000000 nanos
End time: 65 seconds 700000000 nanos
Word: を|オ
Start time: 65 seconds 700000000 nanos
End time: 65 seconds 900000000 nanos
Word: 使用|シヨー
Start time: 65 seconds 900000000 nanos
End time: 66 seconds 200000000 nanos
Word: する|スル
Start time: 66 seconds 200000000 nanos
End time: 66 seconds 500000000 nanos
Word: こと|コト
Start time: 66 seconds 500000000 nanos
End time: 66 seconds 700000000 nanos
Word: が|ガ
Start time: 66 seconds 700000000 nanos
End time: 66 seconds 800000000 nanos
Word: でき|デキ
Start time: 66 seconds 800000000 nanos
End time: 67 seconds 100000000 nanos
Word: ます|マス
Start time: 67 seconds 100000000 nanos
End time: 67 seconds 300000000 nanos
Word: アニメーション|アニメーション
Start time: 67 seconds 300000000 nanos
End time: 68 seconds 500000000 nanos
Word: キー|キー
Start time: 68 seconds 500000000 nanos
End time: 68 seconds 700000000 nanos
Word: を|オ
Start time: 68 seconds 700000000 nanos
End time: 68 seconds 900000000 nanos
Word: 動かす|ウゴカス
Start time: 68 seconds 900000000 nanos
End time: 69 seconds 0 nanos
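
To make the problem concrete, here is a minimal sketch that computes the gap between consecutive words from the timestamps. It uses the first four words of the example above; the tuple layout and the names `WORDS`, `to_nanos` and `word_gaps` are only illustrative and are not part of the Google API.

```python
NANOS_PER_SECOND = 1_000_000_000

# (word, start_seconds, start_nanos, end_seconds, end_nanos)
# Values copied from the first four words of the example output.
WORDS = [
    ("ここ", 59, 900_000_000, 60, 100_000_000),
    ("で", 60, 100_000_000, 60, 200_000_000),
    ("は", 60, 200_000_000, 60, 300_000_000),
    ("アニメーション", 60, 300_000_000, 60, 800_000_000),
]


def to_nanos(seconds, nanos):
    """Combine the seconds and nanos fields into one integer nanosecond value."""
    return seconds * NANOS_PER_SECOND + nanos


def word_gaps(words):
    """Return (previous_word, next_word, gap_in_seconds) for each adjacent pair."""
    gaps = []
    for prev, nxt in zip(words, words[1:]):
        prev_end = to_nanos(prev[3], prev[4])
        next_start = to_nanos(nxt[1], nxt[2])
        gaps.append((prev[0], nxt[0], (next_start - prev_end) / NANOS_PER_SECOND))
    return gaps


for prev_word, next_word, gap in word_gaps(WORDS):
    print(f"{prev_word} -> {next_word}: {gap:.3f} s")
```

Every gap comes out as zero, because each word's start time equals the previous word's end time, even though the audio has audible pauses.
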
  • I'm using the long-running recognize v1 API. – chitharthan Jun 16 '20 at 14:13
  • Yes, the timestamps from the Google Speech-to-Text API are continuous. If you want to identify silence or split on it, you can use the `auditok` library (https://pypi.org/project/auditok/); it works well. – Narendra Prasath Jun 16 '20 at 14:18
  • I will try auditok. I previously tried the pyAudioAnalysis library, which provides silence removal. Both libraries ask for silence parameters, so we have to provide different parameters for different audio. It would be helpful if there were a library that provides either the word utterance timing or the silence timing, whichever is possible. That is why I looked for a speech-to-text engine that gives the word utterance timing in its timestamp details. – chitharthan Jun 18 '20 at 08:17
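
For reference, the kind of energy-based silence detection discussed in these comments can be sketched in plain Python. This is not the auditok or pyAudioAnalysis API; the function names and the parameters `rel_threshold` and `min_silence_ms` are hypothetical. The threshold is taken relative to the loudest frame of the recording, which is one possible way to reduce per-file tuning, though it does not remove the parameters entirely.

```python
import math


def frame_rms(samples, frame_len):
    """Root-mean-square energy of each full, non-overlapping frame."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def find_silences(samples, sample_rate, frame_ms=10,
                  rel_threshold=0.1, min_silence_ms=100):
    """Return (start_s, end_s) spans whose energy stays below the threshold.

    The threshold is rel_threshold times the loudest frame's RMS, so it
    scales with the overall level of the recording (an assumption of this
    sketch, not something either library is known to do).
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    energies = frame_rms(samples, frame_len)
    if not energies:
        return []
    threshold = rel_threshold * max(energies)

    silences = []
    run_start = None
    # The trailing infinity acts as a sentinel that closes a final silent run.
    for i, energy in enumerate(energies + [float("inf")]):
        if energy < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if (i - run_start) * frame_ms >= min_silence_ms:
                silences.append((run_start * frame_ms / 1000, i * frame_ms / 1000))
            run_start = None
    return silences


# Synthetic signal at 1000 samples/s: 0.2 s loud, 0.3 s silent, 0.2 s loud.
signal = [1000] * 200 + [0] * 300 + [1000] * 200
print(find_silences(signal, sample_rate=1000))
```
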

1 Answer

Google's timestamps are continuous between words, so the gaps between words cannot be measured from them. However, the timestamps are not continuous between two consecutive sentences, so the time gap between sentences can be measured and used to implement silence removal.
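
A minimal sketch of that idea, assuming the words have already been grouped into sentences and their times converted to milliseconds. The grouping, the function name `sentence_gaps`, and the sample timings are hypothetical and chosen only for illustration; they are not taken from the audio in the question.

```python
def sentence_gaps(sentences):
    """Return (gap_start_ms, gap_end_ms, duration_ms) between adjacent sentences.

    `sentences` is a list of sentences, each a list of
    (word_start_ms, word_end_ms) pairs in spoken order.
    """
    gaps = []
    for prev, nxt in zip(sentences, sentences[1:]):
        prev_end = prev[-1][1]   # end time of the last word in the sentence
        next_start = nxt[0][0]   # start time of the first word in the next one
        if next_start > prev_end:
            gaps.append((prev_end, next_start, next_start - prev_end))
    return gaps


# Hypothetical timings: words are contiguous inside a sentence,
# but there is a pause between the two sentences.
sentences = [
    [(59_900, 60_100), (60_100, 60_200)],
    [(61_000, 61_300), (61_300, 61_400)],
]
print(sentence_gaps(sentences))
```

Each returned span is a candidate silence that can be cut from the audio; pauses inside a sentence are still not visible this way.
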