
I have prepared a Speech to Text system using Kaldi and vosk.

I've set up the directories and I am using vosk to transcribe audio files.

The pipeline is: I run a bash script that takes an audio file name (without extension) and breaks the audio into chunks, which are saved in a folder in the same directory. It then runs the transcription script (the vosk API with a Kaldi model) on each chunk, writes each chunk's transcript to a text file, and finally concatenates all the chunk transcripts into one text file.

The bash code is as follows (usage: $0 <audiofilename-without-extension>):

#!/bin/bash

af=$1
afe="${af}.wav"

python3 chunker.py "$af"

for file in "${af}"/*.wav;
do
    python3 test_ffmpeg.py "$file" >> "${file}.txt"
done

for f in "${af}"/*.txt;
do
    echo -e "$(cat "$f")\n" >> "${af}.txt"
done
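One caveat with the concatenation loop: the shell glob expands in lexicographic order, so chunk10.txt would come before chunk2.txt. A small Python sketch that merges the per-chunk transcripts in numeric chunk order instead (assuming the chunkN.wav naming that chunker.py produces; merge_transcripts is an illustrative helper, not part of the original scripts):

```python
import glob
import os
import re
import sys

def chunk_number(path):
    # Extract the numeric part of "chunkN.txt" so chunk2 sorts before chunk10.
    m = re.search(r"chunk(\d+)", os.path.basename(path))
    return int(m.group(1)) if m else 0

def merge_transcripts(folder, out_path):
    # Concatenate per-chunk transcripts in numeric chunk order.
    parts = sorted(glob.glob(os.path.join(folder, "*.txt")), key=chunk_number)
    with open(out_path, "w") as out:
        for p in parts:
            with open(p) as f:
                out.write(f.read().strip() + "\n")

if __name__ == "__main__" and len(sys.argv) > 1:
    merge_transcripts(sys.argv[1], sys.argv[1] + ".txt")
```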

The output format I get is this:

{
  "partial" : "assalamualaikum c p l c call karney ka shukria operator 13 baat kar"
}
{
  "partial" : "assalamualaikum c p l c call karney ka shukria operator 13 baat kar"
}
{
  "text" : "assalamualaikum c p l c call karney ka shukria operator 13 baat kar"
}

What I want in my output is only the value of the final {"text": ""} entry, without the surrounding {"text": ""} wrapper. Can anyone guide me on how to achieve this?
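A minimal sketch of one way to do this, parsing the stream of pretty-printed JSON objects shown above with `json.JSONDecoder.raw_decode` and keeping only the non-empty final results (the helper name `final_texts` is illustrative, not from the original scripts):

```python
import json

def final_texts(recognizer_output):
    """Keep only the final "text" values from the recognizer's raw stdout.

    The input is a stream of concatenated, pretty-printed JSON objects,
    mixing {"partial": ...} updates with {"text": ...} final results.
    """
    texts = []
    decoder = json.JSONDecoder()
    s = recognizer_output.strip()
    while s:
        obj, end = decoder.raw_decode(s)
        # Skip partials and empty final results.
        if "text" in obj and obj["text"]:
            texts.append(obj["text"])
        s = s[end:].lstrip()
    return texts
```

Alternatively, the `print` calls in test_ffmpeg.py could be changed to print `json.loads(rec.Result())["text"]` directly, since the recognizer returns a JSON string.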

The other scripts mentioned in the bash file are as follows:

test_ffmpeg.py is from the vosk-api example scripts:

#!/usr/bin/env python3

from vosk import Model, KaldiRecognizer, SetLogLevel
import sys
import os
import wave
import subprocess
import srt
import json
import datetime


SetLogLevel(0)

sample_rate=16000
model = Model("..")
rec = KaldiRecognizer(model, sample_rate)

process = subprocess.Popen(['ffmpeg', '-loglevel', 'quiet', '-i',
                        sys.argv[1],
                        '-ar', str(sample_rate) , '-ac', '1', '-f', 's16le', '-'],
                        stdout=subprocess.PIPE)

while True:
    data = process.stdout.read(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())
    else:
        print(rec.PartialResult())

print(rec.FinalResult())

chunker.py takes the $1 audio file name and breaks it into chunks saved in a folder named after $1. So if the wav file name is call21.wav, it creates a folder called call21 and saves the chunk files as chunk1.wav, chunk2.wav, and so on:

import speech_recognition as sr 
import os 
import pyaudio
from pydub import AudioSegment
from pydub.silence import split_on_silence
from vosk import Model, KaldiRecognizer, SetLogLevel
import wave
import sys
import subprocess
 
fname =  sys.argv[1]  #enter name without extension
wav = ".wav"
txt = ".txt"
transcript = fname + txt
audiofilename = fname + wav
sample_rate=16000
SetLogLevel(-1)
path = audiofilename
#recognizer.SetWords(True)
#recognizer.SetPartialWords(True)

# open the audio file using pydub
sound = AudioSegment.from_wav(path)  
# split audio sound where silence is 700 miliseconds or more and get chunks
chunks = split_on_silence(sound,
    # experiment with this value for your target audio file
    min_silence_len = 1000,
    # adjust this per requirement
    silence_thresh = sound.dBFS-16,
    # keep the silence for 1 second, adjustable as well
    keep_silence=2000,
)
folder_name = fname
# create a directory to store the audio chunks
if not os.path.isdir(folder_name):
    os.mkdir(folder_name)
whole_text = ""
# process each chunk 
for i, audio_chunk in enumerate(chunks, start=1):
    # export audio chunk and save it in
    # the `folder_name` directory.
    chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
    audio_chunk.export(chunk_filename, format="wav")

if __name__ == '__main__':
    path = audiofilename
    #path = sys.argv[1]


 
SAGE

1 Answer

Please consider sttcast, or chunks of its code. It splits the audio into fragments of s seconds and uses multiprocessing to take advantage of multicore platforms. Partial results are saved to HTML files that are merged into one HTML file at the end of the run. Words are highlighted according to the confidence of the transcription (as given by the vosk API).
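The fragment-parallel structure described above can be sketched as follows (a simplified illustration, not sttcast's actual code; `transcribe_fragment` is a stand-in for the real vosk call):

```python
from multiprocessing import Pool

def fragment_bounds(duration_s, fragment_s):
    # Split [0, duration_s) into consecutive fragments of fragment_s seconds.
    return [(start, min(start + fragment_s, duration_s))
            for start in range(0, duration_s, fragment_s)]

def transcribe_fragment(bounds):
    # Stand-in worker: the real version would run vosk on audio[start:end]
    # and return an HTML snippet with confidence-highlighted words.
    start, end = bounds
    return f"<p>fragment {start}-{end}</p>"

def transcribe_parallel(duration_s, fragment_s, workers=4):
    # Fan the fragments out across worker processes, then integrate the
    # partial results into one document, as sttcast does.
    with Pool(workers) as pool:
        snippets = pool.map(transcribe_fragment,
                            fragment_bounds(duration_s, fragment_s))
    return "\n".join(snippets)
```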

On my old Linux PC (6 cores) I can transcribe 160 minutes of podcast in about 17 minutes. You may see a transcription of an episode of a Spanish podcast.

Screenshot of the result file

J.M. Robles