
I have finished training Mozilla's DeepSpeech on Common Voice data, and I can now get output for a single .wav audio file. Below is the command I am using.

(deepspeech-venv) megha@megha-medion:~/Alu_Meg/DeepSpeech_Alug_Meg/DeepSpeech$ ./deepspeech my_exportdir/model.pb/output_graph.pb models/alphabet.txt myAudio_for_testing.wav

Here, myAudio_for_testing.wav is the audio file I am using, and it gives me the output below.

TensorFlow: v1.6.0-9-g236f83e
DeepSpeech: v0.1.1-44-gd68fde8
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-06-29 14:51:35.832686: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
heritor teay we decide the lunch ha annral limined eddition of y ye com im standmat

Here are my questions:

1) The last line above (the transcription) is the output for my audio. How can I save this to a file?

2) I have around 2000 audio files like this. How can I read them one by one and get the output for each? I tried to write a Python script to read all my .wav audio files, but since DeepSpeech uses resources that are kept in a virtual environment, I am not sure how to run my deepspeech command from inside the script. Can you give me some hints on how to proceed? It would be a great help.

Thank you :)

Megha

3 Answers


I found a solution for my first question: we can just redirect the output to a file, as below.

(deepspeech-venv) megha@megha-medion:~/Alu_Meg/DeepSpeech_Alug_Meg/DeepSpeech$ ./deepspeech my_exportdir/model.pb/output_graph.pb models/alphabet.txt myAudio_for_testing.wav > output_test.csv
TensorFlow: v1.6.0-9-g236f83e
DeepSpeech: v0.1.1-44-gd68fde8
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-06-29 15:22:50.275833: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

I just added > output_test.csv after my command.

But I still could not figure out my second question.
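What I had been trying for that is a Python script that runs the same deepspeech command once per file. Below is only an untested sketch of that idea: the audio folder and the output file name are placeholders, and it assumes the script is started from inside the activated virtual environment so that deepspeech finds its libraries exactly as on the command line (the version banners go to stderr, so stdout is just the transcript).

import os
import subprocess

# All paths below are placeholders for my own setup.
DEEPSPEECH_DIR = os.path.expanduser('~/Alu_Meg/DeepSpeech_Alug_Meg/DeepSpeech')
MODEL = 'my_exportdir/model.pb/output_graph.pb'
ALPHABET = 'models/alphabet.txt'
AUDIO_DIR = os.path.expanduser('~/my_2000_wavs')   # folder with the ~2000 .wav files

with open('all_transcripts.csv', 'w') as out:
    for name in sorted(os.listdir(AUDIO_DIR)):
        if not name.endswith('.wav'):
            continue
        wav_path = os.path.join(AUDIO_DIR, name)
        # stdout of the deepspeech command is the transcript itself
        result = subprocess.run(['./deepspeech', MODEL, ALPHABET, wav_path],
                                cwd=DEEPSPEECH_DIR,
                                stdout=subprocess.PIPE,
                                universal_newlines=True)
        out.write('%s,%s\n' % (name, result.stdout.strip()))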

megha

For my second question, I added an extra section to DeepSpeech's client.py that loops over a number of files and saves each transcript to a CSV file, with the corresponding file name as an index value.

# Added to DeepSpeech's client.py, after the model (ds) has been loaded.
# my_CSV_file.csv holds one row per audio file; the second column is the
# .wav file name and the transcript is written into the third column.
r = csv.reader(open('my_CSV_file.csv'))
lines = list(r)
pathToAudio = args.audio  # sys.argv[3]
audio_files = os.listdir(pathToAudio)
for i in range(1, len(lines)):
    for eachfile in audio_files:
        if eachfile.endswith(".wav") and eachfile == lines[i][1]:
            file_Path = pathToAudio + "/" + eachfile
            print("File to be read is ", file_Path)
            fs, audio = wav.read(file_Path)
            audio_length = len(audio) * (1 / 16000)
            assert fs == 16000, "Only 16000Hz input WAV files are supported for now!"
            print('Running inference.', file=sys.stderr)
            inference_start = timer()
            output = ds.stt(audio, fs)
            lines[i][2] = output
            # write the updated table back out after every file
            writer = csv.writer(open('my_CSV_file.csv', 'w'))
            writer.writerows(lines)
            print(output)
            inference_end = timer() - inference_start
            print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)
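For reference, the loop above expects my_CSV_file.csv to have a first row that is skipped (the loop starts at row 1), the .wav file name in its second column, and a third column that receives the transcript. The column names here are just an example:

id,audio_file,transcript
1,sample_0001.wav,
2,sample_0002.wav,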
megha

I don't know if it is too late to respond to your question, but I will leave my response here in case others have the same or a similar issue.

On the Mozilla/DeepSpeech GitHub page, they share a script called transcribe.py. In this script there is a function called transcribe_many(src_paths, dst_paths). Basically, this function takes a list of audio file locations (src_paths), loads them into RAM, and then runs inference in a multi-processing fashion. The output is written to the corresponding locations in dst_paths.

Here is a preview of the code from that script.

def transcribe_many(src_paths,dst_paths):
    pbar = create_progressbar(prefix='Transcribing files | ', max_value=len(src_paths)).start()
    for i in range(len(src_paths)):
        p = Process(target=transcribe_file, args=(src_paths[i], dst_paths[i]))
        p.start()
        p.join()
        log_progress('Transcribed file {} of {} from "{}" to "{}"'.format(i + 1, len(src_paths), src_paths[i], dst_paths[i]))
        pbar.update(i)
    pbar.finish()
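To use it on a whole folder, you basically build two parallel lists of input and output paths and hand them to the function. This is only a rough sketch: the folder names are examples, it assumes you are working inside transcribe.py (or importing from it) so that transcribe_file and the model setup are already available, and you should check transcribe.py itself for the exact output file convention it expects.

import os

src_dir = 'my_wavs'            # example folder with the input .wav files
dst_dir = 'my_transcripts'     # example folder for the transcript files
os.makedirs(dst_dir, exist_ok=True)

src_paths = [os.path.join(src_dir, f)
             for f in sorted(os.listdir(src_dir)) if f.endswith('.wav')]
# '.tlog' is what the repo script appears to use for its output files
dst_paths = [os.path.join(dst_dir, os.path.splitext(os.path.basename(p))[0] + '.tlog')
             for p in src_paths]

transcribe_many(src_paths, dst_paths)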
erolrecep