I am currently working with a speech-to-text model that takes a .wav file and turns the audible speech in the audio into a text transcript. The model worked on .wav recordings that were recorded directly. Now I am trying to do the same with audio that was originally part of a video.

The steps are as follows:

  • retrieve a video file from a stream url through ffmpeg
  • strip the .aac audio from the video
  • convert the .aac audio to .wav
  • save the .wav to s3 for later usage

The ffmpeg command I use is listed below for reference:

  rm /tmp/jonas/*
  ffmpeg -i {stream_url} -c copy -bsf:a aac_adtstoasc /tmp/jonas/{filename}.aac
  ffmpeg -i /tmp/jonas/{filename}.aac /tmp/jonas/{filename}.wav
  aws s3 cp /tmp/jonas/{filename}.wav {s3_audio_save_location}

The problem is that my speech-to-text model no longer works on this audio. I use sox to convert the audio, but sox does not seem to pick up the audio, and the model does not work without sox either. This leads me to believe that the two .wav files are formatted differently. I would therefore like to know how I can either write the .wav with the same settings as a .wav that does work, or compare the formatting of the two files and set the new .wav to the correct format manually through ffmpeg.

I used the exiftool package from PyPI to read the metadata of the two files:

The metadata of the working .wav file: (image not available)

The metadata of the .wav file that does not work: (image not available)

As can be seen, the working .wav file has some different settings that I would like to mimic in the second .wav file. Presumably that would make my model work again :)
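For comparing the two files as text rather than screenshots, here is a minimal sketch using Python's standard-library `wave` module. It assumes both files are plain PCM .wav files (which `wave` can read), and the helper names `wav_format` and `format_differences` are made up for this illustration:

```python
import wave


def wav_format(path):
    """Read the header fields of a PCM .wav file that typically matter to a model."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "frames": wav.getnframes(),
        }


def format_differences(working_path, failing_path, ignore=("frames",)):
    """Return {field: (working_value, failing_value)} for every field that differs.

    The frame count is ignored by default because two recordings of
    different length are still format-compatible.
    """
    working = wav_format(working_path)
    failing = wav_format(failing_path)
    return {
        key: (working[key], failing[key])
        for key in working
        if key not in ignore and working[key] != failing[key]
    }
```

Calling `format_differences("working.wav", "new.wav")` (placeholder file names) returns an empty dict when channel count, sample rate and sample width all match; any remaining entries are the settings that would need to be changed in the ffmpeg conversion.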

with kind regards, Jonas

Jonas
  • For next time please keep in mind that images of text are less ideal than copying and pasting the text. Images are often harder to parse, needlessly take up more space, unusable in terms of accessibility, and text can't be copied from images. – llogan Nov 18 '20 at 19:31

1 Answer

I found the answer: the conversion from .aac to .wav needed to be changed to the following line:

ffmpeg -i /tmp/jonas/{filename}.aac -ac 1 -ar 8000 /tmp/jonas/{filename}.wav

The .aac file is copied directly from the video. -ac sets the number of audio channels (1 = mono) and -ar sets the sample rate (8000 Hz).
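If the conversion is driven from Python, the same command can be assembled as an argument list. This is only a sketch: `build_wav_command` and `convert_to_wav` are hypothetical helper names, the defaults of 1 channel and 8000 Hz are simply the values from the command above, and the `-y` flag (overwrite the output without asking) is an addition that is not in the original command.

```python
import subprocess


def build_wav_command(source, target, channels=1, sample_rate=8000):
    """Build the ffmpeg argument list for converting `source` to a .wav file."""
    return [
        "ffmpeg",
        "-y",                      # assumption: overwrite an existing output file
        "-i", source,              # input file, e.g. the extracted .aac
        "-ac", str(channels),      # number of audio channels (1 = mono)
        "-ar", str(sample_rate),   # sample rate in Hz
        target,
    ]


def convert_to_wav(source, target, **kwargs):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_wav_command(source, target, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names, and keeping channels and sample rate as parameters makes it easy to test whether the model really needs 8000 Hz or only mono audio.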

Jonas
  • 1) You can combine both ffmpeg commands into one: `ffmpeg -i {stream_url} -ac 1 -ar 8000 /tmp/jonas/{filename}.wav` 2) Seems odd that speech-to-text wouldn't work with 48000, but accepts the worse 8000. Check that it only needs mono, and try without modifying the sample rate (`-ar`). – llogan Nov 18 '20 at 19:30
  • I did try combining the ffmpeg commands. However, for some strange reason, when I downloaded and manually listened to the .wav afterwards, only the first 2 seconds or so of audio were present and the rest was empty. Rather strange. Fair point about the sample rate. Will try that after the first batch of transcripts to see if it has an influence on the quality of transcriptions, thnx! – Jonas Nov 20 '20 at 09:33