
I'm working on a speech-to-text project for phone calls, using Python and Google Cloud's speech recognition service. The MP3s I receive have one voice in the left channel and the other voice in the right channel.

During testing, I manually split the original MP3 into two mono WAV files, one per channel, using Audacity. The transcription accuracy was about 80-90%, which was perfect for my purposes.

However, once I tried to automate the splitting with ffmpeg (specifically: `ffmpeg -i input_filename.mp3 -map_channel 0.0.0 left.wav -map_channel 0.0.1 right.wav`), the accuracy dropped drastically.
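For reference, a minimal sketch of how that same command can be wrapped in the project's Python code, assuming ffmpeg is on the PATH (file names are placeholders). The commented-out `channelsplit` variant does the same split through a filter instead of `-map_channel`, in case that option is the culprit:

```python
import subprocess

def split_stereo(src: str, left: str = "left.wav", right: str = "right.wav") -> None:
    """Split a stereo file into two mono WAV files, one per channel."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-map_channel", "0.0.0", left,
         "-map_channel", "0.0.1", right],
        check=True,
    )
    # Same split via the channelsplit filter instead of -map_channel:
    # ffmpeg -i src -filter_complex "channelsplit=channel_layout=stereo[l][r]"
    #        -map "[l]" left.wav -map "[r]" right.wav

split_stereo("input_filename.mp3")
```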

I've been experimenting for about a week now, but I can't get the accuracy up. For what it's worth, the two sets of files sound identical to the human ear. I did find that increasing the volume of the output files improves the accuracy, but never to the level I got when splitting with Audacity.

I guess what I'm trying to ask is, what does Audacity do differently?

Here are the `sox <file> -n stat` results for each file:

**Split with ffmpeg (~20-30% accuracy):**

```
Samples read:           1690560
Length (seconds):    211.320000
Scaled by:         2147483647.0
Maximum amplitude:     0.433350
Minimum amplitude:    -0.475739
Midline amplitude:    -0.021194
Mean    norm:          0.014808
Mean    amplitude:    -0.000037
RMS     amplitude:     0.028947
Maximum delta:         0.333557
Minimum delta:         0.000000
Mean    delta:         0.009001
RMS     delta:         0.017949
Rough   frequency:          789
Volume adjustment:        2.102
```

**Split with Audacity (80-90% accuracy):**

```
Samples read:           1689984
Length (seconds):    211.248000
Scaled by:         2147483647.0
Maximum amplitude:     0.217194
Minimum amplitude:    -0.238373
Midline amplitude:    -0.010590
Mean    norm:          0.007423
Mean    amplitude:    -0.000018
RMS     amplitude:     0.014510
Maximum delta:         0.167175
Minimum delta:         0.000000
Mean    delta:         0.004515
RMS     delta:         0.008998
Rough   frequency:          789
Volume adjustment:        4.195
```

**Original MP3:**

```
Samples read:           3379968
Length (seconds):    211.248000
Scaled by:         2147483647.0
Maximum amplitude:     1.000000
Minimum amplitude:    -1.000000
Midline amplitude:    -0.000000
Mean    norm:          0.014124
Mean    amplitude:    -0.000030
RMS     amplitude:     0.047924
Maximum delta:         1.015332
Minimum delta:         0.000000
Mean    delta:         0.027046
RMS     delta:         0.067775
Rough   frequency:         1800
Volume adjustment:        1.000
```

One thing that stands out to me is that the durations aren't the same, and neither are the amplitudes. Can I tell ffmpeg what the duration should be when it does the splitting? And can I scale the amplitudes of my output to match the Audacity files? I'm not sure what else to try to reach the 80% accuracy rate, but increasing the volume seems like the most promising lead so far (see the sketch below).
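This is the kind of gain scaling I've been trying for the volume experiments; a minimal sketch using ffmpeg's `volume` filter, where the 2.102 factor is simply the "Volume adjustment" value sox reported for the ffmpeg-split file (treat the exact gain and file names as placeholders):

```python
import subprocess

def amplify(src: str, dst: str, gain: float) -> None:
    """Write a copy of src with a linear gain applied."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-af", f"volume={gain}", dst],
        check=True,
    )

# sox says the ffmpeg-split file can be boosted ~2.102x before clipping.
amplify("left.wav", "left_louder.wav", 2.102)
```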

Any help would be greatly appreciated. I don't have to use ffmpeg, but it seems like my only option, as Audacity isn't scriptable.

Diogo A.
  • Can't reproduce here. Share full log of ffmpeg execution. – Gyan Mar 09 '18 at 15:30
  • The issue was: I think Audacity normalizes the volume of the two channels when I do the split. I normalized the original file with ffmpeg-normalize before splitting the two channels and the accuracy is very similar to the accuracy when I used Audacity to split the files. – Adrian Chromenko Mar 09 '18 at 20:27
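Following up on that last comment, a minimal sketch of the normalize-then-split pipeline it describes, assuming the `ffmpeg-normalize` CLI (installable via `pip install ffmpeg-normalize`) with its default EBU R128 loudness normalization; the flags and file names beyond what the comment mentions are assumptions:

```python
import subprocess

def normalize_then_split(src: str) -> None:
    """Loudness-normalize the stereo file, then split it into mono channels."""
    # Normalize the original before splitting (ffmpeg-normalize defaults to EBU R128).
    # The -c:a pcm_s16le codec choice for WAV output is an assumption.
    subprocess.run(
        ["ffmpeg-normalize", src, "-o", "normalized.wav", "-c:a", "pcm_s16le"],
        check=True,
    )
    # Split the normalized file into one mono WAV per channel.
    subprocess.run(
        ["ffmpeg", "-y", "-i", "normalized.wav",
         "-map_channel", "0.0.0", "left.wav",
         "-map_channel", "0.0.1", "right.wav"],
        check=True,
    )

normalize_then_split("input_filename.mp3")
```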
