I'm working on a speech-to-text project for phone calls, using Python and Google Cloud services. The mp3s I receive have one voice in the left channel and the other voice in the right channel.
During testing, I manually split the original mp3 file into two WAV files (one per channel, converted to mono) using Audacity. The recognition accuracy was about 80-90%, which was good enough for my purposes.
However, once I tried to automate the splitting using ffmpeg (more specifically: `ffmpeg -i input_filename.mp3 -map_channel 0.0.0 left.wav -map_channel 0.0.1 right.wav`), the accuracy dropped drastically.
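For reference, this is roughly how the split could be driven from Python. It is only a sketch: the function names `build_split_command` and `split_channels` are made up for illustration, and it assumes `ffmpeg` is on the `PATH` and that the installed build accepts the `-map_channel` option used in the command above.

```python
import subprocess


def build_split_command(mp3_path, left_path, right_path):
    """Return the ffmpeg argument list for the channel split quoted above."""
    return [
        "ffmpeg",
        "-i", mp3_path,
        "-map_channel", "0.0.0", left_path,   # input 0, stream 0, channel 0
        "-map_channel", "0.0.1", right_path,  # input 0, stream 0, channel 1
    ]


def split_channels(mp3_path, left_path="left.wav", right_path="right.wav"):
    """Run ffmpeg; raises subprocess.CalledProcessError on a non-zero exit."""
    command = build_split_command(mp3_path, left_path, right_path)
    subprocess.run(command, check=True)
```

Calling `split_channels("input_filename.mp3")` would then write `left.wav` and `right.wav` to the current directory.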
I've been experimenting for about a week now, but I can't get the accuracy up. For what it's worth, the ffmpeg and Audacity files sound identical to my ear. I found that when I increase the volume of the output files, the accuracy improves, but it is never as good as when I did the splitting with Audacity.
I guess what I'm asking is: what does Audacity do differently?
Here are the `sox -n stat` results for each file:
**Split with ffmpeg (~20-30% accuracy):**
Samples read: 1690560
Length (seconds): 211.320000
Scaled by: 2147483647.0
Maximum amplitude: 0.433350
Minimum amplitude: -0.475739
Midline amplitude: -0.021194
Mean norm: 0.014808
Mean amplitude: -0.000037
RMS amplitude: 0.028947
Maximum delta: 0.333557
Minimum delta: 0.000000
Mean delta: 0.009001
RMS delta: 0.017949
Rough frequency: 789
Volume adjustment: 2.102
**Split with Audacity (80-90% accuracy):**
Samples read: 1689984
Length (seconds): 211.248000
Scaled by: 2147483647.0
Maximum amplitude: 0.217194
Minimum amplitude: -0.238373
Midline amplitude: -0.010590
Mean norm: 0.007423
Mean amplitude: -0.000018
RMS amplitude: 0.014510
Maximum delta: 0.167175
Minimum delta: 0.000000
Mean delta: 0.004515
RMS delta: 0.008998
Rough frequency: 789
Volume adjustment: 4.195
**Original mp3:**
Samples read: 3379968
Length (seconds): 211.248000
Scaled by: 2147483647.0
Maximum amplitude: 1.000000
Minimum amplitude: -1.000000
Midline amplitude: -0.000000
Mean norm: 0.014124
Mean amplitude: -0.000030
RMS amplitude: 0.047924
Maximum delta: 1.015332
Minimum delta: 0.000000
Mean delta: 0.027046
RMS delta: 0.067775
Rough frequency: 1800
Volume adjustment: 1.000
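To put numbers on the differences between the two split files, here is a small sketch that parses the `Key: value` lines of the listings above and compares them. The helper name `parse_sox_stat` is hypothetical, and only a few lines of each listing are copied in to keep it short.

```python
def parse_sox_stat(text):
    """Parse 'Key: value' lines from sox stat output into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Key: value' line
        try:
            stats[key.strip()] = float(value.strip())
        except ValueError:
            continue  # value is not numeric, skip it
    return stats


# Values copied from the listings above.
FFMPEG_STAT = """\
Samples read: 1690560
Length (seconds): 211.320000
RMS amplitude: 0.028947
"""
AUDACITY_STAT = """\
Samples read: 1689984
Length (seconds): 211.248000
RMS amplitude: 0.014510
"""

ffmpeg_stats = parse_sox_stat(FFMPEG_STAT)
audacity_stats = parse_sox_stat(AUDACITY_STAT)

# Samples per second implied by each listing.
ffmpeg_rate = ffmpeg_stats["Samples read"] / ffmpeg_stats["Length (seconds)"]
audacity_rate = audacity_stats["Samples read"] / audacity_stats["Length (seconds)"]

# How many more samples the ffmpeg file has, and the RMS level ratio.
extra_samples = ffmpeg_stats["Samples read"] - audacity_stats["Samples read"]
rms_ratio = ffmpeg_stats["RMS amplitude"] / audacity_stats["RMS amplitude"]

print(ffmpeg_rate, audacity_rate, extra_samples, rms_ratio)
```

With these figures, both files work out to 8000 samples per second, the ffmpeg file is 576 samples (0.072 s) longer, and its RMS amplitude is close to twice that of the Audacity file.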
Two things stand out to me: the durations differ, and so do the amplitudes. Can I tell ffmpeg what duration to produce when it does the splitting? And can I scale the amplitudes so they match the Audacity file? I'm not sure what it will take to get back to the 80% accuracy rate, but adjusting the volume seems to be the most promising lead so far.
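To make the amplitude and duration questions concrete, here is the kind of second pass I have in mind, written as a sketch. It assumes ffmpeg's `volume` audio filter and the `-t` output duration option, applied to an already-split WAV; the function name `build_match_command` and the output filename are hypothetical, and I have not verified that this changes the recognition accuracy.

```python
import subprocess


def build_match_command(src_wav, dst_wav, gain, duration_s):
    """Build an ffmpeg command that scales the level and caps the duration."""
    return [
        "ffmpeg", "-y",                     # -y: overwrite dst_wav if it exists
        "-i", src_wav,
        "-filter:a", f"volume={gain:.4f}",  # linear gain factor
        "-t", f"{duration_s:.3f}",          # stop writing after this many seconds
        dst_wav,
    ]


# Peak magnitudes, taken from the "Minimum amplitude" lines of the listings.
FFMPEG_PEAK = 0.475739
AUDACITY_PEAK = 0.238373

# Gain that would bring the ffmpeg file's peak to the Audacity file's peak.
gain = AUDACITY_PEAK / FFMPEG_PEAK

command = build_match_command("left.wav", "left_matched.wav", gain, 211.248)

if __name__ == "__main__":
    subprocess.run(command, check=True)
```

Note that `-t` only trims the end of the file, so this would not help if the extra samples sit at the start.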
Any help would be greatly appreciated. I don't have to use ffmpeg, but it seems like my only option, since as far as I know Audacity isn't scriptable.