I have downloaded the Kaggle Speech Accent Archive to learn how to handle audio data. I'm comparing three ways of reading the mp3s in this dataset: the first uses TensorFlow's AudioIOTensor, the second uses Librosa, and the third uses PyDub. I let each of them read the same mp3 file; however, all 3 get different results on the same file.
I used this code:
import librosa
import numpy as np
import os
import pathlib
import pyaudio
from pydub import AudioSegment as pydub_AudioSegment
from pydub.utils import mediainfo as pydub_mediainfo
import tensorflow as tf
import tensorflow_io as tfio
DATA_DIR = <Path to data>
data_path = pathlib.Path(DATA_DIR)
mp3Files = [x for x in data_path.iterdir() if '.mp3' in x.name]
def load_audios(file_list):
    dataset = []
    for curr_file in file_list:
        tf2 = tfio.audio.AudioIOTensor(curr_file.as_posix())
        librsa, librsa_sr = librosa.load(curr_file.as_posix())
        pdub = pydub_AudioSegment.from_file(curr_file.as_posix(), 'mp3')
        dataset.append([tf2, librsa, librsa_sr, pdub, curr_file.name])
    return dataset
audios = load_audios(mp3Files[0:1]) # Reads 'afrikaans1.mp3' from
# Kaggle's Speech Accent Archive
tf2 = audios[0][0]
libr = audios[0][1]
libr_sr = audios[0][2]
pdub = audios[0][3]
But now when I start comparing the way these 3 modules read the same mp3 file I see this behavior for Tensorflow's AudioIOTensor:
>> tf2_arr = tf.squeeze(tf2.to_tensor(),-1).numpy()
>> tf2_arr, tf2, tf2_arr.shape # Gives raw data, sampling rate & shape
(array([ 0.00905748, 0.01102116, 0.00883307, ..., -0.00131128,
-0.00134344, -0.00090137], dtype=float32),
<AudioIOTensor: shape=[916057 1], dtype=<dtype: 'float32'>, rate=44100>,
(916057,))
>> np.argmax(tf2_arr), np.argmin(tf2_arr)
(113149, 106715)
This behavior for Librosa:
>> libr, libr_sr, libr.shape # Gives raw data, sampling rate & shape
(array([ 0.00711342, 0.01064209, 0.00806945, ..., -0.00168153,
-0.00148052, 0. ], dtype=float32),
22050,
(458029,))
And for PyDub, I see this:
>> pdub_data = np.array(pdub.get_array_of_samples())
>> pdub_data, pdub.frame_rate, pdub_data.shape # Gives raw data, sampling rate
# & shape
(array([297, 361, 289, ..., -43, -44, -30], dtype=int16), 44100, (916057,))
Although all the raw values disagreed with each other, the first reassuring thing I noticed is that the AudioIOTensor and PyDub results have the same sampling frequency (44100) and the same shape ((916057,)). Librosa's result, however, has a sampling frequency (22050) and shape ((458029,)) that are exactly half those of the other two.
Next, I looked to see where the max and mins of each array was. I found this:
>> np.argmax(tf2_arr), np.argmin(tf2_arr)
(113149, 106715)
>> np.argmax(pdub_data), np.argmin(pdub_data)
(113149, 106715)
>> np.argmax(libr)*2, np.argmin(libr)*2
(113150, 106714)
So, allowing for the fact that Librosa has half the sampling rate of the other two libraries, all three agree on where the maxima and minima are.
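(This near-alignment is what a 2x rate change should produce: downsampling maps sample index i to roughly i/2, so doubling the downsampled index recovers the original position only to within a sample or two. A toy numpy sketch with synthetic data, not the mp3:)

```python
import numpy as np

# Synthetic signal with a known peak near the index seen in the mp3 data.
rng = np.random.default_rng(0)
x = rng.standard_normal(200_000).astype(np.float32) * 0.1
x[113_148] = 1.0  # plant a peak at an arbitrary even index

# Naive 2x decimation as a stand-in for resampling 44100 -> 22050 Hz:
x_half = x[::2]
print(np.argmax(x), np.argmax(x_half) * 2)  # indices agree up to +/- 1
```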
Lastly, I checked whether TensorFlow's AudioIOTensor and PyDub's results are separated by a constant multiplicative factor, by averaging the ratios of their values at the max and min positions:
>> pdub_data[113149]/tf2_arr[113149], pdub_data[106715]/tf2_arr[106715]
(32768.027, 32768.184)
>> test = tf2_arr * 32768.105
>> diff = test-pdub_data
>> np.max(diff), np.min(diff)
(0.578125, -0.5917969)
Since pdub_data had values ranging from -22269 to 23864 (i.e. I checked np.max(pdub_data) and np.min(pdub_data)), I was willing to assume that differences bounded by +/- 0.6 were due to rounding and similar effects. I was willing to assume the same would hold for Librosa, but now I'm left wondering: why?
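(For what it's worth, a bound of roughly half a count is exactly what quantisation predicts: representing a scaled float sample as a 16-bit integer can move it by up to 0.5, and two independent decoders can disagree by about another least-significant bit. A quick numpy illustration of the half-count part, using synthetic values rather than the mp3 samples:)

```python
import numpy as np

# Random float samples on the [-1, 1) scale, as a float32 decoder returns.
rng = np.random.default_rng(1)
floats = rng.uniform(-0.7, 0.7, 10_000).astype(np.float32)

# Scale to the int16 range and round, as an int16 decoder effectively does.
ints = np.round(floats * 32768).astype(np.int16)

# The reconstruction error is at most about half an integer count.
diff = floats * 32768 - ints
print(diff.min(), diff.max())
```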
I would've thought that reading an mp3 file wouldn't leave room for interpretation. Raw data was stored using whatever rules mp3 uses and should be recovered when the file is read.
Why do these 3 libraries differ in the raw numbers they return, and in one case also in the sampling rate corresponding to the returned data? How can I get one or all of them to return the raw data stored in the mp3 format? Should I attach any significance to the fact that the ratio between the pdub_data values and the tf2_arr values is 32768 (i.e. 2^15)?
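(On the 2^15 point: 32768 is the full-scale magnitude of a signed 16-bit integer, and the usual convention for converting between int16 PCM and float PCM is to divide or multiply by 2^15, mapping [-32768, 32767] onto roughly [-1.0, 1.0). A sketch of that mapping, using the first and last samples PyDub reported:)

```python
import numpy as np

# Full-scale of a signed 16-bit integer is 2**15 = 32768.
print(np.iinfo(np.int16).min, np.iinfo(np.int16).max, 2**15)

# The int16 samples PyDub reported, rescaled to the float convention:
ints = np.array([297, 361, 289, -43, -44, -30], dtype=np.int16)
floats = ints.astype(np.float32) / 2**15
print(floats)  # close to, but not bit-identical with, the float32 readers
```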
=====================================================================
Later Thoughts: I'm wondering if part of the reason for the differences between these libraries lies in the variable types they use. Librosa uses float32 and PyDub uses int16. So it might make sense that PyDub sees twice as many numbers as Librosa, which would give it twice the sampling rate. Similarly, AudioIOTensor differs from PyDub by a factor of 2^15. If one prepends 15 bits to a 16-bit int, with one more to handle the sign, one could conceivably get a 32-bit float. But both of these cases seem to imply that one set of numbers will be, in some sense, 'wrong'...
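(One part of this hypothesis is directly checkable with numpy: casting between int16 and float32 rescales each sample but never changes how many samples an array holds, so dtype alone cannot halve an array's length. A two-line check using an array of the same size PyDub returned:)

```python
import numpy as np

# Casting between int16 and float32 rescales samples but never changes
# how many there are, so dtype alone cannot halve an array's length.
ints = np.zeros(916_057, dtype=np.int16)       # pdub-sized array
floats = ints.astype(np.float32) / 2**15
print(ints.shape, floats.shape)  # (916057,) (916057,)
```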