5

I am currently trying to create a large dataset for deep learning, consisting of many compressed MP3 files stored together, so that I don't have 100k files to load individually.

x = b''
with open("file1.mp3", "rb") as f:
    x += f.read()
print(len(x)) # 362861
with open("file2.mp3", "rb") as f:
    x += f.read()
print(len(x)) # 725722
with open("testdataset", 'wb+') as f:
    f.write(x)
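
Reading the files back one by one requires knowing where each file starts and ends, which the hard-coded byte count above only covers for the first file. A minimal sketch of recording those offsets, assuming a JSON side index is acceptable; the helper names `pack_files` and `read_entry` and the index layout are hypothetical, not part of librosa or any other library:

```python
import json
import os


def pack_files(paths, blob_path, index_path):
    """Concatenate files into one blob and record offset/length per file."""
    index = []
    offset = 0
    with open(blob_path, "wb") as blob:
        for path in paths:
            with open(path, "rb") as f:
                data = f.read()
            blob.write(data)
            index.append({
                "name": os.path.basename(path),
                "offset": offset,
                "length": len(data),
            })
            offset += len(data)
    # The index layout is an assumption made for this sketch.
    with open(index_path, "w") as f:
        json.dump(index, f)
    return index


def read_entry(blob_path, index, i):
    """Return the raw bytes of the i-th packed file."""
    entry = index[i]
    with open(blob_path, "rb") as blob:
        blob.seek(entry["offset"])
        return blob.read(entry["length"])
```

For example, `index = pack_files(["file1.mp3", "file2.mp3"], "testdataset", "testdataset.json")` followed by `read_entry("testdataset", index, 1)` would return the bytes of the second file without reading the first.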

Now I want to load this one by one:

with open("testdataset", 'rb') as f:
    bs = f.read(362861)
    y, sr = librosa.core.load(io.BytesIO(bs), mono=True, sr=44100, dtype=np.float32) # crashes

It breaks with the following error:

RuntimeError: Error opening <_io.BytesIO object at 0x7f509ed1cf90>: File contains data in an unknown format.

For testing I tried to load the original file, which works fine:

y, sr = librosa.core.load("file1.mp3", mono=True, sr=44100, dtype=np.float32) # works fine

Note that this test load of the original MP3 also emits a warning:

UserWarning: PySoundFile failed. Trying audioread instead.
  warnings.warn('PySoundFile failed. Trying audioread instead.')

Why is this happening? Is there maybe a better way to store a lot of separate files together and load them at once?

Here are the versions that I am using:

python: 3.8.3 (default, May 14 2020, 20:11:43) 
[GCC 7.5.0]
librosa: 0.7.2
audioread: 2.1.8
numpy: 1.19.0
scipy: 1.5.0
sklearn: 0.23.1
joblib: 0.15.1
decorator: 4.4.2
six: 1.15.0
soundfile: 0.10.3
resampy: 0.2.2
numba: 0.48.0
Jonathan R
  • _consisting of a lot of compressed mp3 files stored together_ Are you sure you can just concatenate all the files like that? Is the result a valid MP3 file, or other recognized format? – AMC Jun 26 '20 at 21:06
  • I don't think it should matter. I am not trying to load the concatenated file as a valid mp3 file but just the first N bytes, which are exactly the same as in the first file (362861 bytes to be exact). – Jonathan R Jun 26 '20 at 22:15
  • I had the same issue; it looks like you must write the data on disk because depending on the file format you have, `audioread` may issue shell commands that require a filename. – bfontaine Mar 20 '23 at 22:19
  • See also: https://librosa.org/doc/main/ioformats.html#read-file-like-objects for an example that uses `soundfile` to read from a `BytesIO` object. – bfontaine Mar 20 '23 at 22:30

2 Answers

0

If you are using torchaudio, do: !pip install torch==1.11.0 torchaudio==0.11.0 -f https://download.pytorch.org/whl/cu113/torch_stable.html

Omar
-1

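librosa tries soundfile first, and soundfile does not support MP3 files (reading or encoding) in the versions listed in the question. That is why loading from a filename prints the warning about falling back to audioread.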

https://librosa.org/doc/main/generated/librosa.load.html
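
One possible workaround, following the comment on the question that `audioread` may need a real filename, is to write the byte slice to a temporary file and pass that path to the loader. This is only a sketch under that assumption; the helper name `call_with_temp_file` is hypothetical and not part of librosa:

```python
import os
import tempfile


def call_with_temp_file(data, fn, suffix=".mp3"):
    """Write `data` to a temporary file, call fn(path), then delete the file."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # The file is closed before fn runs, so a decoder can open it by name.
        return fn(path)
    finally:
        os.remove(path)
```

With the byte slice `bs` from the question, usage might look like `y, sr = call_with_temp_file(bs, lambda p: librosa.load(p, mono=True, sr=44100))`. This costs one extra disk write per sample, so it may be worth pointing the temporary directory at fast storage.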

Netanel