
I'm building a network using the NSynth dataset. It has some 22 GB of data. Right now I'm loading everything into RAM, but this presents some (obvious) problems.

This is an audio dataset, and I want to window the signals and produce more examples by changing the hop size, for example. But because I don't have infinite amounts of RAM, there are very few things I can do before I run out of it (I'm actually only working with a very small subset of the dataset; don't tell Google how I live).

Here's some code I'm using right now:

Code:

import os
import time

import numpy as np
import librosa

# trim_silence, windower and StrechableNumpyArray are helpers defined elsewhere

def generate_audio_input(audio_signal, window_size):
    trimmed_audio = trim_silence(audio_signal, frame_length=window_size)
    split_audio = windower(trimmed_audio, window_size, hop_size=2048)
    return split_audio

start = time.time()

audios = StrechableNumpyArray()

window_size = 5120
pathToDatasetFolder = 'audio/test'
time_per_loaded = []
time_to_HD = []

for file_name in os.listdir(pathToDatasetFolder):
    if file_name.endswith('.wav'):
        now = time.time()
        audio, sr = librosa.load(os.path.join(pathToDatasetFolder, file_name), sr=None)
        time_to_HD.append(time.time()-now)
        output = generate_audio_input(audio, window_size)
        audios.append(np.reshape(output, (-1)))
        time_per_loaded.append(time.time()-now)
audios = audios.finalize()
audios = np.reshape(audios, (-1, window_size))
np.random.shuffle(audios)
end = time.time()-start
print("wow, that took", end, "seconds... might want to change that to mins :)")
print("On average, it took", np.average(time_per_loaded), "per loaded file")
print("With a standard deviation of", np.std(time_per_loaded))

I'm thinking I could load only the filenames, shuffle those, and then yield X loaded results for a more dynamic approach. But in that case, all the windows from a given sound would still sit together inside those X loaded results, which doesn't give very good randomization.
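For example, something like this (the windower here is a simplified stand-in for my actual implementation, and load_fn stands for whatever loads one file):

```python
import numpy as np

def windower(signal, window_size, hop_size):
    # simplified stand-in: slice the signal into overlapping windows
    n = 1 + max(0, (len(signal) - window_size) // hop_size)
    return np.stack([signal[i * hop_size: i * hop_size + window_size]
                     for i in range(n)])

def window_generator(file_names, load_fn, window_size, hop_size, files_per_chunk=8):
    """Shuffle file names, load a chunk of files at a time,
    then pool and shuffle their windows before yielding them."""
    file_names = list(file_names)
    np.random.shuffle(file_names)
    for start in range(0, len(file_names), files_per_chunk):
        chunk = file_names[start:start + files_per_chunk]
        windows = np.concatenate(
            [windower(load_fn(f), window_size, hop_size) for f in chunk])
        np.random.shuffle(windows)  # windows from one file still share a chunk
        for w in windows:
            yield w
```

Shuffling the pooled windows mitigates the problem a bit, but windows from the same file still end up in the same chunk of X results.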

I've also looked into TFRecords, but I don't think they would improve anything over what I propose in the last paragraph.

So, to the clear question: Is there a standard way to load/process (audio) data dynamically in tensorflow?

I would appreciate it if the response is tailored to the particular problem I'm addressing of pre-processing my dataset before starting training.

I would also accept it if the answer is to pre-process the data, save it into a TFRecord and then load the TFRecord, but I think that's sort of overkill.

Scott Stensland
Andrés Marafioti
1 Answer


After discussing with some colleagues over the last few months, I now think that the standard is indeed to use TFRecords. After making a few and understanding how to work with them, I found several advantages and some drawbacks when using them with audio.

Advantages:

  • They completely solve all enqueuing issues with minimal strain on RAM.
  • There are solutions to load examples randomly. How many examples you hold in RAM depends on how frequently you want to go to the HD and how much information you want to load each time you access it.
  • They are easy to share, and the pre-processing is (usually) already incorporated. You can have several processes using them, or several people across different continents, with the certainty that you are all using the same data. This is not true when working with raw audio and processing it on the fly, as different software may apply computations differently (e.g. STFT implementations may change over time).

Drawbacks:

  • They are too static. If you want to change your dataset in any way, you need to create a new one; there is no way to modify individual examples. E.g., after a few iterations I decided to discard tensors with low amplitude. I could handle that in the code after loading a batch, but the only sensible alternative would be to discard the whole batch every time I found an outlier.
  • Creating them is a cumbersome and slow process. There is no way to start working with a TFRecord until it's complete. Additionally, if you decide to change the size of the tensors or the data type, you're going to have to make extra changes to your code and test them, as some errors (e.g. wrong data types) pass silently.
  • Large on HD. Because TFRecords contain examples that are fed directly into your network, they are not equivalent to the raw audio files, so you cannot erase the originals. And because some of the examples in the TFRecord are the product of data-augmentation techniques, they tend to be larger than the original files. (This last one is probably just a normal consequence of working with big datasets.)
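For reference, creating such a file is roughly this (again assuming fixed-length float32 windows and my own 'audio' feature name):

```python
import numpy as np
import tensorflow as tf

def write_windows_to_tfrecord(windows, path):
    """Serialize an iterable of fixed-length float32 windows into one TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for window in windows:
            example = tf.train.Example(features=tf.train.Features(feature={
                "audio": tf.train.Feature(float_list=tf.train.FloatList(
                    value=np.asarray(window, dtype=np.float32)))}))
            writer.write(example.SerializeToString())
```

This is also where the slowness shows: every window goes through Example serialization, and the file is unusable until the loop finishes.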

All in all, I think that even though they are not tailored for audio and are not very easy to implement at first, they are quite convenient and useful. That is probably why most people I've asked who work with big datasets said they use them.

Andrés Marafioti
  • How can you use TFRecords for audio? They are meant to be used with images, am I right? Can you be more exact? – Shahryar Dec 23 '19 at 11:55
  • Audio and images are both binary data that can be represented with something like an array of numbers. If you can use it for images, you can use it for audio. – Jack Jun 12 '20 at 02:54