I'm building a network using the NSynth dataset, which is about 22 GB of data. Right now I'm loading everything into RAM,
but this presents some (obvious) problems.
This is an audio dataset, and I want to window the signals and produce more examples, for instance by changing the hop size. But because I don't have infinite RAM,
there is very little I can do before I run out of it (I'm actually only working with a very small subset of the dataset; don't tell Google how I live).
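To put a rough number on how the hop size drives memory use, here is a small sketch. It assumes no padding (only full windows are kept), float32 samples, and a clip of 64000 samples (4 s at 16 kHz, before any silence trimming); `window_count` is a hypothetical helper written for this illustration, not part of my pipeline.

```python
def window_count(num_samples, window_size, hop_size):
    """Number of full windows in a signal when no padding is applied."""
    if num_samples < window_size:
        return 0
    return 1 + (num_samples - window_size) // hop_size


num_samples = 64000  # assumption: 4 s at 16 kHz, before silence trimming
window_size = 5120

for hop_size in (2048, 1024, 512):
    n = window_count(num_samples, window_size, hop_size)
    # float32 takes 4 bytes per sample
    megabytes = n * window_size * 4 / 1e6
    print(f"hop={hop_size}: {n} windows, about {megabytes:.2f} MB per file")
```

Under these assumptions, going from a hop of 2048 to a hop of 512 takes a single file from 29 to 116 windows, so the windowed data grows roughly fourfold.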
Here's some code I'm using right now:
Code:
import os
import time

import librosa
import numpy as np

# trim_silence, windower and StrechableNumpyArray are helpers that are not shown here.

def generate_audio_input(audio_signal, window_size):
    audio_without_silence_at_beginning_and_end = trim_silence(audio_signal, frame_length=window_size)
    splited_audio = windower(audio_without_silence_at_beginning_and_end, window_size, hop_size=2048)
    return splited_audio

start = time.time()
audios = StrechableNumpyArray()
window_size = 5120
pathToDatasetFolder = 'audio/test'
time_per_loaded = []
time_to_HD = []

for file_name in os.listdir(pathToDatasetFolder):
    if file_name.endswith('.wav'):
        now = time.time()
        # sr=None keeps the file's native sampling rate
        audio, sr = librosa.load(pathToDatasetFolder + '/' + file_name, sr=None)
        time_to_HD.append(time.time()-now)
        output = generate_audio_input(audio, window_size)
        # Flatten the windows so they can be appended to the growing 1-D buffer
        audios.append(np.reshape(output, (-1)))
        time_per_loaded.append(time.time()-now)

audios = audios.finalize()
# Back to one window per row, then shuffle the rows in place
audios = np.reshape(audios, (-1, window_size))
np.random.shuffle(audios)
end = time.time()-start

print("wow, that took", end, "seconds... might want to change that to mins :)")
print("On average, it took", np.average(time_per_loaded), "per loaded file")
print("With a standard deviation of", np.std(time_per_loaded))
I'm thinking I could load only the filenames, shuffle those, and then yield X loaded results at a time for a more dynamic approach. But in that case all the different windows of a given sound would still end up inside the same X loaded results, so the randomization would not be very good.
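A minimal sketch of that chunked idea, to make the randomization problem concrete. It uses plain NumPy; `window_signal`, `chunked_window_generator`, `load_fn` and `files_per_chunk` are hypothetical names for illustration, `window_signal` is a no-padding stand-in for the `windower` helper used above rather than its actual code, and silence trimming is left out.

```python
import numpy as np


def window_signal(signal, window_size, hop_size):
    """Slice a 1-D signal into full overlapping windows (no padding)."""
    if len(signal) < window_size:
        return np.empty((0, window_size), dtype=signal.dtype)
    n = 1 + (len(signal) - window_size) // hop_size
    return np.stack([signal[i * hop_size:i * hop_size + window_size]
                     for i in range(n)])


def chunked_window_generator(file_names, load_fn, files_per_chunk,
                             window_size, hop_size, seed=None):
    """Shuffle file names, then load, window and shuffle a few files at a time."""
    rng = np.random.default_rng(seed)
    names = list(file_names)
    # Shuffle at the file level first
    order = rng.permutation(len(names))
    for start in range(0, len(order), files_per_chunk):
        chunk = [names[i] for i in order[start:start + files_per_chunk]]
        # Only the files of the current chunk are held in memory
        windows = np.concatenate(
            [window_signal(load_fn(name), window_size, hop_size) for name in chunk],
            axis=0)
        # Shuffle at the window level, but only within this chunk
        rng.shuffle(windows)
        for window in windows:
            yield window
```

With real files, `load_fn` could be something like `lambda path: librosa.load(path, sr=None)[0]`. Shuffling only happens inside each chunk, so everything drawn from one chunk still mixes windows from just `files_per_chunk` sounds, which is the weak randomization described above.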
I've also looked into TFRecords, but I don't think they would improve on the approach I propose above.
So, to the clear question: Is there a standard way to load/process (audio) data dynamically in TensorFlow?
I would appreciate it if the response were tailored to the particular problem I'm addressing: pre-processing my dataset before starting training.
I would also accept an answer along the lines of "pre-process the data, save it into a TFRecord and then load the TFRecord", but I think that's somewhat overkill.