I want to use Dask to read a large dataset and feed it to a Keras model. The data consists of audio files, and I am using a custom function to read them. I have wrapped this function with delayed and I collect all of the files in a Dask array, as:
    x = da.stack([
        da.from_delayed(
            delayed(get_item_data)(fp, sr, mono, post_processing, data_shape),
            shape=data_shape,
            dtype=np.float32,
        )
        for fp in df['path']
    ])
(See the source)
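For reference, here is a minimal self-contained sketch of this construction. It makes several assumptions: get_item_data is replaced by a hypothetical stand-in that returns random data instead of reading audio, df is a plain dict standing in for my DataFrame, and the values for sr, mono, post_processing and data_shape are made up for illustration.

```python
import numpy as np
import dask.array as da
from dask import delayed


def get_item_data(fp, sr, mono, post_processing, data_shape):
    # Hypothetical stand-in: the real function reads and processes the
    # audio file at `fp`; here we only return random data of the right shape.
    rng = np.random.default_rng(len(fp))
    return rng.random(data_shape, dtype=np.float32)


# Assumed values, for illustration only.
sr, mono, post_processing = 16000, True, None
data_shape = (16000,)
df = {"path": ["a.wav", "b.wav", "c.wav", "d.wav"]}  # stand-in for the DataFrame

# One delayed call per file, each wrapped as a lazy array, then stacked.
x = da.stack([
    da.from_delayed(
        delayed(get_item_data)(fp, sr, mono, post_processing, data_shape),
        shape=data_shape,
        dtype=np.float32,
    )
    for fp in df["path"]
])

print(x.shape)      # overall array shape: (number of files,) + data_shape
print(x.numblocks)  # number of chunks along each axis
```

As far as I can tell, built this way each file ends up as its own chunk along the first axis, so x.numblocks[0] equals the number of files.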
To train the Keras model, I compute X and Y as above and pass them to the model's fit function.
However, the training is very slow. I have tried changing the chunksize, and it is still very slow.
Could you tell me if I am doing something wrong when creating the array, or whether there are good practices for this kind of setup?
Thanks