
I want to use Dask to read a large dataset and feed it to a Keras model. The data consists of audio files, which I read with a custom function. I applied delayed to this function and collect all of the files into a dask array:

import numpy as np
import dask.array as da
from dask import delayed

# one lazy array per audio file, stacked into a single dask array
x = da.stack([da.from_delayed(delayed(get_item_data)(fp, sr, mono, post_processing, data_shape),
                              shape=data_shape, dtype=np.float32)
              for fp in df['path']])

(See the source)
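
For context, here is a purely hypothetical sketch of what a loader like get_item_data might look like (the real one is in the linked source); librosa and the pad/trim logic are assumptions:

import numpy as np
import librosa  # assumption: the actual loader may use a different audio library

def get_item_data(fp, sr, mono, post_processing, data_shape):
    # Hypothetical loader: decode one audio file into a fixed-shape float32 array
    y, _ = librosa.load(fp, sr=sr, mono=mono)      # decode and resample
    if post_processing is not None:
        y = post_processing(y)                     # e.g. normalization or feature extraction
    out = np.zeros(data_shape, dtype=np.float32)   # pad/trim to the fixed shape
    flat = y.ravel()[:out.size]
    out.ravel()[:flat.size] = flat
    return out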

To train the Keras model, I compute X and Y as above and pass them to the fit function.

However, the training is very slow. I have tried changing the chunk size, but it is still very slow.

Could you tell me if I am doing something wrong when creating the array, or suggest any good practices for it?

Thanks

jl.da

1 Answer


As far as I know, Keras doesn't have any built-in support for dask arrays, so I'm not sure what will happen when you pass a dask.array directly to Keras functions. My guess is that it will automatically convert the dask.array into a (possibly very large) numpy array.
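
You would probably have to feed Keras numpy blocks instead. A minimal sketch of that slice-and-feed loop (expanded from the comment below; model, x, y, and the batch/epoch counts here are assumptions):

import numpy as np

batch_size = 32          # assumption: pick to match your chunking
n_epochs = 10            # assumption
n_samples = x.shape[0]

for epoch in range(n_epochs):
    for i in range(0, n_samples, batch_size):
        x_block = np.asarray(x[i:i + batch_size])   # materializes only this slice
        y_block = np.asarray(y[i:i + batch_size])
        model.train_on_batch(x_block, y_block)      # plain Keras API

Each np.asarray call computes just one slice of the dask array, so you never hold the full dataset in memory at once.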

MRocklin
  • Thanks for your answer @MRocklin, I thought Keras would support Dask.arrays because I saw [this thread](http://forums.fast.ai/t/managing-large-datasets-in-memory-and-on-disk/1412/19). Is there any workaround that you suggest? – jl.da May 08 '17 at 14:10
  • You would probably have to feed keras with a sequence of numpy arrays. Perhaps iterate and slice over your dask array in a for loop? `for i in ...: model.train(..., {keras_x: x[i, ...]})` – MRocklin May 16 '17 at 13:05
  • One way to try to improve performance is to turn the shuffle argument of the fit/train command off/0 (or set it to 'batch'); that way it only reads the array in linear blocks, which should be faster (this is how the Keras HDF5Matrix is set up to train more efficiently). See the sketch below. – kmader Oct 12 '17 at 18:39
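
Illustrating that last comment: a sketch, assuming the dask arrays are first persisted to an HDF5 file (the filename and dataset names here are made up), using Keras's HDF5Matrix with shuffle='batch':

import dask.array as da
from keras.utils.io_utils import HDF5Matrix

# one-time cost: write the dask arrays to HDF5
# ('features.h5', '/x', and '/y' are hypothetical names)
da.to_hdf5('features.h5', {'/x': x, '/y': y})

x_train = HDF5Matrix('features.h5', 'x')
y_train = HDF5Matrix('features.h5', 'y')

# shuffle='batch' shuffles whole batches rather than single rows,
# so the HDF5 file is read in contiguous blocks
model.fit(x_train, y_train, batch_size=32, epochs=10, shuffle='batch')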