
I have a number of h5py datasets in one file, where the class label is the dataset name and each dataset has shape (20000, 250000) of float64, compressed using gzip.
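
For reference, a minimal sketch of the lazy-loading side, assuming a hypothetical filename `data.h5`; dask can wrap each h5py dataset without reading it into memory:

```python
import h5py
import dask.array as da

f = h5py.File("data.h5", "r")  # hypothetical filename

# Wrap each dataset lazily; nothing is decompressed until a chunk is
# actually needed. 100 rows * 250000 cols * 8 bytes is ~200 MB per chunk;
# ideally the dask chunks align with the HDF5 chunk shape so each gzip
# block is decompressed only once.
arrays = {name: da.from_array(dset, chunks=(100, 250000))
          for name, dset in f.items()}
```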

How would the community suggest I use dask to enable random forest training without needing to load the entire datasets into memory?

I'm working with a high-core, high-memory instance.

I should have mentioned I have 3 class labels.

Update: My current thinking for loading the data is to create a dask array for each class label, each with shape (20000, 250000), and then concatenate the three arrays together. If I did that, would I be able to use the distributed random forest approach mentioned in the comments to create the smaller training and test data frames I need?
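
A sketch of that concatenation plus a row-sampled train/test pull, building on the `arrays` dict above; the dataset names and sample size here are assumptions:

```python
import numpy as np
import dask.array as da
from sklearn.model_selection import train_test_split

labels = ["class_a", "class_b", "class_c"]  # assumed dataset/class names

# Stack the three (20000, 250000) arrays into one lazy (60000, 250000)
# array and build a matching label vector.
X = da.concatenate([arrays[name] for name in labels], axis=0)
y = da.concatenate([da.full(arrays[name].shape[0], i, dtype=np.int8)
                    for i, name in enumerate(labels)])

# Materialize only a random row sample for scikit-learn. 6000 float64 rows
# is ~12 GB, which should fit on a high-memory instance; shrink if needed.
rng = np.random.default_rng(0)
idx = np.sort(rng.choice(X.shape[0], size=6000, replace=False))  # sorted rows slice faster
X_sub, y_sub = X[idx].compute(), y[idx].compute()

X_train, X_test, y_train, y_test = train_test_split(
    X_sub, y_sub, test_size=0.2, stratify=y_sub, random_state=0)
```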

mobcdi
  • Which random forest implementation are you intending to use? – cel Aug 25 '16 at 07:46
  • scikit-learn, but if you think there is a better alternative I'd look at that too – mobcdi Aug 25 '16 at 08:00
  • You might look at http://matthewrocklin.com/blog/work/2016/04/20/dask-distributed-part-5 – MRocklin Aug 25 '16 at 10:20
  • That does look interesting; any recommendations for getting the data from h5py into a state I could break into subsets for training multiple classifiers? (One possible approach is sketched below.) – mobcdi Aug 25 '16 at 11:13
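
Following the subsetting idea from the blog post linked above, a hedged sketch of one approach: train several independent scikit-learn forests on different random row subsets of the lazy arrays and average their class probabilities. The subset size and number of forests are arbitrary assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_on_subset(X, y, seed, n_rows=2000):
    """Pull one random row subset out of the dask arrays and fit a forest on it."""
    rng = np.random.default_rng(seed)
    idx = np.sort(rng.choice(X.shape[0], size=n_rows, replace=False))
    model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    return model.fit(X[idx].compute(), y[idx].compute())

# Each call materializes only ~2000 rows (~4 GB of float64) at a time,
# never the full ~120 GB of data.
forests = [fit_on_subset(X, y, seed) for seed in range(8)]

def predict(forests, X_new):
    # Average predict_proba across the forests; this assumes every subset
    # happened to contain all three classes so the columns line up.
    proba = np.mean([f.predict_proba(X_new) for f in forests], axis=0)
    return proba.argmax(axis=1)
```

The blog post parallelizes this kind of loop across a cluster with dask.distributed; the serial loop above just keeps the sketch self-contained.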

0 Answers