
I have a number of h5py datasets in one file, where the class label is the dataset name and each dataset has shape (20000, 250000) of float64, compressed using gzip.
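
For reference, a minimal sketch of the lazy-loading side, assuming a hypothetical filename `data.h5`; dask can wrap each h5py dataset without reading it into memory:

```python
import h5py
import dask.array as da

f = h5py.File("data.h5", "r")  # hypothetical filename

# Wrap each dataset lazily; nothing is decompressed until a chunk is
# actually needed. 100 rows * 250000 cols * 8 bytes is ~200 MB per chunk;
# ideally the dask chunks align with the HDF5 chunk shape so each gzip
# block is decompressed only once.
arrays = {name: da.from_array(dset, chunks=(100, 250000))
          for name, dset in f.items()}
```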

How would the community suggest I use dask to enable random forest training without needing to load the entire datasets into memory?

I'm working with a high-core, high-memory instance.

I should have mentioned I have 3 class labels.

Update: My current thinking for loading the data is to create a dask array for each class label, each with shape (20000, 250000), and then concatenate the three arrays together. If I did that, would I be able to use the distributed random forest approach mentioned in the comments to create the smaller training and test data frames I need?
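
A sketch of that concatenation plus a row-sampled train/test pull, building on the `arrays` dict above; the dataset names and sample size here are assumptions:

```python
import numpy as np
import dask.array as da
from sklearn.model_selection import train_test_split

labels = ["class_a", "class_b", "class_c"]  # assumed dataset/class names

# Stack the three (20000, 250000) arrays into one lazy (60000, 250000)
# array and build a matching label vector.
X = da.concatenate([arrays[name] for name in labels], axis=0)
y = da.concatenate([da.full(arrays[name].shape[0], i, dtype=np.int8)
                    for i, name in enumerate(labels)])

# Materialize only a random row sample for scikit-learn. 6000 float64 rows
# is ~12 GB, which should fit on a high-memory instance; shrink if needed.
rng = np.random.default_rng(0)
idx = np.sort(rng.choice(X.shape[0], size=6000, replace=False))  # sorted rows slice faster
X_sub, y_sub = X[idx].compute(), y[idx].compute()

X_train, X_test, y_train, y_test = train_test_split(
    X_sub, y_sub, test_size=0.2, stratify=y_sub, random_state=0)
```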

mobcdi
  • Which random forest implementation are you intending to use? – cel Aug 25 '16 at 07:46
  • scikit-learn, but if you think there is a better alternative I'd look at that too – mobcdi Aug 25 '16 at 08:00
  • You might look at http://matthewrocklin.com/blog/work/2016/04/20/dask-distributed-part-5 – MRocklin Aug 25 '16 at 10:20
  • That does look interesting; any recommendations for getting the data from h5py into a state I could break into subsets for training multiple classifiers? (One possible approach is sketched below.) – mobcdi Aug 25 '16 at 11:13
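
Following the subsetting idea from the blog post linked above, a hedged sketch of one approach: train several independent scikit-learn forests on different random row subsets of the lazy arrays and average their class probabilities. The subset size and number of forests are arbitrary assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_on_subset(X, y, seed, n_rows=2000):
    """Pull one random row subset out of the dask arrays and fit a forest on it."""
    rng = np.random.default_rng(seed)
    idx = np.sort(rng.choice(X.shape[0], size=n_rows, replace=False))
    model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    return model.fit(X[idx].compute(), y[idx].compute())

# Each call materializes only ~2000 rows (~4 GB of float64) at a time,
# never the full ~120 GB of data.
forests = [fit_on_subset(X, y, seed) for seed in range(8)]

def predict(forests, X_new):
    # Average predict_proba across the forests; this assumes every subset
    # happened to contain all three classes so the columns line up.
    proba = np.mean([f.predict_proba(X_new) for f in forests], axis=0)
    return proba.argmax(axis=1)
```

The blog post parallelizes this kind of loop across a cluster with dask.distributed; the serial loop above just keeps the sketch self-contained.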

0 Answers