I have a number of h5py datasets in one file, where the class label is the dataset name. Each dataset has shape (20000, 250000), dtype float64, and is compressed with gzip.
How would the community suggest I use dask to enable random forest training without needing to load the entire datasets into memory?
I'm working with a high-core, high-memory instance.
I should have mentioned I have 3 class labels.
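For the lazy-loading part, here is roughly what I had in mind for wrapping one dataset in a dask array without reading it into memory (the file name `data.h5` and dataset name `class_a` are placeholders for my real names):

```python
import h5py
import dask.array as da

# Placeholder names -- my real file/datasets differ.
f = h5py.File("data.h5", "r")

# Wrap one dataset lazily; nothing is read until a chunk is computed.
# Chunking along the row axis keeps each task at a manageable size
# (100 rows x 250000 cols x 8 bytes ~= 200 MB per chunk). Ideally the
# dask chunks should line up with the HDF5 chunk layout, since gzip
# compression means HDF5 decompresses whole chunks at a time.
x = da.from_array(f["class_a"], chunks=(100, 250000))
print(x)  # dask.array<array, shape=(20000, 250000), dtype=float64, ...>
```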
Update:
My current plan for loading the data is to create a dask array for each class label, each with shape (20000, 250000), and then concatenate the 3 arrays together. If I did that, would I be able to use the distributed random forest mentioned in the comments, and then create the smaller training and test sets needed? A sketch of what I mean is below.
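This is a minimal sketch of that plan, assuming dask-ml's `train_test_split` for the splitting step; the dataset names are placeholders, and the estimator at the end would be whichever distributed random forest implementation ends up fitting:

```python
import h5py
import dask.array as da
from dask_ml.model_selection import train_test_split

f = h5py.File("data.h5", "r")
names = ["class_a", "class_b", "class_c"]  # placeholder dataset names

# One lazy array per class label, then stack the rows into X: (60000, 250000).
arrays = [da.from_array(f[n], chunks=(100, 250000)) for n in names]
X = da.concatenate(arrays, axis=0)

# Integer label per row, in the same order as the concatenation.
y = da.concatenate([
    da.full((a.shape[0],), i, dtype="int8") for i, a in enumerate(arrays)
])

# Split lazily; nothing is materialized until the estimator pulls chunks.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```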