I am new to using dask, although I have experience in parallel computing and other libraries. I was wondering if someone had good suggestions about which block sizes I should use.
I have done the following workflow previously in memory using scikit-learn with a smaller matrix, and I would now like to scale up to my full dataset.
My matrix will be roughly 4,000 x 2,000,000 -- and I will be doing the following:
1) Creating the large matrix from 6-8 smaller numpy files (I can convert the files to HDF5 if necessary) -- see the first sketch below the list.
2) Converting each of the 2,000,000 columns to a categorical array. I was thinking I could use the dummy encoder available in dask-ml; currently I use the OneHotEncoder from scikit-learn (second sketch below). For this I would assume a block that is bigger in the column sense would be more useful?
3) Then running the SGD classifier with partial_fit on the one-hot matrix (third sketch below). I want to run with the l1 penalty, so I will be running this a few times to get the optimum value of the C parameter. Note: I am running a model with weights. For this I would assume that a block that is bigger in the row sense would be more useful?
I have run this previously with my smaller set using just scikit-learn to make sure everything works and gives reasonable answers, and to tune the parameters for the SGD classifier so that it gives me the same answers as I got with the linear SVC.
My system parameters are pretty flexible since I have access to a variety of node types on Azure, but I would probably be using a machine with no more than 50-100 GB of memory.
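For reference, my back-of-the-envelope numbers for the dense matrix:

```python
import numpy as np

n_rows, n_cols = 4_000, 2_000_000
itemsize = np.dtype("float64").itemsize           # 8 bytes

print(n_rows * n_cols * itemsize / 1e9)           # 64.0 -> ~64 GB dense, before encoding

chunk_rows, chunk_cols = 4_000, 100_000           # one chunking I'm considering
print(chunk_rows * chunk_cols * itemsize / 1e9)   # 3.2 -> ~3.2 GB per chunk
```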
Thanks!