
I am new to using dask, although I have experience in parallel computing and other libraries. I was wondering if someone had good suggestions about which block sizes I should use.

I have previously done the following workflow in memory using scikit-learn with a smaller matrix. I would now like to scale up to my full dataset.

My matrix will be roughly 4,000 x 2,000,000 -- and I will be doing the following:

1) Creating the large matrix from 6-8 smaller numpy files (I can convert the files to HDF5 if necessary)

2) Converting each of the 2,000,000 columns to a categorical array. I was thinking I could use the DummyEncoder available in dask-ml; currently I use the OneHotEncoder from scikit-learn. For this step I would assume a block that is bigger along the columns would be more useful?

3) Then running the SGD classifier (via partial_fit) on the one-hot matrix. I want to use the l1 penalty, so I will be running this a few times to find the optimum value of the C parameter. Note: I am running a model with weights. For this step I would assume a block that is bigger along the rows would be more useful? (A rough sketch of the whole workflow follows this list.)
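
To make this concrete, here is a rough sketch of the workflow I have in mind (not working code -- the file names, the 1,000-row chunks, and the binary class labels are placeholders, I am assuming the Incremental wrapper in dask-ml can drive SGDClassifier.partial_fit block by block, and the encoding step is only indicated by a comment):

```python
# Rough sketch only: file names, the 1,000-row chunks, and the class labels
# are placeholders; dask_ml.wrappers.Incremental is assumed to be available.
import numpy as np
import dask.array as da
from sklearn.linear_model import SGDClassifier
from dask_ml.wrappers import Incremental

# 1) Assemble the big array lazily from the smaller .npy files; with
#    mmap_mode='r' nothing is read from disk until a block is needed.
parts = [np.load("part{}.npy".format(i), mmap_mode="r") for i in range(8)]
X = da.concatenate(
    [da.from_array(p, chunks=(1000, p.shape[1])) for p in parts],
    axis=0,  # axis=1 instead if the files split the columns rather than the rows
)

# 2) Dummy / one-hot encoding omitted here -- dask-ml's encoders work
#    block-wise, so the chunking chosen above carries through.

# 3) Incremental SGD: partial_fit is called once per block, so each block
#    should span all columns but only a slice of the rows.
y = da.from_array(np.load("labels.npy", mmap_mode="r"), chunks=1000)
clf = Incremental(SGDClassifier(loss="hinge", penalty="l1", alpha=1e-4))
clf.fit(X, y, classes=[0, 1])  # classes= is forwarded to partial_fit
```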

I have run this previously on my smaller set using just scikit-learn, to make sure everything works and gives reasonable answers, and to tune the parameters of the SGD classifier so it gives me the same answers as I got with the linear SVC.

My system parameters are pretty flexible since I have access to a variety of node types on Azure, but I would probably be using a machine with no more than 50-100 GB of memory.
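
For reference, here is the back-of-the-envelope block-size arithmetic I am working from (a sketch only -- float64 storage and the 5-10x one-hot expansion mentioned in the comments below are assumptions, not measurements):

```python
# Assumes float64 (8 bytes per value) and a 5-10x column expansion after
# one-hot encoding; numbers are estimates, not measurements.
n_cols = 2_000_000

def gb_per_block(rows, cols, itemsize=8):
    return rows * cols * itemsize / 1e9

print(gb_per_block(250, n_cols))       # 4.0  GB: raw matrix, full-width 250-row block
print(gb_per_block(250, 10 * n_cols))  # 40.0 GB: same block, densely one-hot encoded
# So on a 50-100 GB machine the encoded blocks either have to stay sparse or
# the row chunks have to be much smaller than the raw-matrix numbers suggest.
```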

Thanks!

SWZ
  • Isn't this more of a question for [CrossValidated](https://stats.stackexchange.com) or [DataScience](https://datascience.stackexchange.com) communities? – sophros Jan 23 '18 at 13:59
  • How many nonzeros are there? I don't expect you need out-of-memory processing (which is usually also quite painful for SGD-like algorithms because of the random permutations in each epoch). – sascha Jan 23 '18 at 17:06
  • I believe I am following the correct suggestion for questions regarding dask -- "Usage questions are directed to Stack Overflow with the #dask tag. Dask developers monitor this tag and get e-mails whenever a question is asked." http://dask.pydata.org/en/latest/support.html – SWZ Jan 23 '18 at 17:45
  • Sascha - I am assuming you are asking whether I am using sparse matrices. This would be ~64 GB of memory without converting to one-hot, I believe. Converting to one-hot then makes it roughly 5-10 times bigger (if in full matrix form), which gets me to around 640 GB. Even if it can be sparse (let's say 20% non-zero), I end up having to double or triple it for the SVM to work. This is really hitting a limit for a 50 GB RAM machine. – SWZ Jan 23 '18 at 17:50
  • Sascha - also, dask (the library I am asking about) already has this implemented, and/or you can do it via the partial_fit method in scikit-learn. – SWZ Jan 23 '18 at 18:01
  • 20% is not sparse. And why should one-hot encoding increase the size by a factor of 5-10? It was a simple question: how many nnz / what nnz ratio? And there is no answer. It would surprise me to run into memory limits. Maybe you've got some magic data (2M cols without one-hot encoding; strange). I'm aware of partial_fit and I'm also aware of what happens if you use it wrong (e.g. not doing permutations, which will cost a lot for out-of-memory approaches). You will have to take care of the out-of-memory setup at some point, as partial_fit is just one small component expecting something. – sascha Jan 24 '18 at 11:51
  • One-hot encoding depends on the number of levels per categorical variable (in my case each column is a different categorical variable). Each variable has approximately 5-10 levels (k), which means each column is expanded into 5-10 columns. FYI - since I know my data and how big it is, this conversation seems not terribly productive. Dask takes care of the out-of-memory setup - but that is why I am asking about block size (the original question): to avoid unnecessary out-of-memory costs. That was the point of my original question. – SWZ Jan 25 '18 at 15:07
  • Also, whether or not you think my data is magic, it is common in science to have short and fat matrices, especially in genomics, which is where my data comes from. I don't know the nnz ratio for this particular dataset since I don't have it loaded, but from experience with previous data it is 10-20%, which is what I told you above. You can argue whether sparse is useful or not, but again that is not the point of my question. – SWZ Jan 25 '18 at 15:20
  • I would like to reiterate that I do not want to argue about my matrix size, the memory it uses, or how sparse it is. I would like advice on how to use Dask and what the appropriate block size is for my short and fat matrix, to be as efficient as possible. I only described my workflow in detail so that the Dask peeps would have as much info as possible to give their advice. – SWZ Jan 25 '18 at 15:30
