
I'm trying to one-hot encode categorical data with dask and export the result to CSV.

The data in question is "movie-actors.dat" from hetrec2011-movielens-2k-v2 (available at https://grouplens.org/datasets/hetrec-2011/). It looks like this (I'm only interested in the first two columns):

      movieID           actorID
0        1       annie_potts
1        1       bill_farmer
2        1       don_rickles
3        2       bonnie_hunt
4        2    bradley_pierce
5        2  darryl_henriques
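
Something along these lines reads it in (the tab separator and latin-1 encoding here are a guess at the .dat format and may need adjusting):

import dask.dataframe as dd

# Keep only the two columns of interest; sep/encoding are assumptions about
# the .dat file, not something documented here.
df = dd.read_csv('movie-actors.dat', sep='\t',
                 usecols=['movieID', 'actorID'], encoding='latin-1')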

The raw data fits into memory just fine ([231742 rows x 2 columns]). There are 95321 unique actorIDs and 10174 unique movieIDs, so I'm going to end up with 10174 rows and 95321 columns, which at one byte per dummy cell should take up roughly 970 MB plus some overhead.
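
A quick sanity check on that estimate, assuming one byte per cell (get_dummies produces uint8 columns):

# Back-of-the-envelope size of the final movie x actor matrix, 1 byte per cell
n_movies, n_actors = 10174, 95321
print(n_movies * n_actors / 1e9)  # ~0.97, i.e. roughly 970 MB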

The way I'm trying to encode it is:

import dask.dataframe as dd

# df is the dask DataFrame of movieID/actorID pairs loaded above
dd.get_dummies(df.categorize()).groupby('movieID').sum()

which results in:

         actorID_annie_potts  actorID_bill_farmer  actorID_bonnie_hunt  actorID_bradley_pierce  actorID_darryl_henriques  actorID_don_rickles
movieID
1                          1                    1                    0                       0                         0                    1
2                          0                    0                    1                       1                         1                    0

Running it on the full data fills up all available memory (~13 GB) and results in a MemoryError.

Repartitioning

dd.get_dummies(df.categorize().repartition(npartitions=20))

doesn't help, and neither does using the cache or the single-threaded scheduler.

The problem might be the intermediate step: before the sum(), the dataframe would have 231742 rows and 95321 columns, which would take up at least 22 GB, more than my swap partition.
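
The same arithmetic for that intermediate frame, again at one byte per cell:

# 231742 pair rows x 95321 dummy columns before the groupby/sum
n_rows, n_actors = 231742, 95321
print(n_rows * n_actors / 1e9)  # ~22.1, i.e. roughly 22 GB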

How can I make it work?

Pstrg
  • Have you found a solution after all these years? I've recently gotten stuck on this. – HanaKaze Jan 27 '21 at 17:00
  • To be frank, I think I just abandoned this approach. If you are performing this kind of operation on a single machine (not sure how these methods are going to play with dask), consider using sparse [matrix representations](https://docs.scipy.org/doc/scipy/reference/sparse.html) or spilling the data to disk with a library like [zarr](https://zarr.readthedocs.io/en/stable/tutorial.html#creating-an-array) - it handles compression transparently, offers an API pretty close to numpy arrays, and if you chunk your tensors right the memory consumption should be manageable. Good luck! – Pstrg Jan 28 '21 at 19:13
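
A minimal sketch of the sparse-matrix route from that last comment, assuming plain pandas/scipy on a single machine (the raw pairs fit in memory per the question); the file name and columns come from the question, while the separator/encoding are guesses:

import numpy as np
import pandas as pd
from scipy import sparse

# Read just the two columns; sep/encoding are assumptions about the .dat file.
pairs = pd.read_csv('movie-actors.dat', sep='\t',
                    usecols=['movieID', 'actorID'], encoding='latin-1')

# Categorical codes map each movie to a row index and each actor to a column index.
movies = pairs['movieID'].astype('category')
actors = pairs['actorID'].astype('category')

# Sparse movie x actor indicator matrix: ~231k stored values instead of
# roughly 970 MB of mostly-zero dense data. Duplicate pairs are summed,
# matching the groupby('movieID').sum() behaviour.
one_hot = sparse.coo_matrix(
    (np.ones(len(pairs), dtype='uint8'),
     (movies.cat.codes.to_numpy(), actors.cat.codes.to_numpy())),
    shape=(movies.cat.categories.size, actors.cat.categories.size),
).tocsr()

The row order is movies.cat.categories and the column order is actors.cat.categories, so the result can be written out or converted in chunks without ever materializing the dense frame.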

0 Answers