I'm trying to one-hot encode categorical data with dask and export the result to CSV.
The data in question is "movie-actors.dat" from hetrec2011-movielens-2k-v2 (available at https://grouplens.org/datasets/hetrec-2011/). It looks like this (I'm only interested in the first two columns):
movieID actorID
0 1 annie_potts
1 1 bill_farmer
2 1 don_rickles
3 2 bonnie_hunt
4 2 bradley_pierce
5 2 darryl_henriques
The raw data fits into memory just fine: [231742 rows x 2 columns]. There are 95321 unique actorIDs and 10174 unique movieIDs, so the final one-hot matrix will have 10174 rows and 95321 columns, which should take up roughly 970 MB at one byte per cell, plus some overhead.
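(For reference, that 970 MB figure is just a back-of-the-envelope estimate assuming one byte, i.e. uint8, per cell:)

rows, cols = 10174, 95321     # unique movieIDs x unique actorIDs
print(rows * cols / 1e6)      # ~969.8 (MB) before any overhead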
The way I'm trying to encode is:
import dask.dataframe as dd
# assuming a tab-separated file; adjust sep/encoding as needed
df = dd.read_csv('movie-actors.dat', sep='\t', usecols=['movieID', 'actorID'])
dd.get_dummies(df.categorize()).groupby('movieID').sum()
which results in:
actorID_annie_potts actorID_bill_farmer actorID_bonnie_hunt actorID_bradley_pierce actorID_darryl_henriques actorID_don_rickles
movieID
1 1 1 0 0 0 1
2 0 0 1 1 1 0
Running it on the full data fills up all available memory (~13 GB) and results in a MemoryError.
Repartitioning,

dd.get_dummies(df.categorize().repartition(npartitions=20))

and using the opportunistic cache or the single-threaded scheduler don't help either.
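Concretely, by "cache and single thread scheduler" I mean roughly the following sketch (the 2 GB cache size and the output filename are just placeholders; the opportunistic cache needs the cachey package):

import dask
from dask.cache import Cache                  # opportunistic cache, requires cachey

Cache(2e9).register()                         # placeholder value: 2 GB cache
dask.config.set(scheduler='synchronous')      # single-threaded, in-process scheduler

result = dd.get_dummies(df.categorize().repartition(npartitions=20)).groupby('movieID').sum()
result.to_csv('movie-actor-dummies-*.csv')    # placeholder name; still ends in MemoryError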
The problem might be that in the intermediate step (before the sum()) the dataframe has 231742 rows and 95321 columns, which would take up at least 22 GB at one byte per cell, which is larger than my swap partition.
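A quick way to see the scale of that intermediate (a sketch; the exact byte count depends on the dummy dtype):

dummies = dd.get_dummies(df.categorize())
print(len(dummies.columns))   # ~95321 dummy columns (one per actorID) plus movieID
print(len(df))                # 231742 rows
# 231742 rows * 95321 columns at one byte each is roughly 22 GB if ever materialized densely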
How can I make it work?