I can pivot a smaller dataset fine using pandas, dask, or pyspark.
However, when the dataset exceeds around 2 million rows, it crashes my laptop. The final pivoted table would have about 1.5 million rows and 1,000 columns. I suspect that somewhere on the way to the pivot table there is a huge spike in RAM usage that exceeds system memory, and I don't understand how pyspark or dask is supposed to be useful if the intermediate steps won't fit in RAM at all times.
I thought dask and pyspark would allow larger-than-RAM datasets even with just 8 GB of RAM. I also thought these libraries would chunk the data for me and never exceed the amount of RAM I have available. I realize that I could manually read my huge dataset in very small chunks, pivot each chunk, and immediately write the result of the pivot to a parquet or HDF5 file; that should never exceed RAM (a rough sketch of what I mean follows below). But wouldn't this manual effort defeat the purpose of all of these libraries? I am under the impression that what I am describing is included right out of the box with these libraries, or am I wrong here?
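For illustration, here is a minimal sketch of the manual chunked approach, assuming the input file is already sorted by id so that each batch holds complete groups; the file names and batch size are placeholders, and the column names are the same ones as in my pyspark snippet below.

import os
import pyarrow.parquet as pq

os.makedirs("pivoted", exist_ok=True)
pf = pq.ParquetFile("visits.parquet")  # placeholder input path
for i, batch in enumerate(pf.iter_batches(batch_size=500_000)):
    chunk = batch.to_pandas()
    # pivot just this chunk; only one chunk plus its pivot is in RAM at a time
    pivoted = chunk.pivot_table(
        index="id", columns="city",
        values="visit_id", aggfunc="count", fill_value=0,
    )
    # write the piece out immediately; note the column sets of the pieces
    # can differ if some cities never appear in a given chunk
    pivoted.to_parquet(f"pivoted/part_{i:05d}.parquet")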
If I have a 100 GB file of 300 million rows and want to pivot it on a laptop, is that even possible? (I can wait a few hours if needed.)
Can anyone help out here? I'll go ahead and add a bounty for this.
Please just show me how to take a large parquet file that is itself too large for RAM and pivot it into a table that is also too large for RAM, never exceeding the available RAM (say 8 GB) at any point. Here is my current pyspark attempt:
# df is a pyspark dataframe
from pyspark.sql.functions import count
# one pivoted column per distinct city, counting visits per id
df_pivot = df.groupBy(df.id).pivot("city").agg(count(df.visit_id))
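For comparison, the equivalent pivot in dask would look roughly like this (the file paths are placeholders); as far as I understand, dask's pivot_table needs the pivot column categorized with known categories first, which is itself an extra pass over the data.

import dask.dataframe as dd

ddf = dd.read_parquet("visits.parquet")   # placeholder input path
ddf = ddf.categorize(columns=["city"])    # compute the distinct cities
ddf_pivot = ddf.pivot_table(
    index="id", columns="city", values="visit_id", aggfunc="count"
)
ddf_pivot.to_parquet("pivoted_output")    # placeholder output path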