I can pivot a smaller dataset fine using pandas, dask, or pyspark.
However, when the dataset exceeds around 2 million rows, it crashes my laptop. The final pivoted table would have about 1.5 million rows and 1,000 columns. I suspect that somewhere on the way to the pivot table there is a huge spike in RAM usage that exceeds system memory, and I don't understand how pyspark or dask is supposed to be useful if the intermediate steps won't fit in RAM at all times.
I thought dask and pyspark would allow larger-than-RAM datasets even with just 8 GB of RAM. I also thought these libraries would chunk the data for me and never exceed the amount of RAM I have available. I realize that I could manually read my huge dataset in very small chunks, pivot each chunk, and immediately write the result of the pivot to a parquet or HDF5 file; that should never exceed RAM (a rough sketch of what I mean follows below). But wouldn't this manual effort defeat the purpose of all of these libraries? I am under the impression that what I am describing is included right out of the box with these libraries, or am I wrong here?
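For illustration, here is a minimal sketch of the manual chunked approach, assuming the input file is already sorted by id so that each batch holds complete groups; the file names and batch size are placeholders, and the column names are the same ones as in my pyspark snippet below.

import os
import pyarrow.parquet as pq

os.makedirs("pivoted", exist_ok=True)
pf = pq.ParquetFile("visits.parquet")  # placeholder input path
for i, batch in enumerate(pf.iter_batches(batch_size=500_000)):
    chunk = batch.to_pandas()
    # pivot just this chunk; only one chunk plus its pivot is in RAM at a time
    pivoted = chunk.pivot_table(
        index="id", columns="city",
        values="visit_id", aggfunc="count", fill_value=0,
    )
    # write the piece out immediately; note the column sets of the pieces
    # can differ if some cities never appear in a given chunk
    pivoted.to_parquet(f"pivoted/part_{i:05d}.parquet")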
If I have a 100 GB file of 300 million rows and want to pivot it on a laptop, is that even possible? (I can wait a few hours if needed.)
Can anyone help out here? I'll go ahead and add a bounty for this.
Please just show me how to take a large parquet file that is itself too large for RAM and pivot it into a table that is also too large for RAM, never exceeding the available RAM (say 8 GB) at any point. Here is my current pyspark attempt:
# df is a pyspark dataframe
from pyspark.sql.functions import count
# one pivoted column per distinct city, counting visits per id
df_pivot = df.groupBy(df.id).pivot("city").agg(count(df.visit_id))
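For comparison, the equivalent pivot in dask would look roughly like this (the file paths are placeholders); as far as I understand, dask's pivot_table needs the pivot column categorized with known categories first, which is itself an extra pass over the data.

import dask.dataframe as dd

ddf = dd.read_parquet("visits.parquet")   # placeholder input path
ddf = ddf.categorize(columns=["city"])    # compute the distinct cities
ddf_pivot = ddf.pivot_table(
    index="id", columns="city", values="visit_id", aggfunc="count"
)
ddf_pivot.to_parquet("pivoted_output")    # placeholder output path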