
I can pivot a smaller dataset fine using pandas, dask, or pyspark.

However, when the dataset exceeds around 2 million rows, it crashes my laptop. The final pivoted table would have 1,000 columns and about 1.5 million rows. I suspect that on the way to the pivoted table there must be some huge intermediate RAM usage that exceeds system memory, and I don't understand how pyspark or dask is supposed to be used (or useful) if the intermediate steps won't fit in RAM at all times.

I thought dask and pyspark would allow larger-than-RAM datasets even with just 8 GB of RAM. I also thought these libraries would chunk the data for me and never exceed the amount of RAM that I have available. I realize that I could read my huge dataset in very small chunks, pivot each chunk, and immediately write each pivoted result to a parquet or HDF5 file, all manually. That should never exceed RAM. But wouldn't this manual effort defeat the purpose of all of these libraries? I am under the impression that what I am describing comes right out of the box with these libraries, or am I wrong here?
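Concretely, here is the kind of manual chunked pivot I mean (a rough sketch only; the file names are placeholders and I am assuming the same id/city/visit_id columns as in my Spark snippet further down):

```python
import pyarrow.parquet as pq

# Open the large parquet file without loading it into memory.
pf = pq.ParquetFile("visits.parquet")  # placeholder file name

# Process one row group at a time so only a small chunk is ever in RAM.
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i, columns=["id", "city", "visit_id"]).to_pandas()

    # Pivot just this chunk: one row per id, one column per city, counting visits.
    pivoted = chunk.pivot_table(
        index="id", columns="city", values="visit_id", aggfunc="count", fill_value=0
    )

    # Write the partial result out immediately so it can be dropped from memory.
    pivoted.to_parquet(f"pivoted_chunk_{i}.parquet")
```

The catch is that if the same id appears in more than one row group, these partial results still need a second aggregation pass to be merged, which is exactly the bookkeeping I was hoping the libraries would handle for me.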

If I have a 100 GB file of 300 million rows and want to pivot it using a laptop, is that even possible? (I can wait a few hours if needed.)

Can anyone help out here? I'll go ahead and add a bounty for this.

Put simply: please show me how to take a large parquet file that is itself too large for RAM and pivot it into a table that is also too large for RAM, without ever exceeding the available RAM (say 8 GB) at any point.

# df is a pyspark DataFrame read from the parquet file
from pyspark.sql.functions import count

df_pivot = df.groupBy(df.id).pivot("city").agg(count(df.visit_id))
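For comparison, the dask version I have in mind looks roughly like this (just a sketch with a placeholder file name; as far as I understand, dask.dataframe.pivot_table requires the column being pivoted to be categorical with known categories):

```python
import dask.dataframe as dd

# Read the parquet file lazily; nothing is loaded into RAM yet.
df = dd.read_parquet("visits.parquet")  # placeholder file name

# dask's pivot_table needs the 'columns' column to be categorical with
# known categories, so categorize() makes an extra pass over the data.
df = df.categorize(columns=["city"])

# Lazily define the pivot: one row per id, one column per city, counting visits.
pivoted = dd.pivot_table(
    df, index="id", columns="city", values="visit_id", aggfunc="count"
)

# Write the result partition by partition rather than calling .compute().
pivoted.to_parquet("pivoted_output/")
```

Writing with to_parquet() instead of compute() is supposed to keep the final result out of RAM, but the shuffle behind the pivot still seems to be where the memory blows up.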
user798719
  • regarding Spark: have you tried to [tune it](https://spark.apache.org/docs/latest/tuning.html)? Good places to start would be `spark.default.parallelism` and/or the number of partitions of your dataframe. Besides, have you set Spark's memory (the default is 1GB)? Do you use the builtin `pivot` function or some UDF? – akoeltringer Aug 14 '17 at 07:35
  • For the versions of dask and pyspark code which crashed for you, maybe you should include the code you tried, so that people can point out potentially better ways. – mdurant Aug 14 '17 at 16:51
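Following up on the tuning suggestion in the first comment: this is roughly how I understand those knobs would be set when building the session (a sketch only; the memory value and partition count are guesses for an 8 GB laptop, not settings I have verified, and the file names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

# Give the driver more than the 1 GB default and raise the shuffle parallelism
# so each task handles a smaller slice of the data. Note that spark.driver.memory
# only takes effect if it is set before the JVM starts (i.e. before the first session).
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.driver.memory", "6g")            # leave headroom on an 8 GB laptop
    .config("spark.sql.shuffle.partitions", "400")  # more, smaller shuffle partitions
    .getOrCreate()
)

df = spark.read.parquet("visits.parquet")  # placeholder file name
df = df.repartition(400)                   # spread the rows out before the wide pivot

df_pivot = df.groupBy(df.id).pivot("city").agg(count(df.visit_id))
df_pivot.write.parquet("pivoted_output/")
```

I have not yet confirmed whether settings like these keep the job within 8 GB on the full dataset.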

0 Answers