I am working with approximately 9 million rows, applying a PySpark UDF to each of them, which expands the data to roughly 2 billion rows.

I then group the resulting DataFrame down to about 64 million rows (`fc_ss` below grouped into `fc_agg`). When I call `fc_agg.show()` I get an `IllegalStateException`, whereas `fc_ss.show()` runs without error. Limiting the number of input rows makes the error go away, but that isn't a solution since I need this to work on the full dataset.
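For context, the shape of the pipeline is roughly this (a minimal, self-contained sketch; the input data, the UDF body, and the grouping key are placeholders standing in for my actual code):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, LongType

spark = SparkSession.builder.getOrCreate()

# stand-in for my ~9mn-row input (the real data comes from elsewhere)
fc = (spark.range(9_000_000)
      .withColumnRenamed("id", "key")
      .withColumn("value", F.col("key") % 1000))

# stand-in UDF: returns an array that explode() fans out to ~2bn rows
@F.udf(returnType=ArrayType(LongType()))
def expand(v):
    return [v * 200 + i for i in range(200)]

fc_ss = fc.withColumn("item", F.explode(expand(F.col("value"))))

# grouping the exploded frame back down (~64mn rows in my real data)
fc_agg = fc_ss.groupBy("item").agg(F.count(F.lit(1)).alias("cnt"))

fc_ss.show()   # works fine
fc_agg.show()  # throws java.lang.IllegalStateException
```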
Is there something I can change in my query to resolve this?