
I am working on approximately 9 million rows and applying a PySpark UDF to each of them, which blows the data up to about 2 billion rows.

I am then grouping the resulting dataframe, which yields about 64 million rows (fc_ss below is grouped into fc_agg). When I do fc_agg.show() I get an IllegalStateException, while I do not get it for fc_ss.show(). Limiting the number of rows I work with does avoid the problem, but that doesn't help since I need the solution to work for all of the rows.

Is there something I can change in my query to resolve this?
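For context, here is a minimal sketch of the kind of pipeline described above. The table, column names, and UDF body are hypothetical stand-ins; only the overall shape (a UDF that fans each row out into many rows, followed by a groupBy aggregation) comes from the question.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; the real job has approximately 9 million rows.
fc = spark.createDataFrame([(1, 3), (2, 5)], ["id", "n"])

# Hypothetical UDF that fans each input row out into many values
# (the real UDF expands 9 million rows to about 2 billion).
@F.udf(returnType=ArrayType(IntegerType()))
def fan_out(n):
    return list(range(n))

fc_ss = fc.withColumn("val", F.explode(fan_out("n")))

# Group the expanded frame back down (about 64 million groups in the real job).
fc_agg = fc_ss.groupBy("id").agg(F.sum("val").alias("total"))

fc_agg.show()  # the step that fails with IllegalStateException at full scale
```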

[Screenshots: result from fc_ss.show() and result from fc_agg.show()]

  • Could you give a short working example we can reproduce with a createDataframe? Why repartition? – hayj Feb 07 '22 at 14:27
  • Can you share your cluster stats along with working code to reproduce your data? By default, a `.show()` will only print 20 rows, and the error stack states that your context was shut down; from my experience, this can be a memory issue. QQ - why `cache()` each and every step? Will your memory hold the result(s)? Why repartition (the resulting shuffle is very costly in your case)? Can you share the full traceback? – samkart Feb 07 '22 at 14:45
  • @hayj Thanks for your response. I am unable to provide a working example. The default partition count was 200 in my Spark settings, and with some trial and error I realised that for my table 1000 was an apt partition count and it finally seemed to work – Roopanjali Jasrotia Feb 15 '22 at 14:20
  • @samkart: Thanks for your response. I figured out with some trial and error that 200, the default partition count, was a bit low, so I bumped it up to 1000 and it seemed to work better (see the sketch after these comments). The job still took about 3 hrs, but it finished. Also, caching was a bad idea, as you pointed out. Thank you! – Roopanjali Jasrotia Feb 15 '22 at 14:22
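Based on the resolution described in the two comments above, a sketch of the configuration change follows. The value 1000 is what the asker arrived at by trial and error; it is workload-specific, not a general recommendation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raise the shuffle partition count from the default 200 to 1000,
# so each shuffle task handles a smaller slice of the ~2bn-row shuffle.
spark.conf.set("spark.sql.shuffle.partitions", "1000")

# Then rebuild fc_ss and fc_agg *without* the intermediate .cache() calls:
# caching a multi-billion-row intermediate can exhaust executor memory and
# kill the context, which surfaces as the IllegalStateException on .show().
```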

0 Answers