I have a huge PySpark DataFrame (800k rows). I tried to collect only one cell of a column, but it failed. I am running my code on the EMR service. It looks like a memory problem.

print(df.collect()[0][1])

I get this error:

An error occurred while calling o199.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 303 tasks (1026.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)...

  • @blackbishop, it is the same question as mine. However, I don't understand the way that collect works. Here I need only one cell. Does collect go and gather all the data and then return me one cell, or does it just go and look for one cell, but this cell is too big? – LearnToGrow Feb 09 '22 at 17:03
  • It collects all the data into a Python list object, then prints the first cell. If you want one row, then select only the columns you want and limit to one row, e.g. `df.select("colx").limit(1).collect()...` – blackbishop Feb 09 '22 at 17:06

0 Answers