I'm trying to run PySpark operations in a Jupyter Notebook, and there seems to be a (rather low) working-memory threshold above which it halts with an error. The laptop has 16 GB of RAM (about 50% of which is free while the script runs), so physical memory shouldn't be the problem. Spark runs on a 64-bit JVM, version 1.8.0_301, and the Jupyter Notebook runs on Python 3.9.5.
The dataframe consists of only 360K rows and two 'long'-type columns (i.e. only ca. 3.8 MB). The script works fine if I reduce the dataframe to about 1.5 MB of memory usage (49,200 rows), but above that it fails on the df.toPandas() call with the following error message (extract):
Py4JJavaError: An error occurred while calling o234.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 50.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 50.0 (TID 577) (BEXXXXXX.subdomain.domain.com executor driver):
TaskResultLost (result lost from block manager)
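For context, the failing step is essentially the following (a minimal sketch only: the real dataframe is built from other data, and the session here is created with defaults; the configured builder I actually use is shown further below):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Roughly the shape of my data: ~360K rows, two 'long' columns
df = spark.range(0, 360000).select(F.col("id"), (F.col("id") * 2).alias("value"))

pdf = df.toPandas()   # this is the call that fails with TaskResultLost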
This is a well-known error message for PySpark running into memory limits, so I tried to adjust the settings as follows:
In the %SPARK_HOME%/conf/spark-defaults.conf file:
spark.driver.memory 4g
In the Jupyter notebook itself:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config("spark.driver.memory", "4G")\
    .config("spark.driver.maxResultSize", "4G")\
    .appName("MyApp")\
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
spark.sparkContext.setSystemProperty('spark.executor.memory', '4G')
I tried playing with the values of spark.driver.memory, spark.executor.memory, etc., but the threshold seems to stay the same.
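To verify that the values are actually picked up by the running session, they can be read back from the SparkContext (a plain SparkConf readback, nothing specific to my script):

# Read back the memory-related settings the running driver actually uses
conf = spark.sparkContext.getConf()
for key in ("spark.driver.memory", "spark.driver.maxResultSize", "spark.executor.memory"):
    print(key, "=", conf.get(key, "<not set>"))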
The Spark UI (at http://localhost:4040) shows, in the Executors tab, Storage Memory: 603 KiB / 2 GiB, Input: 4.1 GiB, Shuffle Read: 60.6 MiB, Shuffle Write: 111.3 MiB. These values are essentially the same when I reduce the dataframe size below 1.5 MB and the script runs properly.
Does anyone have an idea how to raise this roughly 1.5 MB limit, and where it might be coming from?