I'm trying to run PySpark operations in a Jupyter Notebook, and there seems to be a (rather low) working-memory threshold above which it halts with an error. The laptop has 16 GB of RAM (about 50% of which is free while the script runs), so physical memory shouldn't be the problem. Spark runs on a 64-bit JVM, version 1.8.0_301, and the Jupyter Notebook runs on Python 3.9.5.
The dataframe consists of only 360K rows and two 'long'-type columns (i.e. only ca. 3.8 MB). The script works fine if I reduce the dataframe to about 1.5 MB of memory usage (49,200 rows), but above that it fails on the df.toPandas() call with the following error message (extract):
Py4JJavaError: An error occurred while calling o234.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 50.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 50.0 (TID 577) (BEXXXXXX.subdomain.domain.com executor driver):
TaskResultLost (result lost from block manager)
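For context, the failing step is essentially the following (a minimal sketch only: the real dataframe is built from other data, and the session here is created with defaults; the configured builder I actually use is shown further below):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Roughly the shape of my data: ~360K rows, two 'long' columns
df = spark.range(0, 360000).select(F.col("id"), (F.col("id") * 2).alias("value"))

pdf = df.toPandas()   # this is the call that fails with TaskResultLost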
This is a well-known error message for PySpark running into memory limits, so I tried to adjust the settings as follows:
In the %SPARK_HOME%/conf/spark-defaults.conf file:
spark.driver.memory 4g
In the Jupyter notebook itself:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config("spark.driver.memory", "4G")\
    .config("spark.driver.maxResultSize", "4G")\
    .appName("MyApp")\
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
spark.sparkContext.setSystemProperty('spark.executor.memory', '4G')
I tried playing with the values of spark.driver.memory, spark.executor.memory, etc., but the threshold seems to stay the same.
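To verify that the values are actually picked up by the running session, they can be read back from the SparkContext (a plain SparkConf readback, nothing specific to my script):

# Read back the memory-related settings the running driver actually uses
conf = spark.sparkContext.getConf()
for key in ("spark.driver.memory", "spark.driver.maxResultSize", "spark.executor.memory"):
    print(key, "=", conf.get(key, "<not set>"))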
The Spark UI (at http://localhost:4040) shows, in the Executors tab, Storage Memory: 603 KiB / 2 GiB, Input: 4.1 GiB, Shuffle Read: 60.6 MiB, Shuffle Write: 111.3 MiB. These values are essentially the same when I reduce the dataframe size below 1.5 MB and the script runs properly.
Does anyone have an idea how to raise this roughly 1.5 MB limit, and where it might be coming from?