
Context

I have a PySpark query that creates a rather large DAG. Thus, I break the lineage using checkpoint(eager=True) to shrink it, which normally works. Note: I do not use localCheckpoint() since I use dynamic resource allocation (see the docs for reference on this).
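For reference, this is the distinction as I understand it from the docs; only the reliable variant survives executors being released by dynamic allocation (the path and DataFrame name below are placeholders):

# Assumes an existing SparkSession `spark` and some DataFrame `some_df`.
# Reliable checkpoint: data is written to the configured checkpoint directory
# (HDFS), so it survives executors being released by dynamic allocation.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  # placeholder path
df_reliable = some_df.checkpoint(eager=True)

# Local checkpoint: blocks stay on the executors that computed them, so it is
# not safe under dynamic allocation - which is why I avoid it here.
df_local = some_df.localCheckpoint(eager=True)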

# --> Pseudo-code! <--
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Collect distributed data sources which results in touching a lot of files
# -> Large DAG
df1 = spark.sql("Select some data")
df2 = spark.sql("Select some other data")
df3 ...

# Bring these DataFrames together; the checkpoint below breaks the lineage
# and shortens the DAG. Note: with eager=True it is executed right away.
intermediate_results = df1.union(df2).union(df3)  # ...

sc.setCheckpointDir("hdfs:/...")
checkpointed_df = intermediate_results.checkpoint(eager=True)

# Now continue to do stuff
df_X = spark.sql("...")

result = checkpointed_df.join(df_X, ...)

Problem

I start the Spark session in client mode (an admin requirement) in a Docker container in a Kubernetes cluster (more precisely, a third-party product set up by the admins manages this).

When I execute my code and intermediate_results.checkpoint(eager=True) runs, two things happen:

  1. I receive a PySpark error about losing the connection to the JVM and a resulting call error:

py4j.protocol.Py4JNetworkError: Answer from Java side is empty ... Py4JError: An error occurred while calling o1540.checkpoint

This is of course a heavily shortened stack trace.

  2. The software controlling the Docker container states:

Engine exhausted available memory, consider a larger engine size.

This refers to the container's memory limit being exceeded.
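For completeness: since the session runs in client mode, the driver JVM lives inside this very container, so whatever driver memory is configured has to fit under the container's limit. A minimal sketch of the knobs involved (the value below is a placeholder, not my actual configuration):

from pyspark.sql import SparkSession

# In client mode the driver JVM is already running when the builder executes,
# so spark.driver.memory has to be set before startup (e.g. via --driver-memory
# or spark-defaults.conf). spark.driver.maxResultSize only caps data that is
# explicitly collected to the driver.
spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "2g")  # placeholder value
    .getOrCreate()
)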

Question

The only way I can explain the Docker container's memory limit being exceeded is that checkpoint() actually passes data through the driver at some point. Otherwise, there is no action that would deliberately collect anything to the driver. However, I didn't find anything about this in the docs.
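For comparison, a sketch of the write-and-read-back alternative for breaking the lineage, which to my understanding keeps all I/O on the executors (the path below is a placeholder; intermediate_results is the DataFrame from the pseudo-code above):

# Alternative lineage break: materialize to HDFS and read the result back.
# Writing and reading Parquet happens on the executors; nothing is collected
# to the driver.
tmp_path = "hdfs:///tmp/intermediate_results.parquet"  # placeholder path
intermediate_results.write.mode("overwrite").parquet(tmp_path)
checkpointed_df = spark.read.parquet(tmp_path)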

  • Does checkpoint() actually consume memory on the driver when executed?
  • Did anybody encounter similar behaviour and can point out that it stems from something else?
Markus
  • could the memory being talked about actually be the disk space (considering you're using Kubernetes)? – samkart Aug 10 '22 at 13:18
  • No - disk space is not consumed / does not change. The error from the container is definitely about memory (it's the Cloudera Data Science Workbench). – Markus Aug 10 '22 at 16:13

0 Answers