Context
I have a PySpark query that creates a rather large DAG. To shrink it, I break the lineage using checkpoint(eager=True), which normally works.
Note: I do not use localCheckpoint() since I use dynamic resource allocation (see the docs for reference about this).
# --> Pseudo-code! <--
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# Collect distributed data sources, which touches a lot of files
# -> Large DAG
df1 = spark.sql("Select some data")
df2 = spark.sql("Select some other data")
df3 ...
# Bring these DataFrames together
intermediate_results = df1.union(df2).union(df3)....
# Checkpoint to break the lineage and shorten the DAG
# Note: In "eager" mode this is executed right away
sc.setCheckpointDir("hdfs:/...")
checkpointed_df = intermediate_results.checkpoint(eager=True)
# Now continue to do stuff
df_X = spark.sql("...")
result = checkpointed_df.join(df_X ...)
Problem
I start the Spark session in client mode (an admin requirement) inside a Docker container in a Kubernetes cluster (more precisely, a third-party product set up by the admins manages this).
When I execute my code and reach intermediate_results.checkpoint(eager=True), two things happen:
- I receive a PySpark error about losing the connection to the JVM, followed by a call error:
py4j.protocol.Py4JNetworkError: Answer from Java side is empty ... Py4JError: An error occurred while calling o1540.checkpoint
This is, of course, a heavily shortened stack trace.
- The software controlling the Docker container reports:
Engine exhausted available memory, consider a larger engine size.
This refers to an exceeded memory-limit of the container.
Question
The only explanation I can think of for the Docker container's memory limit being exceeded is that checkpoint() actually passes data through the driver at some point. Otherwise, I have no action that would deliberately collect anything to the driver. However, I didn't read anything about this in the docs.
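One way to check this hypothesis would be to log the driver's JVM heap right before and after the checkpoint call. The following is only a minimal sketch: the helper names driver_heap_mib and log_heap are made up for illustration, and _jvm is PySpark's internal Py4J handle rather than a public API. It also only sees the JVM heap, while the container limit additionally covers the Python process and any off-heap memory, so it can narrow the problem down but not prove the cause.

```python
def driver_heap_mib(spark):
    """Return (used, committed, max) heap of the driver JVM in MiB.

    Assumes `spark` is an active SparkSession. `_jvm` is an internal
    PySpark attribute (the Py4J view of the driver JVM), not public API.
    """
    runtime = spark.sparkContext._jvm.java.lang.Runtime.getRuntime()
    mib = 1024 * 1024
    total = runtime.totalMemory()  # heap currently committed by the JVM
    free = runtime.freeMemory()    # unused part of the committed heap
    return (total - free) / mib, total / mib, runtime.maxMemory() / mib


def log_heap(spark, label):
    """Print and return a one-line summary of the driver heap."""
    used, committed, maximum = driver_heap_mib(spark)
    line = (
        f"[{label}] driver heap: used={used:.0f} MiB, "
        f"committed={committed:.0f} MiB, max={maximum:.0f} MiB"
    )
    print(line)
    return line


# Hypothetical usage around the failing call from the pseudo-code:
# log_heap(spark, "before checkpoint")
# checkpointed_df = intermediate_results.checkpoint(eager=True)
# log_heap(spark, "after checkpoint")
```

If the JVM dies during the checkpoint, the second log line never appears, so running this on a smaller subset of the inputs first is probably more informative.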
- Does checkpoint() actually consume memory in the driver when executed?
- Did anybody encounter similar error behaviour and can point out whether it stems from something else?