I am trying to understand the Spark UI and the HDFS UI while using PySpark. These are the properties of the session I am running:
pyspark --master yarn --num-executors 4 --executor-memory 6G --executor-cores 3 --conf spark.dynamicAllocation.enabled=false --conf spark.executor.memoryOverhead=2G --conf spark.memory.offHeap.size=2G --conf spark.pyspark.memory=2G
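For reference, this is roughly the equivalent in-code configuration (a sketch only; the keys mirror the launch flags above, with --num-executors mapping to spark.executor.instances):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "6g")
    .config("spark.executor.cores", "3")
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.memoryOverhead", "2g")
    # Note: spark.memory.offHeap.size only takes effect when spark.memory.offHeap.enabled is true
    .config("spark.memory.offHeap.size", "2g")
    .config("spark.pyspark.memory", "2g")
    .getOrCreate()
)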
I ran a simple piece of code that reads a file (~9 GB on disk) twice, persists one copy, joins the two DataFrames, persists the result, and runs a count action.
# Read the same file twice
df_sales = spark.read.format("parquet").load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
df_sales_copy = spark.read.format("parquet").load("gs://monsoon-credittech.appspot.com/spark_datasets/sales_parquet")
# Cache one of them (MEMORY_AND_DISK: keep partitions in memory, spill to disk if they don't fit)
from pyspark import StorageLevel
df_sales = df_sales.persist(StorageLevel.MEMORY_AND_DISK)
# Join the two DataFrames and persist the result
df_merged = df_sales.join(df_sales_copy, df_sales.order_id == df_sales_copy.order_id, 'inner')
df_merged = df_merged.persist(StorageLevel.MEMORY_AND_DISK)
# Call an action to trigger the transformations
df_merged.count()
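To sanity-check what Spark actually registered for the two persists, I also print the storage levels and the UI address (a small sketch using DataFrame.storageLevel and SparkContext.uiWebUrl; the exact string printed varies by Spark version):

# Both should report a level with useDisk=True and useMemory=True (MEMORY_AND_DISK)
print(df_sales.storageLevel)
print(df_merged.storageLevel)
# Driver UI address (the Spark UI I am screenshotting)
print(spark.sparkContext.uiWebUrl)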
I expect:
- The data to be persisted in memory first, and spilled to disk only once memory fills up
- The HDFS capacity to be used at least to the extent that the persist spilled data to disk (I also try to verify this programmatically in the sketch after this list)
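One way to check this beyond the screenshots is to pull the cached-block numbers from the Spark monitoring REST API (a hedged sketch; the /storage/rdd endpoint and its memoryUsed/diskUsed fields are part of the standard REST API, but on YARN the base URL returned by uiWebUrl goes through the resource-manager proxy, so it may need adjusting):

import requests

base = spark.sparkContext.uiWebUrl          # e.g. http://<driver-host>:4040
app_id = spark.sparkContext.applicationId

# List everything currently cached and how much of it sits in memory vs. on disk (bytes)
for rdd in requests.get(f"{base}/api/v1/applications/{app_id}/storage/rdd").json():
    print(rdd["name"], rdd["storageLevel"], rdd["memoryUsed"], rdd["diskUsed"])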
Both of these expectations fail in the monitoring that follows:
Expectation 1: Failed. The data appears to be persisted on disk first and then maybe in memory; I'm not sure. The following image should help. It is definitely not landing in memory first, unless I'm missing something.
Expectation 2: Failed. The HDFS capacity is barely used (only 1.97 GB).
Can you please help me reconcile my understanding, point out where my expectation of the behaviour above goes wrong, and explain what it actually is that I am looking at in those images?