Yarn CPU usage and the result of htop on workers are incosistent. I am running a SPARK cluster on Dataproc

Question

I am on Dataproc managed spark cluster

OS = Ubuntu 18.04
Spark version = 3.3.0

My cluster configuration is as follows:

Master
- Memory = 7.5 GiB
- Cores = 2
- Primary disk size = 32 GB
Workers
- Cores = 16
- Ram = 16 GiB
- Available to Yarn = 13536 MiB
- Primary disk size = 32 GB

Necessary imports:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

I start the SparkSession with (Notice the change to the maxPartitionBytes):

spark = SparkSession.builder.\
config("spark.executor.cores","15").\
config("spark.executor.instances","2").\
config("spark.executor.memory","12100m").\
config("spark.dynamicAllocation.enabled", False).\
config("spark.sql.adaptive.enabled", False).\
config("spark.sql.files.maxPartitionBytes","10g").\
getOrCreate()

I have a csv file that takes up ~40GiB on the disk.

I read it in and cache with the following:

df_covid = spark.read.csv("gs://xxxxxxxx.appspot.com/spark_datasets/covid60g.csv",
                          header=True, inferSchema=False)
df_covid.cache()
df_covid.count()
df_covid.rdd.getNumPartitions()
#output: 30

The following is my storage tab post that:

10.3GiB deserialized in memory and 3.9 Serialized on disk

Now, I want to check the CPU usage from my YARN UI and compare it with my htop results on individual workers. The issue is:

Dataproc YARN UI has min_alignment_period of 1 min. The datapoints for each minute are combined into a single point and presented. Hence I ensure to create a relatively heavy sequence of transformations that run for more than a minute per partition. This removes other workloads that might consume time (like loading data from storage to execution memory)

I use the following transformations:

@udf(returnType=StringType())
def f1(x):
    out = ''
    for i in x:
        out += chr(ord(i)+1)
    return out

@udf(returnType=StringType())
def f2(x):
    out = ''
    for i in x:
        out += chr(ord(i)-1)
    return out

df_covid = df_covid.withColumn("_catted", F.concat_ws('',*df_covid.columns))

for i in range(10):
    df_covid = df_covid.withColumn("_catted", f1(F.col("_catted")))
    df_covid = df_covid.withColumn("_catted", f2(F.col("_catted")))
df_covid = df_covid.withColumn("esize1", F.length(F.split("_catted", "e").getItem(1)))
df_covid = df_covid.withColumn("asize1", F.length(F.split("_catted", "a").getItem(1)))
df_covid = df_covid.withColumn("isize1", F.length(F.split("_catted", "i").getItem(1)))
df_covid = df_covid.withColumn("nsize1", F.length(F.split("_catted", "n").getItem(1)))
df_covid = df_covid.filter((df_covid.esize1 > 5) & (df_covid.asize1 > 5) & (df_covid.isize1 > 5) & (df_covid.nsize1 > 5))

Now I call an action to start the computations:

df_covid.count()

I monitor htop on my two worker nodes. After a minute of calling the action both the htops show all the cores being fully utilized and they remain fully utilized for about 3-4 minutes

As you can see from the load average from the images my cores are going full-tilt and the 16 cores are getting utilized completely. You can also check from the uptime on the screenshots that the cores are fully utilized for well over 2 minutes. Actually, they get utilized for about 3+ minutes

My issue is that the CPU utilization from the yarn metrics usage on dataproc monitoring doesn't concur. The following are the CPU utilization charts from the same time:

which shows a maximum CPU usage of ~70%.

What is the reason for the discrepancy between the YARN monitoring and htop. I have seen CPU utilization from yarn going 90%+ for other people. A quick google search would show the same as well. How is that achieved?

That's a very interesting question! One thing I'm noticing in your `htop` screens is that if you look at the processes (bottom part of your screen), there seem to be 2 types of processes that we see: processes that have been running for >5h (these are using an average of around 90% and there are 15 of them, the `/opt/conda/miniconda3/...` ones) and processes that seem to be using the remainder (the `/usr/lib/jvm/...` ones). Could you have a look at which exact processes those 2 are? This might give a hint into understanding the CPU usage you're seeing. — Koedlt, Jun 13 '23 at 05:16
I don't think that means hours. Those are measured in hundredth of a second. So, those processes have been running for > 5 minutes — figs_and_nuts, Jun 13 '23 at 07:11
I am not sure what the first process that is sleeping and is still consuming 200% CPU on each executor is. The rest are workers and are in running state. Currently, I am trying to use a cluster with 12 nodes to see if the usage goes up. It could be some fixed cost (application master?) that would drop as a percentage upon scaling the cluster up — figs_and_nuts, Jun 13 '23 at 07:23
This was it. My CPU usage is 93.5% now :). Thank you sooooo much for your valuable guidance. You have always helped me with all my spark-related questions here on stackoverflow, dear friend :). May your loved ones live a 100 years — figs_and_nuts, Jun 13 '23 at 08:11
Ahhhh great, always happy to be of help!! I'm also grateful for you questions, as I learn through them as well :) — Koedlt, Jun 13 '23 at 09:10

score 0 · Accepted Answer · answered Jun 13 '23 at 08:14

0

Spark fixed costs are a significant proportion of the tiny cluster that I was running my queries on. The CPU usage is 93.5% upon scaling up the cluster size to 12 worker nodes of the same configuration

answered Jun 13 '23 at 08:14

figs_and_nuts

4,870
2
31
56

Yarn CPU usage and the result of htop on workers are incosistent. I am running a SPARK cluster on Dataproc

1 Answers1