0

I have a databricks cluster on AWS, with minimum two nodes and maximum 8. Here's a picture of my cluster enter image description here

I have cached a dataframe, and under SparkUI on storage tab I see it's 6.7 GB

enter image description here

So I would expect that if I go to ganglia's UI, I would see that I am using 6.7GB of memory on the cluster (maybe a bit more or less for metadata and other stuff).

But when I open ganglia's UI, I see this

Ganglia's UI

Here is a close up of the graph on memory usage of the cluster:

Memory usage cluster

So I don't understand how the 6.7GB that spark's UI says I have cached, vs what this graph is telling me.

The graph says I have 22GB in use (from what???) and that there's a "cache" on 3.2GB which is not near the 6.7GB of my cached dataframe... So where are the 6.7GB? are they inside the 22GB? And if so, what's all the rest?

Therefore, my main question is: How does spark's cached dataframes fall into ganglia's cluster memory usage graph? In the simplest world I would have expected ganglia to say "Here are cached dataframes of 6.7GB)"

And a couple of questions more:

You can see from ganglia I have three nodes, but if you check the cluster definition image, it says currently there's only two. I have checked again to see both images are captured at the exact same time and they do. So, why three instances when the cluster says it's using 2? I have checked AWS EC2 list and there's indeed 3.

Finally, you can also see from the cluster definition image that my instances have each 30GB RAM. But on ganglia's UI it says the total memory is 74.7GB. How so if each instance is 30GB? Shouldn't it be in multiples of 30GB as per how many nodes there are in a moment? Also same question about CPUs... Ganglia on the top left corner says 52 CPUs, while also displaying it has only 3 nodes. But each node by the cluster definition image says each instance has 4 CPUs (maybe it's a virtual vs physical CPU's thing? )

So to summarize, I have four questions:

  1. How Spark's UI report on cached data (6.7GB) matches ganglia's memory report
  2. Why ganglia's says 3 nodes (and AWS also has 3 EC2s) when the cluster says it currently has 2?
  3. Why ganglia's cluster memory says 74.7GB when each instance on the cluster has 30GB?
  4. Why ganglia's dashboard says my cluster has 52 CPUs, while displaying 3 nodes and each node having 4 CPUs as per the cluster definition image.
  • Cache on the last graph is amount of memory used for cached data (like file cache, etc.) - it isn't related to Spark cache directly – Alex Ott Jul 05 '23 at 09:03
  • So it's just like a overall system cache, right? – Eugenio.Gastelum96 Jul 05 '23 at 18:21
  • 1
    on this graph, yes - it's a system cache – Alex Ott Jul 05 '23 at 18:35
  • So to add answers, on the question about why I saw 3 nodes on Ganglia despite the cluster says 2 workers. That's because clusters also have a driver node, additional to the worker nodes. In databricks I thought the number of workers contained the driver but it does not. So I have 2 workers + 1 driver = 3 nodes – Eugenio.Gastelum96 Jul 16 '23 at 03:41

0 Answers0