  • I am working with Spark 1.2.0.
  • My feature vector is about 350 dimensions
  • The data set is about 24k vectors
  • The problem described below only happens with the kmeans|| algorithm; I have switched to kmeans-random for now, but I would like to know why kmeans|| doesn't work.

When I call KMeans.train with k=100, I observe a CPU usage gap after Spark has made several collectAsMap calls. As marked in red in the image, of the 8 cores only 1 is working while the other 7 sit idle during this gap.

If I raise k to 200, the gap grows significantly.

Why does this gap occur, and how can I avoid it? My work requires me to set k=5000 on a much larger data set, and with my current settings the job never finishes...

I have tried my approach in both Windows and Linux environments (both 64-bit) and observe the same behavior.

If you want, I can provide the code and sample data.

[Image: CPU usage chart showing the gap, with 7 of 8 cores idle]

David S.

2 Answers


Have you checked the WebUI, especially the GC times? One CPU up and all others down could indicate a stop-the-world garbage collection.

You might want to try enabling the parallel GC and checking the section on GC tuning in the Spark documentation.

Other than that, collectAsMap returns the data to the master/driver, so the bigger the data gets, the longer the single driver process takes to handle it. You could also try increasing spark.driver.memory.
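For example, both suggestions can be expressed as Spark configuration settings, either in `spark-defaults.conf` or via `spark-submit --conf` (the values below are illustrative only; tune them for your cluster):

```
# spark-defaults.conf -- illustrative values, adjust for your workload
spark.executor.extraJavaOptions  -XX:+UseParallelGC
spark.driver.extraJavaOptions    -XX:+UseParallelGC
spark.driver.memory              4g
```

Note that `spark.driver.memory` cannot be set through `extraJavaOptions`; it must be set as its own property (or with `--driver-memory`) before the driver JVM starts.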

Marius Soutier
  • +1. I would like to emphasize the fact that `collectAsMap` returns the data to the master/driver. This could be the root of your *problem*, although it's an implementation issue more than an actual problem. – Mikel Urkia Mar 30 '15 at 10:54
  • `collectAsMap` is called inside the `KMeans` algorithm, I cannot control that. – David S. Mar 30 '15 at 22:40
  • I have checked the GC, the executor & driver memory. Everything looks fine. This problem only happens if I use the `kmeans||` algorithm. – David S. Mar 30 '15 at 22:46
  • As Mikel suggested, this might not be a problem, but just the way the algorithm works. – Marius Soutier Mar 31 '15 at 06:44

Please refer to SPARK-3220 for details about this issue.

In summary, it is because the default kmeans|| initialization process is not distributed; it is performed on the driver with a single thread.
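To see why this serial step dominates as k grows, here is a minimal pure-Python sketch of a k-means++-style seeding loop run on a single thread, the way the driver-side step is. This is not Spark's actual implementation, and `local_kmeans_plus_plus` is a hypothetical name; the point is only that each of the k rounds rescans all candidate points, so cost grows roughly linearly in k (and no executor cores help).

```python
import random

def local_kmeans_plus_plus(points, k, seed=0):
    """Naive k-means++ seeding, run serially (as on the Spark driver).

    Cost is O(k * n * d): each of the k rounds scans every point to
    find its distance to the nearest already-chosen center.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d2 = [min(sum((p - c) ** 2 for p, c in zip(pt, ctr))
                  for ctr in centers)
              for pt in points]
        # Sample the next center with probability proportional to d2.
        r = rng.random() * sum(d2)
        acc = 0.0
        for pt, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(pt)
                break
        else:  # floating-point edge case: fall back to the last point
            centers.append(points[-1])
    return centers
```

During this loop only one core is busy, which matches the single-active-core gap in the question; raising k lengthens the loop, which matches the gap growing with k.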

David S.