  • I am working with Spark 1.2.0.
  • My feature vector is about 350 dimensions
  • The data set is about 24k vectors
  • The problem described below only happens with the kmeans|| algorithm; I have switched to kmeans-random for now, but I would like to know why kmeans|| doesn't work.

When I call KMeans.train with k=100, I observe a CPU usage gap after Spark has made several collectAsMap calls. As marked in red in the image, of the 8 cores only 1 is working while the other 7 sit idle during this gap.

If I raise k to 200, the gap grows significantly.

Why does this gap occur, and how can I avoid it? My work requires me to set k=5000 on a much larger data set, and with my current settings the job never finishes...

I have tried my approach in both Windows and Linux environments (both 64-bit) and observe the same behavior.

If you want, I can provide the code and sample data.

[Image: CPU usage chart showing the gap, with 7 of 8 cores idle]

David S.

2 Answers


Have you checked the WebUI, especially the GC times? One CPU up and all others down could indicate a stop-the-world garbage collection.

You might want to try enabling the parallel GC and checking the section on GC tuning in the Spark documentation.

Other than that, collectAsMap returns the data to the master/driver, so the bigger the data gets, the longer the single driver process takes to handle it. You could also try increasing spark.driver.memory.
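For example, both suggestions can be expressed as Spark configuration settings, either in `spark-defaults.conf` or via `spark-submit --conf` (the values below are illustrative only; tune them for your cluster):

```
# spark-defaults.conf -- illustrative values, adjust for your workload
spark.executor.extraJavaOptions  -XX:+UseParallelGC
spark.driver.extraJavaOptions    -XX:+UseParallelGC
spark.driver.memory              4g
```

Note that `spark.driver.memory` cannot be set through `extraJavaOptions`; it must be set as its own property (or with `--driver-memory`) before the driver JVM starts.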

Marius Soutier
  • +1. I would like to emphasize the fact that `collectAsMap` returns the data to the master/driver. This could be the root of your *problem*, although it's an implementation issue more than an actual problem. – Mikel Urkia Mar 30 '15 at 10:54
  • `collectAsMap` is called inside the `KMeans` algorithm, I cannot control that. – David S. Mar 30 '15 at 22:40
  • I have checked the GC, the executor & driver memory. Everything looks fine. This problem only happens if I use the `kmeans||` algorithm. – David S. Mar 30 '15 at 22:46
  • As Mikel suggested, this might not be a problem, but just the way the algorithm works. – Marius Soutier Mar 31 '15 at 06:44

Please refer to SPARK-3220 for details about this issue.

In summary, it is because the default kmeans|| initialization process is not distributed; it is performed on the driver with a single thread.
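To see why this serial step dominates as k grows, here is a minimal pure-Python sketch of a k-means++-style seeding loop run on a single thread, the way the driver-side step is. This is not Spark's actual implementation, and `local_kmeans_plus_plus` is a hypothetical name; the point is only that each of the k rounds rescans all candidate points, so cost grows roughly linearly in k (and no executor cores help).

```python
import random

def local_kmeans_plus_plus(points, k, seed=0):
    """Naive k-means++ seeding, run serially (as on the Spark driver).

    Cost is O(k * n * d): each of the k rounds scans every point to
    find its distance to the nearest already-chosen center.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d2 = [min(sum((p - c) ** 2 for p, c in zip(pt, ctr))
                  for ctr in centers)
              for pt in points]
        # Sample the next center with probability proportional to d2.
        r = rng.random() * sum(d2)
        acc = 0.0
        for pt, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(pt)
                break
        else:  # floating-point edge case: fall back to the last point
            centers.append(points[-1])
    return centers
```

During this loop only one core is busy, which matches the single-active-core gap in the question; raising k lengthens the loop, which matches the gap growing with k.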

David S.