- I am working with Spark 1.2.0.
- My feature vectors have about 350 dimensions.
- The data set contains about 24k vectors.
- The problem described below only happens with the `kmeans||` algorithm; I have switched to `kmeans-random` for now, but I would like to know why `kmeans||` doesn't work.
When I call `KMeans.train` with k=100, I observe a CPU usage gap after Spark has done several `collectAsMap` calls. As marked in red in the image, of the 8 cores only 1 is working while the other 7 sit idle during this gap.
If I raise k to 200, the gap grows significantly.
Why does this gap occur, and how can I avoid it? My work requires me to set k=5000 on a much larger data set, and with my current settings the job never ends...
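To make the cost I am worried about concrete: my (unverified) understanding is that `kmeans||`, after its parallel candidate-collection rounds, finishes by reclustering the collected candidates with a serial k-means++ pass on the driver, which would occupy a single core and scale with k. A rough standalone sketch of that kind of seeding step, not Spark's actual code:

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seed(points, k, seed=0):
    """Serial k-means++ seeding over a list of candidate points.

    Each round scans all points to sample the next center with
    probability proportional to its squared distance from the
    nearest center already chosen, so the total work grows with
    k * len(points) - all on one core, which could explain a
    single-threaded gap that widens as k increases.
    """
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    # squared distance of every point to its nearest chosen center
    d2 = [dist2(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(d2)
        # sample the next center proportional to d^2
        r = rng.random() * total
        acc = 0.0
        for i, w in enumerate(d2):
            acc += w
            if acc >= r:
                centers.append(points[i])
                break
        else:
            # guard against floating-point drift in the running sum
            centers.append(points[-1])
        c = centers[-1]
        d2 = [min(d, dist2(p, c)) for d, p in zip(d2, points)]
    return centers
```

If this matches what Spark does internally, it would also explain why doubling k roughly doubles the length of the gap I see.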
I have tried this in both Windows and Linux environments (both 64-bit) and observe the same behavior.
If you want, I can provide the code and sample data.