
I'm running k-means on a big data set. I set it up like this:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=500, max_iter=1, n_init=1,
            init='random', precompute_distances=False, n_jobs=-2)

# The following line computes the fit on a matrix "mat"
km.fit(mat)

My machine has 8 cores. The documentation says "for n_jobs = -2, all CPUs but one are used." I can see that there are several extra Python processes running while km.fit is executing, but only one CPU gets used.

Does this sound like a GIL issue? If so, is there any way to get all CPUs to work? (It seems like there must be ... otherwise what is the point of the n_jobs argument?)

I'm guessing I'm missing something basic and someone can either confirm my fear or get me back on track; if it's actually more involved, I'll turn to setting up a working example.

Update 1. For simplicity, I switched n_jobs to a positive value, 2. Here is what's going on with my system during execution:

[screenshot of `top` output during execution]

Actually I'm not the only user on the machine, but

free | grep Mem | awk '{print $3/$2 * 100.0}'

indicates that 88% of RAM is free (confusing to me, since the RAM usage looks like at least 27% on the screenshot above).

Update 2. I updated sklearn version to 0.15.2, and nothing changed in the top output reported above. Experimenting with different values of n_jobs similarly gives no improvement.

zkurtz
    Not a GIL issue, because `KMeans` will spawn processes, not threads. How much data are you feeding in? Do you have enough memory? Which scikit-learn version? Did you try `n_jobs=-1` or `n_jobs=2` (just to verify)? – Fred Foo Oct 16 '14 at 15:06
  • See the update. The data is about 3 gigs csv read in via pandas -> numpy, while machine RAM is 24 gigs; I can't see how memory is the issue. Current update uses `n_jobs = 2`. – zkurtz Oct 16 '14 at 15:45
  • Version: scikit-learn==0.14.1 – zkurtz Oct 16 '14 at 15:49
  • That's an old version. K-means was optimized a lot in 0.15. – Fred Foo Oct 16 '14 at 15:50
  • @larsmans the version is updated, still no luck. – zkurtz Oct 16 '14 at 20:00
  • How big is "big data set" in your case? How long does it take? – Has QUIT--Anony-Mousse Oct 18 '14 at 23:35
  • By "big" I mean big enough that I'm pretty sure the times I observe are dominated by the computations I'm interested in rather than overhead. I'm using about 10 million rows and 20 variables; this runs in about 200 seconds. – zkurtz Oct 20 '14 at 16:03

1 Answer


The parallelism for KMeans is just running multiple initializations in parallel. As you set n_init=1, there is only one initialization and nothing to parallelize over. The docstring for n_jobs seems wrong at the moment. I'm not sure what happened there.
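What that parallelism amounts to can be sketched with `joblib` directly: run several independent single-initialization fits in separate processes and keep the one with the lowest inertia. The data matrix, cluster count, and worker count below are made up for illustration:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
mat = rng.rand(2000, 20)  # stand-in for the real data matrix


def one_init(seed):
    # A single random initialization, which is exactly what n_init=1 gives.
    km = KMeans(n_clusters=10, n_init=1, init='random',
                max_iter=10, random_state=seed)
    km.fit(mat)
    return km


# Run four independent initializations in parallel processes and keep the
# one with the lowest inertia; this is the only thing n_jobs parallelized
# over in the scikit-learn versions discussed here.
runs = Parallel(n_jobs=2)(delayed(one_init)(s) for s in range(4))
best = min(runs, key=lambda km: km.inertia_)
```

With n_init=1 there is only one such run, so the worker pool has nothing to distribute.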

Andreas Mueller
    Fixed it (should be on the website soonish). – Andreas Mueller Mar 20 '15 at 15:32
  • Interesting, I had not thought of that possibility. That is disappointing, if true, because even within a single initialization, I expect that there is much to parallelize over. There are a huge number of distance computations between each of the 500 (in my case) initial centers and all the rest of the points. – zkurtz Mar 20 '15 at 18:20
  • These are all done via BLAS, and if your BLAS is multicore, they will be parallelized. – Andreas Mueller Mar 23 '15 at 13:16
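To illustrate the BLAS point: the point-to-center distance computations reduce to one large matrix product via the standard squared-distance expansion, and that product is the part BLAS executes. A minimal sketch (the shapes here are made up):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(1000, 20)   # data points
C = rng.rand(500, 20)    # cluster centers

# Squared Euclidean distances via ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2.
# The dominant cost is the matrix product X.dot(C.T), which goes through
# BLAS (and runs on multiple cores only if the BLAS build is multithreaded).
x_sq = (X ** 2).sum(axis=1)[:, np.newaxis]   # shape (1000, 1)
c_sq = (C ** 2).sum(axis=1)[np.newaxis, :]   # shape (1, 500)
d2 = x_sq - 2 * X.dot(C.T) + c_sq            # shape (1000, 500)

# Spot-check one entry against the brute-force definition.
assert np.allclose(d2[0, 0], ((X[0] - C[0]) ** 2).sum())
```

Whether `X.dot(C.T)` actually uses all cores depends on how the installed BLAS (e.g. OpenBLAS, MKL, ATLAS) was built, not on scikit-learn's n_jobs.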