Generating high dimensional datasets with Scikit-Learn

Question

I am working with the Mean Shift clustering algorithm, which is based on the kernel density estimate of a dataset. I would like to generate a large, high-dimensional dataset, and I thought the Scikit-Learn function make_blobs would be suitable. But when I try to generate a 1-million-point, 8-dimensional dataset, I end up with almost every point being treated as a separate cluster.

I am generating the blobs with standard deviation 1, and then setting the bandwidth for the Mean Shift to the same value (I think this makes sense, right?). For two-dimensional datasets this produced fine results, but for higher dimensions I think I'm running into the curse of dimensionality, in that the distances between points become too large for meaningful clustering.

Does anyone have any tips/tricks on how to get a good high-dimensional dataset that is suitable for (something like) Mean Shift clustering? Or am I doing something wrong (which is of course a good possibility)?

score 1 · Accepted Answer · answered Mar 19 '15 at 16:04

The standard deviation of the clusters isn't 1.

You have 8 dimensions, each of which has a stddev of 1, so the total standard deviation (the root-mean-square distance of a point from its cluster center) is sqrt(8) ≈ 2.83, not 1.
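The sqrt(8) figure can be checked numerically. The following is a minimal sketch, not the make_blobs setup from the question: it assumes a single isotropic Gaussian blob centered at the origin and uses only the Python standard library. The helper name `rms_distance_from_center` and the sample size are illustrative choices.

```python
import math
import random


def rms_distance_from_center(n_points, n_dims, stddev, seed=0):
    """Root-mean-square distance from the center of an isotropic Gaussian blob.

    Assumption: the blob is centered at the origin and every dimension has the
    same standard deviation, mirroring a single make_blobs cluster.
    """
    rng = random.Random(seed)
    total_sq = 0.0
    for _ in range(n_points):
        # Squared Euclidean distance of one sampled point from the center.
        total_sq += sum(rng.gauss(0.0, stddev) ** 2 for _ in range(n_dims))
    return math.sqrt(total_sq / n_points)


rms_2d = rms_distance_from_center(20000, 2, 1.0)
rms_8d = rms_distance_from_center(20000, 8, 1.0)

# The empirical values should sit close to sqrt(2) and sqrt(8) respectively.
print("2 dims:", rms_2d, "expected about", math.sqrt(2))
print("8 dims:", rms_8d, "expected about", math.sqrt(8))
```

With a per-dimension stddev of 1, the typical point in 8 dimensions lies almost three units from its center, so a bandwidth of 1 covers very few neighbours.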

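Kernel density estimation does not work well on high-dimensional data because of bandwidth problems: it becomes hard to find a bandwidth that is neither so small that almost every point is its own mode nor so large that everything merges.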

Has QUIT--Anony-Mousse

Yes, I figured it was something like this. But particularly because I know the standard deviation, I thought it would be OK as a bandwidth estimate. So how would you compute the total standard deviation? I don't quite see how you get sqrt(8) – danielvdende Mar 19 '15 at 16:09
But due to the curse of dimensionality, it may still fail to work. I read somewhere that KDE tends to break down already at 6 dimensions or so. – Has QUIT--Anony-Mousse Mar 19 '15 at 16:19
Hmm, OK, well I'll see if that happens. Thanks :). One more thing: my intuition was that the bandwidth I should use would be something like the square root of the sum of squares of all the 1D standard deviations. So, if I chose stddev 2, it would mean a total stddev of sqrt(8*2^2). Does that make sense? – danielvdende Mar 19 '15 at 16:24
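
The arithmetic in the last comment can be written out as a short derivation. This is a sketch assuming an isotropic Gaussian cluster with center \mu, the same standard deviation \sigma in each of d independent dimensions (the symbols are introduced here for illustration); it only shows where sqrt(8*2^2) comes from, not whether that value is a good Mean Shift bandwidth.

```latex
\[
\mathbb{E}\,\lVert x - \mu \rVert^{2}
  = \sum_{i=1}^{d} \mathbb{E}\,(x_i - \mu_i)^{2}
  = d\,\sigma^{2}
\quad\Longrightarrow\quad
\sqrt{\mathbb{E}\,\lVert x - \mu \rVert^{2}} = \sigma\sqrt{d}
\]
\[
d = 8,\ \sigma = 2:\qquad \sqrt{8 \cdot 2^{2}} = \sqrt{32} \approx 5.66
\]
```

For d = 8 and \sigma = 1 the same formula gives sqrt(8) ≈ 2.83, the value quoted in the answer.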
