
I ran k-means on a very large data set with millions of rows, where each row is a 48-dimensional vector. With k = 3, the data are clustered into three classes, each with a 48-dimensional cluster center vector. I plotted the three cluster centers as a parallel coordinates plot, and the three lines appear to be well separated. However, I also want to show each cluster's extent (i.e. an upper and a lower band, or "error band"). How should I compute the upper and lower bands around each cluster center?
Because each cluster contains close to millions of vectors, it is difficult to plot them all as a background and draw the cluster centers on top.
Thanks a lot.

Xi Wang

1 Answer


Well, you can certainly afford to also plot, on each axis:

  • the minimum and maximum
  • the upper and lower quartiles (a million values fit into RAM easily, and can be sorted)
  • the standard deviation
  • the standard error of the mean

Make sure you understand the statistical meaning of each of these pairings.

With the minimum and maximum you'd expect bands to overlap, unless there is a dominating feature. The standard error of the mean is likely too tight to be useful (it indicates how much the mean is expected to change if you add a data point, so any cluster difference in this range is entirely random, but the clusters aren't independent).
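As a rough sketch of how the per-axis statistics listed above could be computed, here is a NumPy version. The function name `cluster_bands`, the array names `X` and `labels`, and the synthetic data are all assumptions made for illustration; in practice `X` would be your data matrix and `labels` the cluster assignments from your k-means run. It also assumes every cluster has at least two members.

```python
import numpy as np

def cluster_bands(X, labels, k):
    """Per-cluster, per-dimension summary statistics for drawing bands.

    X      : (n_rows, n_dims) data matrix
    labels : (n_rows,) integer cluster assignment in 0..k-1
    """
    bands = {}
    for c in range(k):
        members = X[labels == c]              # rows assigned to cluster c
        n = members.shape[0]
        std = members.std(axis=0, ddof=1)     # sample standard deviation
        # Lower and upper quartile of every dimension in one call
        q1, q3 = np.percentile(members, [25, 75], axis=0)
        bands[c] = {
            "mean": members.mean(axis=0),
            "min": members.min(axis=0),
            "max": members.max(axis=0),
            "q1": q1,
            "q3": q3,
            "std": std,
            "sem": std / np.sqrt(n),          # standard error of the mean
        }
    return bands

# Synthetic stand-in data (hypothetical): 3 groups, 48 dimensions.
# These labels are random assignments, not the output of k-means.
rng = np.random.default_rng(0)
offsets = np.array([-5.0, 0.0, 5.0])
labels = rng.integers(0, 3, size=3000)
X = offsets[labels][:, None] + rng.normal(size=(3000, 48))

bands = cluster_bands(X, labels, k=3)
```

Each entry is a 48-element vector, so a band can be drawn per cluster by shading between a lower and an upper vector (for example `q1` and `q3`, or `mean - std` and `mean + std`) along the 48 axes, for instance with matplotlib's `fill_between`. Note how `sem` shrinks with the square root of the cluster size, which is why it ends up very narrow for clusters this large.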

Has QUIT--Anony-Mousse