I am trying to cluster (AgglomerativeClustering, KMeans) a very large dataset of the following type:
[0, 0, 0, 0, 1, 2, 2, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5]
That is, a sample of integers that repeat multiple times.
In short, I would like to pre-process the sample by transforming it into a much shorter list of counts:
[(0, 4), (1, 1), (2, 4), (3, 1), (4, 3), (5, 9)]
and then use this as input for the clustering.
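For concreteness, this is how I build that list of counts (a minimal sketch using numpy.unique; the variable names are mine):

```python
import numpy as np

data = np.array([0, 0, 0, 0, 1, 2, 2, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5])

# Collapse the raw sample into unique values and their frequencies.
values, counts = np.unique(data, return_counts=True)
pairs = list(zip(values.tolist(), counts.tolist()))
print(pairs)  # [(0, 4), (1, 1), (2, 4), (3, 1), (4, 3), (5, 9)]
```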
Q: Do you know how to use such a list-of-counts as input for clustering?
My main motivation for doing this is that both sklearn.cluster.KMeans and sklearn.cluster.AgglomerativeClustering throw an exception once the length of the input array grows beyond 50000, and my dataset is millions of values long.
I have a data-compression stage up and running in which I:
- sort
- group in equally-sized chunks
- calculate average per chunk
and then use the list of averages as input for clustering (a sketch follows this list). This works. However, the resulting clusters depend on the chunk size, and I've found that choice hard to defend.
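Roughly, that stage looks like this (the function name and the chunk_size value are just illustrative):

```python
import numpy as np

def compress_by_chunks(sample, chunk_size):
    """Sort, split into equally-sized chunks, and average each chunk."""
    sample = np.sort(np.asarray(sample))
    n_chunks = len(sample) // chunk_size
    # Drop the tail so the array splits evenly, then take the mean per chunk.
    return sample[:n_chunks * chunk_size].reshape(n_chunks, chunk_size).mean(axis=1)

rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=1_000_000)  # stand-in for my real sample
averages = compress_by_chunks(data, chunk_size=1000)
# averages.reshape(-1, 1) then goes into KMeans / AgglomerativeClustering.
```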
I've also tried using each value's frequency as a weight, which sklearn.cluster.KMeans appears to allow via the sample_weight argument to fit. However, I'm really just guessing at what those weights are used for.
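Concretely, this is what I'm doing (n_clusters=2 is arbitrary; the comment states my guess about the semantics):

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.array([0, 1, 2, 3, 4, 5]).reshape(-1, 1)
counts = np.array([4, 1, 4, 1, 3, 9])

# My assumption: an integer sample_weight of k behaves like repeating
# the corresponding value k times in the input -- but I'm not sure.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(values, sample_weight=counts)
print(km.cluster_centers_)
```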
Thanks