I am trying to cluster (AgglomerativeClustering, KMeans) a very large dataset of the following type:
[0, 0, 0, 0, 1, 2, 2, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5]
That is, a sample of integers that repeat multiple times.
In short, I would like to pre-process the sample by transforming it into a much shorter list of counts:
[(0, 4), (1, 1), (2, 4), (3, 1), (4, 3), (5, 9)]
and then use this as input for the clustering.
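For concreteness, this is how I build that list of counts (a minimal sketch using numpy.unique; the variable names are mine):

```python
import numpy as np

data = np.array([0, 0, 0, 0, 1, 2, 2, 2, 2, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5])

# Collapse the raw sample into unique values and their frequencies.
values, counts = np.unique(data, return_counts=True)
pairs = list(zip(values.tolist(), counts.tolist()))
print(pairs)  # [(0, 4), (1, 1), (2, 4), (3, 1), (4, 3), (5, 9)]
```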
Q: Do you know how to use such a list-of-counts as input for clustering?
My main motivation for doing this is that both sklearn.cluster.KMeans and sklearn.cluster.AgglomerativeClustering throw an exception once the length of the input array grows beyond 50000, and my dataset is millions of values long.
I have a data-compression stage up and running in which I:
- sort
- group in equally-sized chunks
- calculate average per chunk
and then use the list of averages as input for clustering (a sketch follows this list). This works. However, the resulting clusters depend on the chunk size, and I've found that choice hard to defend.
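Roughly, that stage looks like this (the function name and the chunk_size value are just illustrative):

```python
import numpy as np

def compress_by_chunks(sample, chunk_size):
    """Sort, split into equally-sized chunks, and average each chunk."""
    sample = np.sort(np.asarray(sample))
    n_chunks = len(sample) // chunk_size
    # Drop the tail so the array splits evenly, then take the mean per chunk.
    return sample[:n_chunks * chunk_size].reshape(n_chunks, chunk_size).mean(axis=1)

rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=1_000_000)  # stand-in for my real sample
averages = compress_by_chunks(data, chunk_size=1000)
# averages.reshape(-1, 1) then goes into KMeans / AgglomerativeClustering.
```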
I've also tried using each value's frequency as a weight, which sklearn.cluster.KMeans appears to allow via the sample_weight argument to fit. However, I'm really just guessing at what those weights are used for.
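Concretely, this is what I'm doing (n_clusters=2 is arbitrary; the comment states my guess about the semantics):

```python
import numpy as np
from sklearn.cluster import KMeans

values = np.array([0, 1, 2, 3, 4, 5]).reshape(-1, 1)
counts = np.array([4, 1, 4, 1, 3, 9])

# My assumption: an integer sample_weight of k behaves like repeating
# the corresponding value k times in the input -- but I'm not sure.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(values, sample_weight=counts)
print(km.cluster_centers_)
```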
Thanks