I want to cluster probability distributions represented as histograms. My dataset has more than 10 million observations, and each observation consists of 5 different histograms (more than 100 features in total). The goal of the clustering is data reduction: building a codebook of prototypes with which I can represent the distributions of the original dataset.
I am not certain which method is best for this. My ideas are:
- Use the standard k-means algorithm from Spark ML with Euclidean distance (see the first sketch below).
- Implement k-means on Spark with a different distance measure (e.g. Kullback-Leibler or Jensen-Shannon divergence), for example via https://github.com/derrickburns/generalized-kmeans-clustering or http://www.scalaformachinelearning.com/2015/12/kullback-leibler-divergence-on-apache.html (the second sketch below shows the divergence itself).
- Implement a SOM (self-organizing map) on Spark to cluster the distributions with custom distance functions. I am not sure whether this is feasible for a dataset this large: can a custom algorithm that works incrementally but has to merge partial results in each step be implemented efficiently on Spark? (The third sketch below outlines this accumulate-and-merge pattern.)
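
For the first idea, here is a minimal sketch using Spark ML's `KMeans` in Scala. The codebook size `k = 1024` and the assumption that the five histograms of an observation are concatenated into a single `features` vector column are placeholders of mine, not part of my actual setup:

```scala
import org.apache.spark.ml.clustering.KMeans

// df: DataFrame with a Vector column "features" holding the concatenated,
// normalized histogram bins of one observation
val kmeans = new KMeans()
  .setK(1024)            // codebook size -- placeholder, tune to the reduction needed
  .setMaxIter(30)
  .setSeed(42L)
  .setFeaturesCol("features")
  .setPredictionCol("codeword")

val model = kmeans.fit(df)
val coded = model.transform(df)          // each row now carries its codebook index
val codebook = model.clusterCenters      // the prototypes themselves
```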
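
For the second idea, independent of which library ends up doing the clustering, the distance measure itself is simple. Below is a sketch of the Jensen-Shannon divergence between two histograms in plain Scala (assuming equal-length, already-normalized bins); this is not the API of the linked generalized-kmeans-clustering library, just the measure:

```scala
// Kullback-Leibler divergence KL(p || q); assumes p and q are normalized
// histograms of equal length (zero bins are skipped to avoid log(0))
def kl(p: Array[Double], q: Array[Double]): Double =
  p.zip(q).collect { case (pi, qi) if pi > 0 && qi > 0 => pi * math.log(pi / qi) }.sum

// Jensen-Shannon divergence: symmetric and bounded, unlike plain KL
def jensenShannon(p: Array[Double], q: Array[Double]): Double = {
  val m = p.zip(q).map { case (pi, qi) => 0.5 * (pi + qi) }
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}
```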
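
For the third idea, the accumulate-and-merge pattern I have in mind maps onto Spark's aggregation primitives. Here is a rough sketch of one batch-SOM iteration, not a finished implementation; the neighbourhood kernel `neigh` (e.g. a Gaussian over the grid distance between a node and the best matching unit) and the codebook layout are assumptions:

```scala
import org.apache.spark.rdd.RDD

// One batch-SOM iteration: every partition accumulates partial numerators and
// denominators per node, Spark merges them, the driver updates the codebook.
def somStep(
    data: RDD[Array[Double]],
    codebook: Array[Array[Double]],     // one weight vector per SOM node
    neigh: (Int, Int) => Double         // neighbourhood weight of node j w.r.t. the BMU
): Array[Array[Double]] = {
  val k = codebook.length
  val dim = codebook.head.length
  val bc = data.sparkContext.broadcast(codebook)

  def sqDist(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0; var i = 0
    while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
    s
  }

  // Accumulator: per-node weighted sums and total neighbourhood weights
  val zero = (Array.fill(k, dim)(0.0), Array.fill(k)(0.0))
  val (sums, wts) = data.treeAggregate(zero)(
    { case ((s, w), x) =>                       // per-record update within a partition
      val cb = bc.value
      val bmu = cb.indices.minBy(j => sqDist(x, cb(j)))   // best matching unit
      var j = 0
      while (j < k) {
        val h = neigh(j, bmu)
        var i = 0
        while (i < dim) { s(j)(i) += h * x(i); i += 1 }
        w(j) += h; j += 1
      }
      (s, w)
    },
    { case ((s1, w1), (s2, w2)) =>              // merge partial results from partitions
      for (j <- 0 until k) {
        for (i <- 0 until dim) s1(j)(i) += s2(j)(i)
        w1(j) += w2(j)
      }
      (s1, w1)
    }
  )

  // Merged update: each node moves to the weighted mean of its neighbourhood
  codebook.indices.map(j =>
    if (wts(j) > 0) sums(j).map(_ / wts(j)) else codebook(j)
  ).toArray
}
```

Each iteration is a single pass over the data, and only the merged k × dim sums travel back to the driver, which is the part I hope makes this workable for 10M+ rows.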
How would you rate these ideas? Are they feasible? Am I overlooking a clearly faster or simpler solution? Any hints would be greatly appreciated!