
I want to cluster different probability distributions in the form of histograms. I have a dataset with >10M observations. Each observation has 5 different histograms (>100 features in total). The goal of the clustering is data reduction: building a codebook / set of prototypes with which I can represent the distributions of the initial dataset.
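
For concreteness, here is a rough sketch of what I mean by a codebook (just an illustration: the array is a placeholder and scikit-learn's MiniBatchKMeans is only one possible choice):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Placeholder data: one row per observation, the 5 normalized histograms
# concatenated into a single feature vector (>100 bins in total).
rng = np.random.default_rng(0)
histograms = rng.random((100_000, 120))
histograms /= histograms.sum(axis=1, keepdims=True)

# Learn a codebook of k prototype distributions.
k = 256
km = MiniBatchKMeans(n_clusters=k, batch_size=10_000, random_state=0)
codes = km.fit_predict(histograms)   # one prototype index per observation
codebook = km.cluster_centers_       # k prototype "histograms"

# Data reduction: each observation is now a single integer code, and its
# distribution is approximated by codebook[codes[i]].
```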

Now I am not certain what the best method for this is. My ideas so far are:

  • k-means on Spark (parallel Lloyd with k-means|| initialization)
  • a self-organizing map (SOM) on Spark

How would you rate these ideas? Are they feasible? Am I overlooking a clearly more performant or simpler solution? Any hints would be greatly appreciated!

MosbyT
  • Are the histograms normalized (sum 1) and uniform (same binning for each row)? Would it make sense to treat the 5 different histograms separately? – Has QUIT--Anony-Mousse Feb 11 '19 at 08:06
  • As long as your data fits into RAM, I'd explore alternatives to Spark that have better and faster algorithms. For learning the codebook, a sample of the rows should be just as good, e.g., 1 million rows only. – Has QUIT--Anony-Mousse Feb 11 '19 at 09:19
  • Thanks for the comments! The histograms are normalized but use two different binnings (two histograms share one binning, the other three share another). I am not sure whether it would make sense to treat them separately. They all represent different quantities, e.g. acceleration and velocity. – MosbyT Feb 11 '19 at 12:26
  • So you are proposing something like sampling the rows and then using Tensorflow on a machine with lots of RAM for training? Wouldn't a sampling method that selects data based on a similarity measure like the Jensen–Shannon divergence be more useful for my purpose of data reduction (a rough sketch is below the comments)? – MosbyT Feb 11 '19 at 12:40
  • No need to use Tensorflow. On the contrary: forget these "big data" tools. They only have the slow Lloyd algorithm. But better algorithms (which are not naively parallel, and hence not easy to port to Spark or Tensorflow) are 100x faster. – Has QUIT--Anony-Mousse Feb 12 '19 at 01:45
  • I had the impression that the k-means algorithm of Spark (parallel k-means with k-means|| initialization) is quite performant? Or at least performant enough for my dataset. I am struggling with the implementation of a performant version of SOM on Spark, though. The question is whether SOM makes no sense for a big dataset with >100 features because the algorithmic demands (number of required iterations, etc.) are too high, whether SOM makes no sense on Spark, or whether SOM makes no sense without sampling. Thanks! – MosbyT Feb 12 '19 at 13:17
  • I was never convinced by SOM at all. It assumes you already have a good similarity in the input domain. In my experience Spark k-means is pretty slow. But k-means itself is fast (in particular if you set loose tolerance limits); you may just not know how fast it could be... And there *are* result quality differences: Spark produces much larger errors than sklearn here: https://stackoverflow.com/questions/50406096/inconsistent-results-with-kmeans-between-apache-spark-and-scikit-learn – Has QUIT--Anony-Mousse Feb 12 '19 at 23:41
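
A minimal sketch of the approach discussed in the comments: learn the codebook on a sample outside Spark (with loose tolerance), assign every row to its nearest prototype, and spot-check the fit with the Jensen–Shannon divergence. The data sizes, k, and the choice of scikit-learn/SciPy are illustrative assumptions, not a prescribed setup:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import KMeans

# Placeholder for the full dataset of normalized histogram rows
# (in practice this would be the >10M-row dataset).
rng = np.random.default_rng(1)
X = rng.random((1_000_000, 120)).astype(np.float32)
X /= X.sum(axis=1, keepdims=True)

# 1) Learn the codebook on a sample only (the comments suggest ~1M rows
#    of the full data; scaled down here).
sample_idx = rng.choice(len(X), size=100_000, replace=False)
km = KMeans(n_clusters=256, n_init=1, tol=1e-3, random_state=0).fit(X[sample_idx])

# 2) Assign every row of the full dataset to its nearest prototype.
codes = km.predict(X)

# 3) Spot-check reconstruction quality: Jensen-Shannon distance between
#    a few rows and their assigned prototypes.
check = rng.choice(len(X), size=1_000, replace=False)
jsd = [jensenshannon(X[i], km.cluster_centers_[codes[i]]) for i in check]
print("mean JS distance to assigned prototype:", float(np.mean(jsd)))
```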

0 Answers