
I have some geographical points defined by latitude, longitude and a score, and I want to use the MLlib K-Means algorithm to cluster them. Is that possible with MLlib K-Means, and if so, how can I pass the parameters or features to the algorithm? As far as I can tell, it reads a text file of doubles and builds clusters from that.

Ahmed El-Gamal

2 Answers


Do not use k-means on latitude/longitude data

Because of distortion. The Earth is a sphere, and -180° and +180° longitude do not lie 360° apart: they are the same meridian. But even if you are well away from the date line, e.g. all your data is in San Francisco at latitude ~37.773972, you have a distortion of over 20%: a degree of longitude there spans only cos(37.77°) ≈ 0.79 of the distance covered by a degree of latitude, and this gets worse the further north you go.

Use an algorithm such as HAC or DBSCAN that can be used (in a good implementation; there are many bad implementations) with Haversine distance. For example, ELKI has very fast clustering algorithms, allows different geo-distances, and supports index acceleration, which helps a lot with geo points.
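For reference, the Haversine formula gives the great-circle distance on a sphere. A minimal Scala sketch (the function name and the 6371 km mean Earth radius are my own choices, not from any particular library):

import scala.math._

// Great-circle (Haversine) distance in kilometers between two points
// given as (latitude, longitude) in decimal degrees.
def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val dLat = toRadians(lat2 - lat1)
  val dLon = toRadians(lon2 - lon1)
  val a = pow(sin(dLat / 2), 2) +
          cos(toRadians(lat1)) * cos(toRadians(lat2)) * pow(sin(dLon / 2), 2)
  2 * 6371.0 * asin(sqrt(a)) // 6371 km is the commonly used mean Earth radius
}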

See also this blog post: https://doublebyteblog.wordpress.com/2014/05/16/clustering-geospatial-data/

Has QUIT--Anony-Mousse

If you still need to use K-Means from MLlib, the official documentation is at https://spark.apache.org/docs/latest/ml-clustering.html#k-means

  1. Build a dataframe containing a column to hold the features. Let's say the column is named "f", for features. The dataframe can contain other columns too; they won't be touched.
  2. This feature column is of type Vector. You can create a sparse or dense vector following the examples at https://spark.apache.org/docs/latest/mllib-data-types.html (a short sketch follows this list).
  3. If you have words, you can turn them into vectors with Word2Vec; see https://spark.apache.org/docs/latest/ml-features.html#word2vec
  4. Once your input dataframe is ready with a column of type Vector, instantiate org.apache.spark.ml.clustering.KMeans, set the parameters k and seed, then fit and predict, as in the snippets below and the example at https://spark.apache.org/docs/latest/ml-clustering.html#k-means
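For step 2, a minimal sketch of creating vectors by hand (the values are illustrative; note that the DataFrame-based KMeans used below expects org.apache.spark.ml.linalg vectors, not the older mllib ones):

import org.apache.spark.ml.linalg.Vectors

// A dense vector holding latitude, longitude and score for one point.
val dense = Vectors.dense(37.773972, -122.431297, 0.8)

// An equivalent sparse vector: size 3, non-zero values at indices 0 and 2.
val sparse = Vectors.sparse(3, Array(0, 2), Array(37.773972, 0.8))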
import org.apache.spark.ml.clustering.KMeans

// Build a dataframe containing a column "f" of type org.apache.spark.ml.linalg.Vector.
// (Note: this ml KMeans expects ml.linalg.Vector, not the older mllib.linalg.Vector.)
val trainingDataset = ... // your training dataframe

val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("f").setPredictionCol("p")
val model = kmeans.fit(trainingDataset) // your model is ready

// Predict on another dataset that has the same "f" column
val pDataset = ... // your prediction dataframe

val predictions = model.transform(pDataset)
// predictions will contain your predictions in column "p"
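To build the "f" column from the question's latitude, longitude and score columns in the first place, you can use Spark's VectorAssembler. A minimal sketch, assuming an input dataframe rawDataset with columns "latitude", "longitude" and "score" (those names are illustrative, not fixed by the API):

import org.apache.spark.ml.feature.VectorAssembler

// Combine the three numeric columns into a single Vector column "f".
val assembler = new VectorAssembler()
  .setInputCols(Array("latitude", "longitude", "score"))
  .setOutputCol("f")
val trainingDataset = assembler.transform(rawDataset)

Keep in mind the caveat from the other answer, though: K-Means will treat raw degrees as plain Euclidean coordinates.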

There are other examples available inside the "examples" folder of your local Spark installation.

Salim