I have some geographical points defined by latitude, longitude and a score, and I want to use the MLlib K-Means algorithm to cluster them. Is that possible with MLlib K-Means, and if so, how do I pass the parameters or features to the algorithm? As far as I can tell, it reads a text file of doubles and builds clusters from that.
2 Answers
Do not use k-means on latitude/longitude data
Because of distortion. The Earth is a sphere, and -180° and +180° are not 360° apart. But even if you are well away from the date line, e.g. all your data is in San Francisco at latitude ~37.773972, a degree of longitude is over 20% shorter than a degree of latitude, and this distortion gets worse the further north you go.
Use an algorithm such as HAC or DBSCAN that can be used with Haversine distance (in a good implementation; there are many bad implementations). ELKI, for example, has very fast clustering algorithms, allows different geo-distances, and supports index acceleration, which helps a lot with geo points.
See also this blog post: https://doublebyteblog.wordpress.com/2014/05/16/clustering-geospatial-data/
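To see the distortion in numbers, here is a minimal sketch of the Haversine distance in Scala (the metric recommended above). The function name haversineKm and the sample coordinates are illustrative, not part of any library:

def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val R = 6371.0 // mean Earth radius in km
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) * math.pow(math.sin(dLon / 2), 2)
  2 * R * math.asin(math.sqrt(a))
}

// One degree of longitude at San Francisco's latitude vs. at the equator:
haversineKm(37.773972, -122.0, 37.773972, -121.0) // ~87.9 km
haversineKm(0.0, -122.0, 0.0, -121.0)             // ~111.2 km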

If you still need to use MLlib's K-Means, the official documentation is at https://spark.apache.org/docs/latest/ml-clustering.html#k-means
- Build a DataFrame containing a column to hold the features. Let's say the column name is "f", for features. It can contain other columns too, which won't be touched.
- This feature column is of type Vector. You can create a sparse or dense vector following https://spark.apache.org/docs/latest/mllib-data-types.html (see the short sketch after this list).
- If you have words, you can create vectors for them with Word2Vec: https://spark.apache.org/docs/latest/ml-features.html#word2vec
- Once your input DataFrame is ready with a column of type Vector, instantiate org.apache.spark.ml.clustering.KMeans, set the parameters K and seed, then fit and predict. You can follow this example: https://spark.apache.org/docs/latest/ml-clustering.html#k-means
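As a sketch of the second step, assuming the question's three features (latitude, longitude, score): in Spark 2.x and later the DataFrame-based KMeans expects org.apache.spark.ml.linalg.Vector (not the older mllib type), and vectors can be built like this, with made-up values:

import org.apache.spark.ml.linalg.Vectors
// Dense vector of the three features (latitude, longitude, score)
val dense = Vectors.dense(37.773972, -122.431297, 0.8)
// Equivalent sparse form: size, indices of the non-zero entries, their values
val sparse = Vectors.sparse(3, Array(0, 1, 2), Array(37.773972, -122.431297, 0.8))

Putting it together, the fit-and-predict flow from the linked example: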
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._ // spark is your SparkSession

// Build a training DataFrame with a column "f" of type org.apache.spark.ml.linalg.Vector
val trainingDataset = Seq(Vectors.dense(37.77, -122.43, 0.8), Vectors.dense(40.71, -74.01, 0.3))
  .map(Tuple1.apply).toDF("f")
val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("f").setPredictionCol("p")
val model = kmeans.fit(trainingDataset) // your model is ready

// Predict on another DataFrame that has the same feature column "f"
val pDataset = trainingDataset // replace with the data you want to cluster
val predictions = model.transform(pDataset) // predictions appear in column "p"
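For a quick sanity check, the assignments and the learned centers can be inspected; both calls exist on the DataFrame and on KMeansModel:

predictions.select("f", "p").show()   // each row with its cluster index
model.clusterCenters.foreach(println) // one center per cluster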
There are more examples available in the "examples" folder of your local Spark installation.
