
I have some geographical points defined by latitude, longitude and a score, and I want to use the MLlib K-Means algorithm to cluster them. Is that possible with MLlib K-Means, and if so, how can I pass the parameters or features to the algorithm? As far as I can tell, it reads a text file of doubles and builds clusters from that.

Ahmed El-Gamal

2 Answers


Do not use k-means on latitude/longitude data

Because of distortion. The Earth is a sphere, and -180° and +180° longitude do not lie 360° apart: they are the same meridian. But even if you are well away from the date line, e.g. all your data is in San Francisco at latitude ~37.773972, you have a distortion of over 20%: a degree of longitude there spans only cos(37.77°) ≈ 0.79 of the distance covered by a degree of latitude, and this gets worse the further north you go.

Use an algorithm such as HAC or DBSCAN that can be used (in a good implementation; there are many bad implementations) with Haversine distance. For example, ELKI has very fast clustering algorithms, allows different geo-distances, and supports index acceleration, which helps a lot with geo points.
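For reference, the Haversine formula gives the great-circle distance on a sphere. A minimal Scala sketch (the function name and the 6371 km mean Earth radius are my own choices, not from any particular library):

import scala.math._

// Great-circle (Haversine) distance in kilometers between two points
// given as (latitude, longitude) in decimal degrees.
def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
  val dLat = toRadians(lat2 - lat1)
  val dLon = toRadians(lon2 - lon1)
  val a = pow(sin(dLat / 2), 2) +
          cos(toRadians(lat1)) * cos(toRadians(lat2)) * pow(sin(dLon / 2), 2)
  2 * 6371.0 * asin(sqrt(a)) // 6371 km is the commonly used mean Earth radius
}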

See also this blog post: https://doublebyteblog.wordpress.com/2014/05/16/clustering-geospatial-data/

Has QUIT--Anony-Mousse

If you still need to use K-Means from MLlib, the official documentation is at https://spark.apache.org/docs/latest/ml-clustering.html#k-means

  1. Build a dataframe containing a column to hold the features. Let's say the column is named "f", for features. The dataframe can contain other columns too; they won't be touched.
  2. This feature column is of type Vector. You can create a sparse or dense vector following the examples at https://spark.apache.org/docs/latest/mllib-data-types.html (a short sketch follows this list).
  3. If you have words, you can turn them into vectors with Word2Vec; see https://spark.apache.org/docs/latest/ml-features.html#word2vec
  4. Once your input dataframe is ready with a column of type Vector, instantiate org.apache.spark.ml.clustering.KMeans, set the parameters k and seed, then fit and predict, as in the snippets below and the example at https://spark.apache.org/docs/latest/ml-clustering.html#k-means
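For step 2, a minimal sketch of creating vectors by hand (the values are illustrative; note that the DataFrame-based KMeans used below expects org.apache.spark.ml.linalg vectors, not the older mllib ones):

import org.apache.spark.ml.linalg.Vectors

// A dense vector holding latitude, longitude and score for one point.
val dense = Vectors.dense(37.773972, -122.431297, 0.8)

// An equivalent sparse vector: size 3, non-zero values at indices 0 and 2.
val sparse = Vectors.sparse(3, Array(0, 2), Array(37.773972, 0.8))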
import org.apache.spark.ml.clustering.KMeans

// Build a dataframe containing a column "f" of type org.apache.spark.ml.linalg.Vector.
// (Note: this ml KMeans expects ml.linalg.Vector, not the older mllib.linalg.Vector.)
val trainingDataset = ... // your training dataframe

val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("f").setPredictionCol("p")
val model = kmeans.fit(trainingDataset) // your model is ready

// Predict on another dataset that has the same "f" column
val pDataset = ... // your prediction dataframe

val predictions = model.transform(pDataset)
// predictions will contain your predictions in column "p"
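To build the "f" column from the question's latitude, longitude and score columns in the first place, you can use Spark's VectorAssembler. A minimal sketch, assuming an input dataframe rawDataset with columns "latitude", "longitude" and "score" (those names are illustrative, not fixed by the API):

import org.apache.spark.ml.feature.VectorAssembler

// Combine the three numeric columns into a single Vector column "f".
val assembler = new VectorAssembler()
  .setInputCols(Array("latitude", "longitude", "score"))
  .setOutputCol("f")
val trainingDataset = assembler.transform(rawDataset)

Keep in mind the caveat from the other answer, though: K-Means will treat raw degrees as plain Euclidean coordinates.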

There are other examples available inside the "examples" folder of your local Spark installation.

Salim