
It's my very first time trying to run a KMeans cluster analysis in Spark, so I'm sorry for a stupid question.

I have a Spark DataFrame mydataframe with many columns. I want to run k-means on only two of them, lat and long (latitude & longitude), using them as plain numeric values, and extract 7 clusters based on just those two columns. I've tried:

from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel

# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat', 'long')

# Build the model (cluster the data)
clusters = KMeans.train(data, 7, maxIterations=15, initializationMode="random")

But I am getting an error:

'DataFrame' object has no attribute 'map'

What kind of object should one feed to KMeans.train? Clearly, it doesn't accept a DataFrame. How should I prepare my data frame for the analysis?

Thank you very much!

desertnaut
user3245256

1 Answer


The method KMeans.train takes an RDD as input, not a DataFrame (data). So you just have to convert data to an RDD: data.rdd. Hope it helps.

J.Khoder
  • Great, thank you very much! Also, I just discovered a brief mention of it here: https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html – so, in addition to being an RDD, the RDD should also be cached? – user3245256 Dec 01 '17 at 01:38
  • Would this suffice: data_rdd = data.rdd; data_rdd.cache() – and then clusters = KMeans.train(data_rdd, 7, maxIterations=15, initializationMode="random")? – user3245256 Dec 01 '17 at 01:42
  • Yes, it should also be cached (for a significant speed-up, since k-means iterates over the data repeatedly), and your statement is enough – J.Khoder Dec 01 '17 at 02:28