So after enough guessing I finally solved this. There were quite a few odd things I had to do to get it working, so I feel it's worth sharing:
I created a simple CSV like so:
time,is_boy,is_girl
123,1.0,0.0
132,1.0,0.0
135,0.0,1.0
139,0.0,1.0
140,1.0,0.0
Then I created a Hive table by executing this query in Hue:
CREATE EXTERNAL TABLE pollab02.experiment_raw(
  `time` double,
  `is_boy` double,
  `is_girl` double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',')
STORED AS TEXTFILE LOCATION '/user/me/hive/experiment'
TBLPROPERTIES ('skip.header.line.count' = '1', 'skip.footer.line.count' = '0')
Then my pyspark script was as follows (I'm assuming a SparkSession has already been created and is available under the name "spark"):
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

raw_data = spark.sql("select * from pollab02.experiment_raw")

# Filter out a row of null values that was added for some reason
raw_data_filtered = raw_data.filter(raw_data.time > -1)

# OpenCSVSerde hands every column back as a string, so cast them all to double for k-means
data = raw_data_filtered.select([col(c).cast("double") for c in raw_data_filtered.columns])
cols = data.columns

# Add a "features" column that packs each row's values into a single vector
vectorAss = VectorAssembler(inputCols=cols, outputCol="features")
vdf = vectorAss.transform(data)

kmeans = KMeans(k=2, maxIter=10, seed=1)
model = kmeans.fit(vdf)
and the rest is history. I haven't followed best practices here; we could probably drop the columns we no longer need from the vdf DataFrame to save space and improve performance, but this works.
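If you want to actually look at the clusters, here is a minimal sketch of how I'd inspect the model and apply the column-dropping idea (variable names continue from the script above; fitting on just the "features" column is an assumption I haven't benchmarked, not something from my original run):

# Inspect the learned centers and assign each row to a cluster
print(model.clusterCenters())
predictions = model.transform(vdf)  # adds a "prediction" column
predictions.select("features", "prediction").show()

# Optional: keep only the assembled vector before fitting, to save space
slim_vdf = vdf.select("features")
slim_model = KMeans(k=2, maxIter=10, seed=1).fit(slim_vdf)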