
I have a very simple csv file that looks like this:

time,is_boy,is_girl
135,1,0
136,0,1
137,0,1

I also have this csv file sitting in a Hive table, where all the columns have been created as doubles.

Behind the scenes, this table is actually enormous and has a huge number of rows, so I have chosen to use Spark 2 to solve this problem.

I would like to use this clustering library with Python: https://spark.apache.org/docs/2.2.0/ml-clustering.html

If anyone knows how to load this data in Python, either directly from the csv or by using some Spark SQL magic, and preprocess it correctly so that it can be passed into the KMeans fit() method to calculate a model, I would be very grateful. I also think it would be useful for others, as I haven't found an example for csvs with this library yet.

2 Answers


The fit method just takes a DataFrame that has a vector column (by default one called features).

spark.read.csv or spark.sql will both return you a DataFrame.

However you want to preprocess your data, read over the DataFrame documentation before getting into the MLlib / KMeans examples.
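For example, a minimal sketch of both loading routes in PySpark (the csv path, app name, and database/table names below are placeholders, not from the question):

    from pyspark.sql import SparkSession

    # Hive support is needed for spark.sql to see Hive tables
    spark = SparkSession.builder.appName("kmeans-prep").enableHiveSupport().getOrCreate()

    # Route 1: read the csv directly; header=True takes column names from the first
    # row and inferSchema=True parses the numeric columns as doubles instead of strings
    df_csv = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)

    # Route 2: query the existing Hive table with Spark SQL
    df_hive = spark.sql("select * from mydb.mytable")

    # Both are DataFrames and can be fed to a VectorAssembler and then to KMeans
    df_csv.printSchema()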

OneCricketeer

So I guessed enough times and finally solved this. There were quite a few odd things I had to do to get it to work, so I feel it's worth sharing:

I created a simple csv like so:

time,is_boy,is_girl
123,1.0,0.0
132,1.0,0.0
135,0.0,1.0
139,0.0,1.0
140,1.0,0.0

Then I created a Hive table by executing this query in Hue:

    CREATE EXTERNAL TABLE pollab02.experiment_raw(
        `time` double,
        `is_boy` double,
        `is_girl` double)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES ('separatorChar' = ',')
    STORED AS TEXTFILE LOCATION "/user/me/hive/experiment"
    TBLPROPERTIES ("skip.header.line.count"="1", "skip.footer.line.count"="0")

Then my pyspark script was as follows (I'm assuming a SparkSession has been created with the name "spark"):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

raw_data = spark.sql("select * from pollab02.experiment_raw")

# filter out rows of null values that were added for some reason
raw_data_filtered = raw_data.filter(raw_data.time > -1)

# convert the columns (which come back as strings) to doubles for kmeans
data = raw_data_filtered.select([col(c).cast("double") for c in raw_data_filtered.columns])
cols = data.columns

# add a column called "features" that packs all the columns into a single vector per row
vectorAss = VectorAssembler(inputCols=cols, outputCol="features")
vdf = vectorAss.transform(data)

kmeans = KMeans(k=2, maxIter=10, seed=1)
model = kmeans.fit(vdf)

and the rest is history. I haven't followed best practices here; we could probably drop some columns that we don't need from the vdf DataFrame to save space and improve performance, but this works.
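For completeness, here is a rough sketch of that idea, continuing from the script above (keeping only the assembled features column is my own assumption about what can be dropped; I haven't measured the performance impact):

    # keep only the assembled vector column; KMeans only reads the "features" column
    # (assumption: nothing else from vdf is needed downstream)
    features_only = vdf.select("features")

    kmeans = KMeans(k=2, maxIter=10, seed=1)
    model = kmeans.fit(features_only)

    # inspect the learned cluster centers and label each row with its cluster
    print(model.clusterCenters())
    predictions = model.transform(features_only)
    predictions.show(5)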