So after enough guessing I finally solved this. There were quite a few odd things I had to do to get it working, so I feel it's worth sharing:
I created a simple CSV like so:
time,is_boy,is_girl
123,1.0,0.0
132,1.0,0.0
135,0.0,1.0
139,0.0,1.0
140,1.0,0.0
Then I created a Hive table by executing this query in Hue:
CREATE EXTERNAL TABLE pollab02.experiment_raw(
  `time` double,
  `is_boy` double,
  `is_girl` double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',')
STORED AS TEXTFILE LOCATION '/user/me/hive/experiment'
TBLPROPERTIES ('skip.header.line.count' = '1', 'skip.footer.line.count' = '0')
Then my pyspark script was as follows (I'm assuming a SparkSession has already been created and is available under the name "spark"):
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

raw_data = spark.sql("select * from pollab02.experiment_raw")

# Filter out a row of null values that was added for some reason
raw_data_filtered = raw_data.filter(raw_data.time > -1)

# OpenCSVSerde hands every column back as a string, so cast them all to double for k-means
data = raw_data_filtered.select([col(c).cast("double") for c in raw_data_filtered.columns])
cols = data.columns

# Add a "features" column that packs each row's values into a single vector
vectorAss = VectorAssembler(inputCols=cols, outputCol="features")
vdf = vectorAss.transform(data)

kmeans = KMeans(k=2, maxIter=10, seed=1)
model = kmeans.fit(vdf)
and the rest is history. I haven't followed best practices here; we could probably drop the columns we no longer need from the vdf DataFrame to save space and improve performance, but this works.
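If you want to actually look at the clusters, here is a minimal sketch of how I'd inspect the model and apply the column-dropping idea (variable names continue from the script above; fitting on just the "features" column is an assumption I haven't benchmarked, not something from my original run):

# Inspect the learned centers and assign each row to a cluster
print(model.clusterCenters())
predictions = model.transform(vdf)  # adds a "prediction" column
predictions.select("features", "prediction").show()

# Optional: keep only the assembled vector before fitting, to save space
slim_vdf = vdf.select("features")
slim_model = KMeans(k=2, maxIter=10, seed=1).fit(slim_vdf)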