You can use VectorAssembler in combination with a list comprehension to structure your data for model training. Consider this example data with two feature columns (x1 and x2) and a response variable y.
df = sc.parallelize([(5, 1, 6),
(6, 9, 4),
(5, 3, 3),
(4, 4, 2),
(4, 5, 1),
(2, 2, 2),
(1, 7, 3)]).toDF(["y", "x1", "x2"])
First, we create a list of all column names other than "y":
colsList = [x for x in df.columns if x != 'y']
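As a quick sanity check, the same comprehension works on a plain Python list of names (the column names below simply mirror the example DataFrame, so no Spark session is needed):

```python
# Column names matching the example DataFrame above.
columns = ["y", "x1", "x2"]

# Keep every column except the response variable "y".
colsList = [x for x in columns if x != 'y']
print(colsList)  # ['x1', 'x2']
```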
Now, we can use VectorAssembler
:
from pyspark.ml.feature import VectorAssembler
vectorizer = VectorAssembler()
vectorizer.setInputCols(colsList)
vectorizer.setOutputCol("features")
output = vectorizer.transform(df)
output.select("features", "y").show()
+---------+---+
| features| y|
+---------+---+
|[1.0,6.0]| 5|
|[9.0,4.0]| 6|
|[3.0,3.0]| 5|
|[4.0,2.0]| 4|
|[5.0,1.0]| 4|
|[2.0,2.0]| 2|
|[7.0,3.0]| 1|
+---------+---+