I have a follow-up question to this answer: https://stackoverflow.com/a/32557330/5235052. I am trying to build LabeledPoints from a DataFrame where the features and the label are columns. The features are all boolean, encoded as 1/0.
Here is a sample row from the DataFrame:
| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1|
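For reproduction, a DataFrame with the same shape could be created roughly like this (the sqlContext handle and the column names f1 to f5 are placeholders, not my real schema):
sample = [(0, 0, 0, 0, 0, 0.0),
          (1, 0, 1, 0, 0, 1.0)]
df = sqlContext.createDataFrame(
    sample, ['f1', 'f2', 'f3', 'f4', 'f5', 'is_item_return'])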
#Using the code from the answer above,
#create a list of feature column names from the DataFrame, skipping the label column
df_columns = []
for c in df.columns:
    if c == 'is_item_return':
        continue
    df_columns.append(c)
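#the same list could also be written as a comprehension, equivalent to the loop above
df_columns = [c for c in df.columns if c != 'is_item_return']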
#using VectorAssembler for the transformation; only the first five column names are used here
assembler = VectorAssembler()
assembler.setInputCols(df_columns[0:5])
assembler.setOutputCol('features')
transformed = assembler.transform(df)
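#as far as I can tell, the same assembler can also be configured in one call
#via keyword arguments (same inputCols/outputCol parameters as above)
assembler = VectorAssembler(inputCols=df_columns[0:5], outputCol='features')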
#the mapping below is also from the linked answer
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col
new_df = transformed.select(col('is_item_return'), col('features')) \
    .map(lambda row: LabeledPoint(row.is_item_return, row.features))
When I inspect the contents of the resulting RDD, the label comes back right, but the feature vector comes out empty.
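I look at it roughly like this (take() is the standard RDD action; printing a single element is enough to show the problem):
for lp in new_df.take(1):
    print(lp)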
(0.0,(5,[],[]))
Could someone help me understand how to pass the column names of an existing DataFrame as feature names to the VectorAssembler?