
I am doing GBT modelling using PySpark. I have a dataframe where the input features (X) are multiple columns: A, B, C, and the output (Y) is a single column with binary values 0 and 1.

I am confused about how VectorAssembler and transform are used to prepare the dataframe for GBT modelling. How can I do this in PySpark?

The code I used in Python is like this:

X_train, X_test = df_train.select(features), df_test.select(features)  # 5083, 1133
y_train, y_test = df_train.select(label), df_test.select(label)

gbt = GBTClassifier(maxIter=5, maxDepth=2, labelCol="indexed", seed=42)
model = gbt.fit(X_train, y_train)

How could I get the X_train and y_train using pyspark?

  • From your code it looks like the features are in a single column (`features`), is this not the case? – Shaido Feb 06 '18 at 07:14
  • So you are trying to convert scikit-learn code to PySpark? Did you read the Spark ML documentation? It's straightforward for things like this. – eliasah Feb 06 '18 at 07:34

1 Answer


VectorAssembler takes two arguments: inputCols (the columns to be assembled into a vector) and outputCol (the name of the output column). A simple example would be:

from pyspark.ml.feature import VectorAssembler

df = spark.createDataFrame([(1, 0, 3, 0), (4, 5, 6, 1), (7, 8, 9, 0)],
                           ["a", "b", "c", "label"])

# Combine columns a, b and c into a single vector column named "features"
vecAssembler = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features")
df_va = vecAssembler.transform(df)

df_va now contains five columns: ['a', 'b', 'c', 'label', 'features']. Every row in the features column contains a vector with the values from columns ['a', 'b', 'c'].
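If you want to sanity-check the result, you can print the assembled column with the standard DataFrame API (whether the vectors display as dense or sparse may vary):

# Each row of "features" holds the values of a, b and c packed into one vector
df_va.select("a", "b", "c", "features").show(truncate=False)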

The GBTClassifier needs to know which column is the dependent variable; for this dataframe it is 'label'. Select the features and label columns from df_va and fit your model like this:

from pyspark.ml.classification import GBTClassifier

df_gbt = df_va.select("label", "features")  # keep only the label and the feature vector
gbt = GBTClassifier(maxIter=5, maxDepth=2, labelCol="label", seed=42)
model = gbt.fit(df_gbt)
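As for getting X_train and y_train: Spark ML does not take separate X and y. You split the assembled dataframe and pass the whole thing to fit; labelCol and featuresCol tell the estimator which columns to use. A minimal sketch (the 0.8/0.2 split ratio here is my own choice, not from the question):

df_train, df_test = df_gbt.randomSplit([0.8, 0.2], seed=42)  # ratio is an assumption

model = gbt.fit(df_train)               # trains on "features" vs. "label"
predictions = model.transform(df_test)  # adds a "prediction" column
predictions.select("label", "prediction").show()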