I am building a GBT model in PySpark. My dataframe has several input feature columns (X): A, B, and C. The output (Y) is a single column with binary values, 0 and 1.
I am confused about how to use VectorAssembler and transform to prepare the dataframe for GBT modelling. How can I do this in PySpark?
This is the code I tried in Python:
X_train, X_test = df_train.select(features), df_test.select(features) #5083, 1133
y_train, y_test = df_train.select(label), df_test.select(label)
gbt = GBTClassifier(maxIter=5, maxDepth=2, labelCol="indexed", seed=42)
model = gbt.fit(X_train,y_train)
How could I get the X_train and y_train using pyspark?