
I am working on a machine learning dataset of shape 1,456,354 x 53 and want to do feature selection on it. I know how to do feature selection in Python with scikit-learn using the following code:

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursive feature elimination: drop one feature per step until 28 remain.
logreg = LogisticRegression()
rfe = RFE(logreg, step=1, n_features_to_select=28)
rfe = rfe.fit(df.values, arrythmia.values)

# rfe.support_ is a boolean mask over the columns; True marks a kept feature.
features_bool = np.array(rfe.support_)
features = np.array(df.columns)
result = features[features_bool]
print(result)

However, I could not find any article showing how to perform recursive feature elimination in PySpark.

I tried to import sklearn in PySpark, but it gave me a "sklearn module not found" error. I am running PySpark on a Google Dataproc cluster.

Could someone please help me achieve this in PySpark?


3 Answers


You have a few options for doing this.

  • If the model you need is implemented in either Spark's MLlib or spark-sklearn, you can adapt your code to use the corresponding library.

  • If you can train your model locally and just want to deploy it to make predictions, you can use User Defined Functions (UDFs) or vectorized UDFs to run the trained model on Spark. Here's a good post discussing how to do this.

  • If you need to run an sklearn model on Spark that is not supported by spark-sklearn, you'll need to make sklearn available to Spark on each worker node in your cluster. You can do this by manually installing sklearn on each node in your Spark cluster (make sure you are installing into the Python environment that Spark is using).

  • Alternatively, you can package and distribute the sklearn library with the PySpark job. In short, you can pip install sklearn into a local directory near your script, then zip the sklearn installation directory and use the --py-files flag of spark-submit to send the zipped sklearn to all workers along with your script. This article has a complete overview of how to accomplish this.


We can try the following feature selection methods in PySpark:

  • Chi-squared selector (ChiSqSelector)
  • Random forest feature importances



I suggest a stepwise regression model: it lets you easily find the important features, and you can then fit the logistic regression on only that subset. Stepwise regression works on correlation, but it has variations. The link below shows how to implement stepwise regression for feature selection: https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn
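On the sklearn side, the stepwise idea is available out of the box as SequentialFeatureSelector (forward selection); a small sketch with a synthetic dataset (the dataset and parameters are illustrative):

```python
# Forward stepwise selection: greedily add one feature at a time,
# keeping the one that most improves cross-validated score.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3,
                                direction="forward")
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the 3 chosen columns
```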
