I have trained a model in Python using sklearn. How can we load the same model in Spark and generate predictions on a Spark RDD?
Well, I will show an example of linear regression in sklearn and how to use that model to predict elements of a Spark RDD.
First, train the model with the sklearn example:
from sklearn import linear_model

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
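If the model was trained in a different process than your Spark driver, you first need to move it across; a minimal sketch using joblib (the file name here is just an illustration):

import joblib

# In the training process: persist the fitted model to disk
joblib.dump(regr, "regr.joblib")

# In the Spark driver: load it back before broadcasting
regr = joblib.load("regr.joblib")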
Here we only fit the model; the predictions are generated from an RDD.
Your RDD in this case should be an RDD containing your X values, like this:
rdd = sc.parallelize([1, 2, 3, 4])
So you first need to broadcast your sklearn model, so every executor gets a copy:
regr_bc = sc.broadcast(regr)
Then you can use it to predict your data like this:
# predict() expects a 2-D array, hence the [[x]] wrapping
rdd.map(lambda x: (x, float(regr_bc.value.predict([[x]])[0]))).collect()
So the first element of each pair is your X and the second element is the predicted Y. The collect() will return something like this:
[(1, 2), (2, 4), (3, 6), ...]
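In practice your model will usually take a feature vector rather than a single number. A minimal sketch of the same pattern with multi-feature rows (the feature values and the two-feature model are assumptions for illustration):

# Assumed: `sc` is an existing SparkContext and `regr` was trained on 2 features
rdd = sc.parallelize([[0.5, 1.2], [0.1, 0.7], [0.9, 0.3]])
regr_bc = sc.broadcast(regr)

# Each row is already a list, so [row] gives predict() its 2-D input
rdd.map(lambda row: (row, float(regr_bc.value.predict([row])[0]))).collect()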

Thiago Baldim
- collect() pulls everything onto the local driver; so what's the alternative in case of a large dataset? – Rudr Sep 27 '18 at 03:05
- Hello, this is just an example to give a quick response. The collect is just there to show the results on the screen. For large datasets I would suggest you use `write()` to save to your Hadoop cluster or an S3 bucket. – Thiago Baldim Sep 27 '18 at 03:54
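For a plain RDD, the equivalent of the `write()` suggested above would be `saveAsTextFile` (or converting to a DataFrame and using its writer); a minimal sketch, with a hypothetical output path:

# Keep the predictions distributed and write them out instead of collecting
predictions = rdd.map(lambda x: (x, float(regr_bc.value.predict([[x]])[0])))

# Hypothetical path; an s3a:// URI works the same way
predictions.saveAsTextFile("hdfs:///tmp/sklearn_predictions")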
- I would suggest `mapPartitions` instead as that will allow you to predict in batches and avoid some overhead – eggie5 Oct 22 '19 at 13:05
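A minimal sketch of that `mapPartitions` approach, which calls predict() once per partition instead of once per element (the function and variable names are just illustrations):

def predict_partition(rows):
    # Materialize the partition and score it in one vectorized call
    rows = list(rows)
    if not rows:
        return iter([])
    preds = regr_bc.value.predict([[x] for x in rows])
    return zip(rows, (float(p) for p in preds))

predictions = rdd.mapPartitions(predict_partition)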