
I am working on a text classification problem in Python using sklearn. I have created the model and saved it as a pickle file.

Below is the code I used in sklearn.

import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

vectorizerPipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('classification', OneVsRestClassifier(LinearSVC(penalty='l2', loss='hinge'))),
])

prd = vectorizerPipe.fit(features_used, labels_used)

with open(file_path, 'wb') as f:
    pickle.dump(prd, f)

Is there any way to use this same pickle to get predictions in DataFrame-based Apache Spark, rather than the RDD-based API? I have gone through the following articles but didn't find a proper way to implement it.

  1. what-is-the-recommended-way-to-distribute-a-scikit-learn-classifier-in-spark

  2. how-to-do-prediction-with-sklearn-model-inside-spark

  3. deploy-a-python-model-more-efficiently-over-spark

I found the first two questions on StackOverflow and found them useful.

I am a beginner in machine learning, so pardon me if the explanation is naive. Any related example or implementation would be helpful.

Sumit S Chawla

1 Answer


You can convert an RDD to a Spark DataFrame, for example:

import spark.implicits._

val testDF = rdd.map { line =>
  (line._1, line._2)
}.toDF("col1", "col2")