
I am working on a text classification problem in Python using sklearn. I have created the model and saved it as a pickle file.

Below is the code I used in sklearn.

import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

vectorizerPipe = Pipeline([
    ('tfidf', TfidfVectorizer(lowercase=True, stop_words='english')),
    ('classification', OneVsRestClassifier(LinearSVC(penalty='l2', loss='hinge'))),
])

prd = vectorizerPipe.fit(features_used, labels_used)

with open(file_path, 'wb') as f:
    pickle.dump(prd, f)

Is there any way to use this same pickle to get predictions in DataFrame-based Apache Spark, rather than the RDD-based API? I have gone through the following articles but didn't find a proper way to implement it.

  1. what-is-the-recommended-way-to-distribute-a-scikit-learn-classifier-in-spark

  2. how-to-do-prediction-with-sklearn-model-inside-spark

  3. deploy-a-python-model-more-efficiently-over-spark

I found the first two questions on StackOverflow and found them useful.

I am a beginner in machine learning, so pardon me if the explanation is naive. Any related example or implementation would be helpful.

Sumit S Chawla

1 Answer


You can convert an RDD to a Spark DataFrame, for example:

import spark.implicits._

val testDF = rdd.map { line =>
  (line._1, line._2)
}.toDF("col1", "col2")