I want to train a decision tree model and run the script with spark-submit.

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
from pyspark import SparkConf, SparkContext
from numpy import array
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/newumc.classification_data") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/newumc.classification_data") \
    .getOrCreate()

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

dt = df.rdd.map(createLabeledPoints)

# All 39 features (indices 0 to 38) are categorical with 2 categories each.
categorical_features = {i: 2 for i in range(39)}

model_dt = DecisionTree.trainClassifier(
    dt,
    numClasses=467,
    categoricalFeaturesInfo=categorical_features,
    impurity='gini',
    maxDepth=30,
    maxBins=32)

where createLabeledPoints is a function that takes a row and returns a LabeledPoint.

I have no issue when I run this code interactively in the pyspark shell, but when I run it with spark-submit it gives me this error:

pickle.PicklingError: Could not serialize object: TypeError: can't pickle thread.lock objects

I think the problem is either that I create another SparkSession inside the script run by spark-submit, or that a PySpark DataFrame cannot be pickled. Can anyone please help me?
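For reference, the underlying TypeError can be reproduced without Spark at all. This is only a minimal sketch, assuming (unconfirmed) that the function passed to map ends up referencing some object that holds a thread lock. The names captured_state and try_pickle are hypothetical and do not come from my real script:

```python
import pickle
import threading

# Hypothetical stand-in for whatever the mapped function might capture:
# a plain value it needs, plus an object that owns a thread lock.
captured_state = {"offset": 1.0, "lock": threading.Lock()}


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the error message."""
    try:
        pickle.dumps(obj)
        return None
    except (TypeError, pickle.PicklingError) as exc:
        return str(exc)


# Pickling the whole captured state fails because of the lock inside it.
error_with_lock = try_pickle(captured_state)

# Pickling only the plain value succeeds.
error_plain = try_pickle(captured_state["offset"])

print("with lock:", error_with_lock)
print("plain value:", error_plain)
```

If this is what happens in my script, then passing only plain values into createLabeledPoints, rather than anything tied to the session, would presumably avoid the error, but I have not verified that.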

betty bth