I tried to fit a random forest classifier in PySpark, but I'm getting this error:

Py4JJavaError: An error occurred while calling o767.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 1 times, most recent failure: Lost task 0.0 in stage 30.0 (TID 853, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space

Can anyone help me please?

My code:

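from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# Single-point grid: only numTrees=100 is evaluated
paramGrid = (ParamGridBuilder()
             .addGrid(rf.numTrees, [100])
             .build())

crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=10)
cvModel = crossval.fit(trainingData)  # the OutOfMemoryError is raised here
# transform() belongs to the fitted CrossValidatorModel, not to CrossValidator
predictions = cvModel.transform(testData)
predictions.printSchema()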
Evan Zamir
  • Can you post more lines of the stack trace? If you're running in an IPython-like environment (like Databricks), I think it is useful to split these commands into cells, so you know for sure where it happened. – luk Jun 25 '20 at 19:50
  • I'm running the notebook in Google Colab. The problem is in the fit method (cvModel = crossval.fit(trainingData)). I'm not sure whether the amount of data is the cause of this problem; I have about 1 million rows. – Kousseila Rekkam Jun 26 '20 at 11:17
  • This is the error: Py4JJavaError Traceback (most recent call last) in () evaluator=BinaryClassificationEvaluator(), numFolds=10) ---> cvModel = crossval.fit(trainingData) predictions = crossval.transform(testData) predictions.printSchema() caused by: java.lang.OutOfMemoryError: Java heap space – Kousseila Rekkam Jun 26 '20 at 11:17
  • As you can see, the problem is in the fit method. – Kousseila Rekkam Jun 26 '20 at 11:21
  • Which Spark mode is this running in: local or standalone? I think Colab doesn't give a lot of flexibility here. What about trying Databricks or another platform where you can increase the number of nodes or executors? And how big is the dataset, and what is its shape? – luk Jun 27 '20 at 21:24
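
On the question of whether the amount of work could be the cause: the error message says "localhost, executor driver", which suggests Spark is running in local mode, so everything shares a single JVM heap. As a rough illustration (not a memory estimate), the sketch below counts how many forests and trees one crossval.fit call trains with the settings in the question. The helper name count_forest_fits is hypothetical and not part of PySpark; it assumes CrossValidator fits one model per fold and parameter map, then refits the best parameter map on the full training set.

```python
def count_forest_fits(num_folds, num_param_maps, num_trees):
    """Return (forest_fits, total_trees) for one CrossValidator.fit call.

    Hypothetical helper for illustration only; assumes every parameter
    map uses the same number of trees.
    """
    # One fit per (fold, parameter map) pair, plus one final refit of the
    # best parameter map on the full training set.
    forest_fits = num_folds * num_param_maps + 1
    total_trees = forest_fits * num_trees
    return forest_fits, total_trees


# Settings from the question: numFolds=10, a single-entry grid, numTrees=100
fits, trees = count_forest_fits(num_folds=10, num_param_maps=1, num_trees=100)
print(fits, trees)  # 11 1100
```

Lowering numFolds or numTrees reduces this count proportionally. Whether that alone avoids the heap error on about 1 million rows is not something the numbers can confirm; it also depends on how much memory the driver JVM has and on the shape of the feature vectors.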
