I am following spark tutorial from Watson Studio Gallery on IBM Cloud (https://eu-de.dataplatform.cloud.ibm.com/exchange/public/entry/view/99b857815e69353c04d95daefb3b91fa?context=cpdaas) and run into Java stack overflow problem:
Py4JJavaError: An error occurred while calling o20418.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError
java.lang.StackOverflowError
at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:516)
at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1154)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
The problem line :
cvModel = crossval.fit(trainingRatings)
The problem cell:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
(trainingRatings, validationRatings) = ratings.randomSplit([80.0, 20.0])
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')
paramGrid = ParamGridBuilder().addGrid(als.rank, [1, 5, 10]).addGrid(als.maxIter, [20]).addGrid(als.regParam, [0.05, 0.1, 0.5]).build()
crossval = CrossValidator(estimator=als, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=10)
cvModel = crossval.fit(trainingRatings)
predictions = cvModel.transform(validationRatings)
print('The root mean squared error for our model is: {}'.format(evaluator.evaluate(predictions.na.drop())))
Environment used: Default Spark 3.2 & Python 3.9
Will be grateful for any help.