Need help! I am using Spark MLlib's ALS.trainImplicit. When I do a grid search, the code works normally in most cases, but at certain parameter combinations it stops and shows an error message like:
......
Rank 40, reg 1.0, alpha 2.0, the RMSE = 29.7147495287
Rank 40, reg 1.0, alpha 5.0, the RMSE = 30.1937843479
Traceback (most recent call last):
File "/home/ubuntu/test/als.py", line 270, in <module>
File "/home/ubuntu/test/als.py", line 125, in __init__self.models_grid_search()
File "/home/ubuntu/test/als.py", line 195, in models_grid_search
model = ALS.trainImplicit(self.trainData, rank, iterations=self.iterations,
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/recommendation.py", line 201, in trainImplicit
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 130, in callMLlibFunc
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 123, in callJavaFunc
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o1336.trainImplicitALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in stage 22096.0 failed 4 times,
most recent failure: Lost task 17.3 in stage 22096.0 (TID 25114, 172.31.11.21): java.lang.AssertionError:
assertion failed: lapack.dppsv returned 23.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.ml.recommendation.ALS$CholeskySolver.solve(ALS.scala:393)
at org.apache.spark.ml.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1170)
at org.apache.spark.ml.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1131)
.....
I know there is an earlier post mentioning a solution: Spark gives a StackOverflowError when training using ALS. But when I put sc.setCheckpointDir('checkpoint/') or ALS.checkpointInterval = 2 in the code ("which prevents the recursion used by the codebase from creating an overflow," as explained in that post), execution stops immediately, without even producing any trainImplicit results. The error message is:
Traceback (most recent call last):
File "/home/ubuntu/test/als_implicit.py", line 275, in <module> engine = ImplicitCF(sc, rank=8, seed=5L, iterations=10,reg_parameter=0.06)
File "/home/ubuntu/test/als_implicit.py", line 129, in __init__self.models_grid_search()
File "/home/ubuntu/test/als_implicit.py", line 200, in models_grid_search lambda_=reg, blocks=-1, alpha=alphas, nonnegative=False, seed=self.seed)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/recommendation.py", line 201, in trainImplicit
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 130, in callMLlibFunc
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 123, in callJavaFunc
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError:
An error occurred while calling o116.trainImplicitALSModel.: org.apache.spark.SparkException: Checkpoint RDD ReliableCheckpointRDD[232]
at aggregate at ALS.scala:1182(0) has different number of partitions from original RDD itemFactors-10
MapPartitionsRDD[230] at mapValues at ALS.scala:1131(18)
at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:73)
at org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:74)
at org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply$mcV$sp(RDD.scala:1655)
at org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1652)
....
Where should I put sc.setCheckpointDir('checkpoint/') and ALS.checkpointInterval = 2? Should I put them right after creating sc, before training the model, or after training the model? (One alternative placement I am considering is sketched after my code below.) The following is my code (for the latter error message, with sc.setCheckpointDir('checkpoint/')):
from __future__ import print_function
import sys
from pyspark.mllib.recommendation import ALS
from pyspark.mllib.recommendation import Rating
from pyspark import SparkContext, SparkConf
class ImplicitCF(object):
    def __init__(self, sc, rank, seed, iterations, reg_parameter):
        text = sc.textFile(sys.argv[1], 1)
        sc.setCheckpointDir('checkpoint/')  # current placement: inside __init__, before training
        self.sc = sc
        self.rank = rank
        self.seed = seed
        self.iterations = iterations
        self.reg = reg_parameter
        self.models_grid_search()
    .......
    def models_grid_search(self):
        # grid search over regularization and alpha
        for reg in [1.0, 2.0, 5.0]:
            for alphas in [0.1, 0.5, 1.0, 2.0, 5.0]:
                model = ALS.trainImplicit(self.trainData, rank=self.rank, iterations=self.iterations,
                                          lambda_=reg, blocks=-1, alpha=alphas, nonnegative=False, seed=self.seed)
                #self.sc.setCheckpointDir('checkpoint/')
                #ALS.checkpointInterval = 2
    ....

if __name__ == "__main__":
    sc = SparkContext(appName="implicit_train_test")
    engine = ImplicitCF(sc, rank=8, seed=5L, iterations=10, reg_parameter=0.06)
    sc.stop()
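For reference, here is one alternative placement I am considering: configuring the checkpoint settings once on the driver, right after creating the SparkContext and before any training call. This is only a sketch: the 'checkpoint/' path is arbitrary, and I am not sure whether assigning ALS.checkpointInterval as a class attribute has any effect on trainImplicit in v1.5.2, which is part of what I am asking.

if __name__ == "__main__":
    sc = SparkContext(appName="implicit_train_test")
    # Set the checkpoint directory once, before any ALS training runs,
    # so the RDD lineage can be truncated during the iterations.
    sc.setCheckpointDir('checkpoint/')
    # Attempted override of the default checkpoint interval; I do not
    # know whether this class-level assignment is honored in v1.5.2.
    ALS.checkpointInterval = 2
    engine = ImplicitCF(sc, rank=8, seed=5L, iterations=10, reg_parameter=0.06)
    sc.stop()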
Thank you very much for any expert solutions and input. My Spark version is v1.5.2, and as shown above, I am using PySpark.