
Need help! I am using Spark MLlib's ALS.trainImplicit. When I do a grid search, the code works normally in most cases, but at certain parameter values it stops with an error message like:

......
Rank 40, reg 1.0, alpha 2.0, the RMSE = 29.7147495287 
Rank 40, reg 1.0, alpha 5.0, the RMSE = 30.1937843479
Traceback (most recent call last):
  File "/home/ubuntu/test/als.py", line 270, in <module>

  File "/home/ubuntu/test/als.py", line 125, in __init__self.models_grid_search()
  File "/home/ubuntu/test/als.py", line 195, in models_grid_search
  model = ALS.trainImplicit(self.trainData, rank, iterations=self.iterations,
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/recommendation.py", line 201, in trainImplicit
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 130, in callMLlibFunc
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 123, in callJavaFunc
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
  py4j.protocol.Py4JJavaError: An error occurred while calling o1336.trainImplicitALSModel.
  :org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in stage 22096.0 failed 4 times, 
  most recent failure: Lost task 17.3 in stage 22096.0 (TID 25114, 172.31.11.21): java.lang.AssertionError: 
  assertion failed: lapack.dppsv returned 23.
     at scala.Predef$.assert(Predef.scala:179)
     at org.apache.spark.ml.recommendation.ALS$CholeskySolver.solve(ALS.scala:393)
     at org.apache.spark.ml.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1170)
     at org.apache.spark.ml.recommendation.ALS$$anonfun$org$apache$spark$ml$recommendation$ALS$$computeFactors$1.apply(ALS.scala:1131)
    .....

I know there is an earlier post mentioning a solution: Spark gives a StackOverflowError when training using ALS. But when I put sc.setCheckpointDir('checkpoint/') or ALS.checkpointInterval = 2 ("which prevents the recursion used by the codebase from creating an overflow", as explained in that post) in the code, the execution stops immediately, without even producing any trainImplicit output. The error message is:

Traceback (most recent call last):
  File "/home/ubuntu/test/als_implicit.py", line 275, in <module> engine = ImplicitCF(sc, rank=8, seed=5L, iterations=10,reg_parameter=0.06)
  File "/home/ubuntu/test/als_implicit.py", line 129, in __init__self.models_grid_search()
  File "/home/ubuntu/test/als_implicit.py", line 200, in models_grid_search lambda_=reg, blocks=-1, alpha=alphas, nonnegative=False, seed=self.seed)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/recommendation.py", line 201, in trainImplicit
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 130, in callMLlibFunc
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 123, in callJavaFunc
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: 
  An error occurred while calling o116.trainImplicitALSModel.: org.apache.spark.SparkException: Checkpoint RDD ReliableCheckpointRDD[232]       
  at aggregate at ALS.scala:1182(0) has different number of partitions from original RDD itemFactors-10 
  MapPartitionsRDD[230] at mapValues at ALS.scala:1131(18)
      at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:73)
      at org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:74)
      at org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply$mcV$sp(RDD.scala:1655)
      at org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1652)
      ....

Where should I put sc.setCheckpointDir('checkpoint/') and ALS.checkpointInterval = 2? Should I put them right after creating the SparkContext, before training the model, or after training the model? The following is my code (for the later error message, with sc.setCheckpointDir('checkpoint/')):

from __future__ import print_function
import sys
from pyspark.mllib.recommendation import ALS
from pyspark.mllib.recommendation import Rating
from pyspark import SparkContext, SparkConf

class ImplicitCF(object):
   def __init__(self, sc, rank, seed, iterations, reg_parameter):
       text = sc.textFile(sys.argv[1], 1)
       sc.setCheckpointDir('checkpoint/')
       self.sc = sc

       self.rank = rank
       self.seed = seed
       self.iterations = iterations
       self.reg = reg_parameter

       self.models_grid_search()

   .......

   def models_grid_search(self):
       for reg in [1.0, 2.0, 5.0]:
          for alphas in [0.1, 0.5, 1.0, 2.0, 5.0]:
            model = ALS.trainImplicit(self.trainData, rank=self.rank, iterations=self.iterations, lambda_=reg, blocks=-1, alpha=alphas, nonnegative=False, seed=self.seed)
            #self.sc.setCheckpointDir('checkpoint/')                                                       
            #ALS.checkpointInterval = 2    

   ....              

if __name__ == "__main__":
   sc = SparkContext(appName="implicit_train_test")
   engine = ImplicitCF(sc, rank=8, seed=5L, iterations=10, reg_parameter=0.06)
   sc.stop()
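
For reference, below is a minimal sketch of the placement variant I am considering (not a confirmed fix): the checkpoint directory is set immediately after the SparkContext is created and ALS.checkpointInterval is set before any training is triggered. The tiny in-line ratings RDD is only there so the snippet runs on its own, and I am not sure whether the MLlib wrapper in 1.5.2 actually reads the checkpointInterval attribute.

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

if __name__ == "__main__":
    sc = SparkContext(appName="implicit_train_test")
    # Variant A: set the checkpoint directory right after the SparkContext
    # is created, before any RDDs are built or any training runs.
    sc.setCheckpointDir('checkpoint/')
    # Variant B (from the earlier post): set the interval on the ALS class
    # before calling trainImplicit; I am not sure this attribute is actually
    # read by trainImplicit in Spark 1.5.2.
    ALS.checkpointInterval = 2

    # Hypothetical tiny implicit-feedback dataset, only so the sketch runs.
    ratings = sc.parallelize([Rating(0, 0, 4.0), Rating(0, 1, 1.0),
                              Rating(1, 0, 2.0), Rating(1, 1, 3.0)])
    model = ALS.trainImplicit(ratings, rank=2, iterations=5,
                              lambda_=1.0, alpha=0.5, seed=5L)
    sc.stop()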

Thank you very much for any expert solutions and input. My Spark version is 1.5.2 and, as shown above, I am using PySpark.

  • The first issue is a result of an ill-conditioned matrix. There is not much you can really do about it (you can try adding some random noise, but that is a rather ugly trick). Regarding the rest... the way you define the class looks rather suspicious (see for example http://stackoverflow.com/q/32505426/1560062), but without a [mcve] it is just a guess. – zero323 Aug 13 '16 at 18:39
  • @zero323: thank you very much. Can you point me to any information (like pages where people have discussed this) about adding random noise? Is the noise added to the data? When I incremented every rating in the RDD (`self.trainData` in my code) by one, i.e. `(user, product, views) -> (user, product, views+1)`, this problem vanished. I tried many things, like shrinking the number of iterations and changing the seeds, and none of them worked. Like you said, maybe there is something in the data structure that causes the ill-conditioned matrix problem? – TripleH Aug 14 '16 at 03:16
  • Not really, but the idea is pretty much the same as adding 1; just use some distribution that should have minimal impact (~N(0, 0.005) for example). – zero323 Aug 14 '16 at 03:40
  • I think this is a good idea. Do you think such a perturbation will add noise to the data analysis or cause some unpredictable effect on model accuracy? – TripleH Aug 14 '16 at 04:03
  • Some, sure, but we work with approximations anyway, and typically noise close to the limits of numerical precision is just enough. – zero323 Aug 14 '16 at 13:34
  • Thanks for the explanation. I did another test. When I map `(user, product, views) -> (user, product, views-1)`, there are many `(user, product, 0)` entries, which cause an ill-conditioned matrix and result in training errors. But if I instead filter out `(user, product, views <= 1)` first and then map `(user, product, views) -> (user, product, views-1)`, there are no `(user, product, 0)` entries, and the implicit training has NO problem at all. Later I will try to add small noise as you suggested and double-check. – TripleH Aug 15 '16 at 04:26
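
Following zero323's suggestion in the comments above, here is a minimal sketch of how I understand the noise trick: perturb each implicit count with a tiny Gaussian term before calling trainImplicit. The (user, product, views) tuple layout matches my trainData above, and sigma=0.005 just mirrors the ~N(0, 0.005) he mentioned; neither is anything prescribed by Spark.

import random

def add_gaussian_noise(record, sigma=0.005):
    # record is assumed to be a (user, product, views) tuple, as in the
    # ratings RDD used in my code above; sigma follows the ~N(0, 0.005)
    # suggestion and is otherwise arbitrary.
    user, product, views = record
    return (user, product, views + random.gauss(0.0, sigma))

# Inside models_grid_search, before the ALS.trainImplicit call:
#     noisy_train_data = self.trainData.map(add_gaussian_noise)
#     model = ALS.trainImplicit(noisy_train_data, rank=self.rank, ...)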
