
I run Spark on a virtual machine and use the MLlib ALS implementation to train a recommendation model on my data:

from pyspark.mllib.recommendation import ALS, Rating

# Normalize tab-delimited lines to CSV, parse into Ratings, split, and train
rawRatings = sc.textFile('data/ratings.csv').map(lambda x: x.replace('\t', ','))
parsedRatings = rawRatings.map(lambda x: x.split(',')).map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
trainData, valData, testData = parsedRatings.randomSplit([0.6, 0.2, 0.2], seed=42)
model = ALS.train(trainData, rank=8, iterations=5, lambda_=0.1)

It works. But if I set iterations=10, it shows this error message:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-181-e64eb91ba0eb> in <module>()
      6 regularization_parameter = 0.1
      7 tolerance = 0.02
----> 8 model = ALS.train(trainData, rank=8, seed=seed, iterations=7, lambda_=regularization_parameter)

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/recommendation.py in train(cls, ratings, rank, iterations, lambda_, blocks, nonnegative, seed)
    138               seed=None):
    139         model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
--> 140                               lambda_, blocks, nonnegative, seed)
    141         return MatrixFactorizationModel(model)
    142 

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py in callMLlibFunc(name, *args)
    118     sc = SparkContext._active_spark_context
    119     api = getattr(sc._jvm.PythonMLLibAPI(), name)
--> 120     return callJavaFunc(sc, api, *args)
    121 
    122 

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py in callJavaFunc(sc, func, *args)
    111     """ Call Java Function """
    112     args = [_py2java(sc, a) for a in args]
--> 113     return _java2py(sc, func(*args))
    114 
    115 

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o7508.trainALSModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14882.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14882.0 (TID 3699, localhost): java.lang.StackOverflowError
    at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
    at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
    at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1505)
    .....

I am just wondering what is wrong here. It runs fine with iterations=6, but iterations=7 starts to produce this error again. I ran it in IPython with Python 3.x. Thanks for any answers!

  • Thank you. I use `rawRatings = sc.textFile('data/ratings.csv').map(lambda x: x.replace('\t', ','))` to generate the RDD, as I updated in my post. How can I update Spark in the VM? – TripleH Jul 03 '16 at 14:04
  • Most likely you didn't set `checkpointDir`. Regarding the update: same as usual. Download the new version and adjust the environment. – zero323 Jul 03 '16 at 14:10
  • Thank you. I am not sure whether what I did is correct. I added `sc.setCheckpointDir('data/')` and the problem still exists. What should I expect to see after executing `sc.setCheckpointDir('data/')`? How do I set `ALS.checkpointInterval`? I feel it may be the Spark version, or that I run the VM on my own laptop (not on a cluster) with little memory? – TripleH Jul 03 '16 at 15:49
  • `StackOverflowError`? Unlikely. – zero323 Jul 03 '16 at 21:47
  • Yes. I executed `sc.setCheckpointDir('data/checkpoint/')` and `ALS.checkpointInterval=2`, and then `model = ALS.train(trainData, rank=8, iterations=7, lambda_=0.1)` (see the sketch after this thread). I still get the same error message. Is it because of the old Spark version? But in this post: http://stackoverflow.com/questions/31484460/spark-gives-a-stackoverflowerror-when-training-using-als, Spark 1.4 and 1.5 also show the same error. Thanks. – TripleH Jul 03 '16 at 22:49
  • To be honest I don't know. Pretty much the main cause of a `StackOverflowError` in Spark is a long lineage, and it should be solved by checkpointing. – zero323 Jul 04 '16 at 00:00
  • So what I did is the correct approach, just adding those two lines? I will keep searching for the answer. Thank you for your help! – TripleH Jul 04 '16 at 00:39
  • Let's say that at first glance I don't see anything terribly wrong :) – zero323 Jul 04 '16 at 00:41
  • @zero323, it was a Spark version problem. I used v1.5.2 and even iterations=20 works, with no need for `sc.setCheckpointDir()`. – TripleH Aug 13 '16 at 13:27
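
Following up on the checkpointing suggestion in the thread above, here is a minimal sketch of the setup being described. It assumes the same `data/ratings.csv` file and Spark 1.3.x environment as the question; `sc.setCheckpointDir()` is a standard SparkContext method, but whether the RDD-based ALS in this version actually checkpoints (and whether assigning `ALS.checkpointInterval` has any effect in PySpark 1.3) is an assumption to verify. Per the last comment, the asker ultimately resolved the error by upgrading to Spark 1.5.2.

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName='als-checkpoint-sketch')

# Give Spark somewhere to persist intermediate RDDs; truncating the lineage
# this way is the usual remedy for StackOverflowError in iterative jobs.
sc.setCheckpointDir('data/checkpoint/')

# Same pipeline as in the question
rawRatings = sc.textFile('data/ratings.csv').map(lambda x: x.replace('\t', ','))
parsedRatings = rawRatings.map(lambda x: x.split(',')).map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
trainData, valData, testData = parsedRatings.randomSplit([0.6, 0.2, 0.2], seed=42)

# With checkpointing in place (or on Spark >= 1.5.2, per the last comment),
# higher iteration counts should train without blowing the JVM stack.
model = ALS.train(trainData, rank=8, iterations=10, lambda_=0.1)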
