
When attempting to run my method:

    def doGD() = {
       allRatings.foreach(rating => gradientDescent(rating));
    }

I get the error: org.apache.spark.SparkException: Task not serializable

I understand that my Gradient Descent method is not going to parallelise, because each step depends on the previous step, so working in parallel is not an option. However, if I do this from the Console:

    val gd = new GradientDescent()
    gd.doGD();

I get the error as mentioned.

However, if in the Console I do this:

    val gd = new GradientDescent()
    gd.allRatings.foreach(rating => gradientDescent(rating))

It works perfectly fine. As you may have noticed, the second example runs the same code as the first; the only difference is that instead of calling the method, I take the code out of the method and run it directly.

Why does one work and the other not? I'm bemused.

(Additional note: class `GradientDescent` extends `Serializable`.)

The gradientDescent method:

    def gradientDescent(rating: Rating) = {
      var userVector = userFactors.get(rating.user).get
      var itemVector = itemFactors.get(rating.product).get

      userFactors.map(x => if (x._1 == rating.user) (x._1, x._2 += 0.02 * (calculatePredictionError(rating.rating, userVector, itemVector) * itemVector)))
      userVector = userFactors.get(rating.user).get // updated user vector

      itemFactors.map(x => if (x._1 == rating.product) (x._1, x._2 += 0.02 * (calculatePredictionError(rating.rating, userVector, itemVector) * itemVector)))
    }

I know I'm using 2 vars stored on the master - userFactors and itemFactors - and as the process is sequential parallelising is not possible. But this doesn't explain why calling the method from the Console does not work but re-writing the inners of the method in the Console does.
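For reference, a minimal plain-JVM sketch (hypothetical classes, nothing to do with my actual fields, and no Spark involved) showing that extending `Serializable` is not by itself enough: Java serialization walks every field transitively, so a single unserializable member anywhere in the object graph fails the whole object:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Both classes extend Serializable, but Inner holds a Thread, which does not.
class Inner(val handle: Thread) extends Serializable
class Outer(val inner: Inner) extends Serializable

try {
  new ObjectOutputStream(new ByteArrayOutputStream).writeObject(new Outer(new Inner(new Thread)))
} catch {
  // Serialization recursed through Outer -> Inner and failed on the Thread field.
  case e: NotSerializableException => println(s"failed on: ${e.getMessage}") // prints: failed on: java.lang.Thread
}
```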

monster

1 Answer


Hard to tell without the full source of the GradientDescent class, but you're probably capturing an unserializable value. When you call the method, the closure references `this`, so Spark needs to serialize the full object and send it to the workers; the inlined version doesn't.
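To make the mechanism concrete, here is a plain-JVM sketch (hypothetical names, no Spark required): a closure defined inside a class that calls another instance method captures `this`, so serializing the closure means serializing the whole object, including any unserializable fields. Copying the needed state into a local val first avoids the capture:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Serializable on the class is not enough: the Thread field still cannot be serialized.
class Holder(val handle: Thread) extends Serializable {
  val rate = 0.02
  def step(r: Double): Double = r * rate

  // Calls this.step, so the closure captures the whole Holder instance.
  def methodClosure: Double => Double = r => step(r)

  // Copies the needed field into a local val; the closure captures only a Double.
  def localClosure: Double => Double = {
    val localRate = rate
    r => r * localRate
  }
}

def canSerialize(obj: AnyRef): Boolean =
  try { new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj); true }
  catch { case _: NotSerializableException => false }
```

Here `canSerialize(new Holder(new Thread).methodClosure)` is false, while `canSerialize(new Holder(new Thread).localClosure)` is true; Spark does essentially the same test serialization (via its ClosureCleaner) before shipping a task.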

Dan Osipov
  • Could you explain why a value could be unserialisable? I'm really just wondering why it doesn't work when calling the method `doGD()` but does work when I write the code that is within the method! Thank you for your reply. – monster Mar 24 '15 at 22:58
  • Detailed code that cleans closures is here: https://github.com/apache/spark/blob/ef4ff00f87a4e8d38866f163f01741c2673e41da/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala In summary, any references to the Spark context, or to open resources (sockets, files, etc.), cause the task to not be serializable – Dan Osipov Mar 24 '15 at 23:03
  • Thanks, Dan. However, I'm not too sure what you have linked me to. I have added the Gradient Descent method into my question if you care to take a look. – monster Mar 24 '15 at 23:56
  • Note that dependencies are transitive. It looks like `gradientDescent` references lots of other objects. Are any of them RDDs? Do they reference file/socket handles transitively, etc.? The JVM will attempt to serialize the whole tree of dependencies. – Dean Wampler Mar 25 '15 at 03:05
  • For an overall understanding of Spark serialization, see http://stackoverflow.com/questions/40818001/understanding-spark-serialization/40818002?sfb=2#40818002 – KrazyGautam Nov 26 '16 at 13:00