
I am trying to run an example of the FPGrowth algorithm in Spark, but I am running into an error. This is my code:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}

val transactions: RDD[Array[String]] = sc.textFile("path/transations.txt").map(_.split(" ")).cache()

val fpg = new FPGrowth().setMinSupport(0.2).setNumPartitions(10)

val model = fpg.run(transactions)

model.freqItemsets.collect().foreach { itemset => println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)}

The code works up until the last line, where I get the error:

WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 16, ip-10-0-0-###.us-west-1.compute.internal): 
com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Can not set 
final scala.collection.mutable.ListBuffer field org.apache.spark.mllib.fpm.FPTree$Summary.nodes to scala.collection.mutable.ArrayBuffer
Serialization trace:
nodes (org.apache.spark.mllib.fpm.FPTree$Summary)

I have even tried to use the solution that was proposed here: SPARK-7483

I haven't had any luck with this either. Has anyone found a solution to this? Or does anyone know of a way to just view the results or save them to a text file?
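To be concrete, I would expect something along the lines of this sketch to write the formatted itemsets out as text instead of collecting them (the output path is just a placeholder):

    model.freqItemsets
      .map(itemset => itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
      .saveAsTextFile("path/freq-itemsets")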

Any help would be greatly appreciated!

I also found the full source code for this algorithm: http://mail-archives.apache.org/mod_mbox/spark-commits/201502.mbox/%3C1cfe817dfdbf47e3bbb657ab343dcf82@git.apache.org%3E

RDizzl3
  • I get errors too, even with the simplest example datasets I could come up with; some kind of type-casting error. If you make any progress on yours, please share your findings. Thanks – Geoffrey Anderson Sep 17 '15 at 14:15

3 Answers


Kryo is a faster serializer than org.apache.spark.serializer.JavaSerializer. A possible workaround is to tell Spark not to use Kryo, at least until this bug is fixed. You could modify spark-defaults.conf, but Kryo works fine for the other Spark libraries, so the best option is to modify your context with:

val conf = new org.apache.spark.SparkConf()
  .setAppName("APP_NAME")
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")

Then try running the MLlib code again:

model.freqItemsets.collect().foreach { itemset => println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)}

It should work now.

Luis

I got the same error. It is caused by the Spark version: this is fixed in Spark 1.5.2, but I was using 1.3. I fixed it by doing the following:

  1. I switched from spark-shell to spark-submit and changed the configuration for the Kryo serializer. Here is my code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.fpm.FPGrowth
    import scala.collection.mutable.ArrayBuffer
    import scala.collection.mutable.ListBuffer
    
    object fpgrowth {
      def main(args: Array[String]) {
        // Register the mutable collection classes that FPGrowth serializes internally
        val conf = new SparkConf().setAppName("Spark FPGrowth")
          .registerKryoClasses(
            Array(classOf[ArrayBuffer[String]], classOf[ListBuffer[String]])
          )
    
        val sc = new SparkContext(conf)
    
        val data = sc.textFile("<path to file.txt>")
    
        val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))
    
        val fpg = new FPGrowth()
          .setMinSupport(0.2)
          .setNumPartitions(10)
        val model = fpg.run(transactions)
    
        model.freqItemsets.collect().foreach { itemset =>
          println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
        }
    
      }
    }
    
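Built into a jar, this can be launched with spark-submit along these lines (a sketch: the class name matches the object above, but the jar name and master are placeholders):

    spark-submit \
      --class fpgrowth \
      --master local[*] \
      fpgrowth.jar
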
robinki
user1050325

Set the config below on the command line or in spark-defaults.conf:

    --conf spark.kryo.classesToRegister=scala.collection.mutable.ArrayBuffer,scala.collection.mutable.ListBuffer
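
The same registration can also be done programmatically when building the context, equivalent to the flag above (a sketch; the app name is a placeholder):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("APP_NAME")
      .set("spark.kryo.classesToRegister",
        "scala.collection.mutable.ArrayBuffer,scala.collection.mutable.ListBuffer")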