
I am using FPGrowth in Spark's MLlib to find frequent patterns. Here is my code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

object FPGrowthExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FPGrowthExample")
    val sc = new SparkContext(conf)
    // each line of the input file is one transaction; items are space-separated
    val data = sc.textFile("/user/text").map(s => s.trim.split(" ")).cache()
    val fpg = new FPGrowth().setMinSupport(0.005).setNumPartitions(10)
    val model = fpg.run(data)
    val output = model.freqItemsets.map(itemset =>
      itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
    output.repartition(1).saveAsTextFile("/user/result")
    sc.stop()
  }
}

When the text file has 800,000 lines and each line is treated as one document, Spark throws a StackOverflowError. Here is the error:

java.lang.StackOverflowError
at java.lang.Exception.<init>(Exception.java:102)
at java.lang.ReflectiveOperationException.<init>(ReflectiveOperationException.java:89)
at java.lang.reflect.InvocationTargetException.<init>(InvocationTargetException.java:72)
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:137)
at scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:135)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashTable$class.serializeTo(HashTable.scala:124)
at scala.collection.mutable.HashMap.serializeTo(HashMap.scala:39)
at scala.collection.mutable.HashMap.writeObject(HashMap.scala:135)

Here is my submit script:

/usr/local/webserver/spark-1.5.1-bin-2.6.0/bin/spark-submit --master yarn --deploy-mode cluster \
  --num-executors 30 --driver-memory 30g \
  --executor-memory 30g --executor-cores 10 \
  --conf spark.driver.maxResultSize=10g --class FPGrowthExample project.jar
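
Since the crash is a StackOverflowError thrown during serialization, I have also wondered whether simply raising the JVM thread stack size for the driver and executors would help. I have not verified this, and the -Xss value below is only a guess:

/usr/local/webserver/spark-1.5.1-bin-2.6.0/bin/spark-submit --master yarn --deploy-mode cluster \
  --num-executors 30 --driver-memory 30g \
  --executor-memory 30g --executor-cores 10 \
  --conf spark.driver.maxResultSize=10g \
  --conf spark.driver.extraJavaOptions=-Xss32m \
  --conf spark.executor.extraJavaOptions=-Xss32m \
  --class FPGrowthExample project.jar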

So far I have not found a fix; the same job runs fine when the input has only 1,000 lines.
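
One other thing I have considered, though I am not sure the serialization path shown in the trace is even affected by it, is switching the data serializer to Kryo when the SparkConf is built (this would replace the existing val conf line):

    val conf = new SparkConf()
      .setAppName("FPGrowthExample")
      // guess: route data serialization through Kryo instead of the recursive
      // java.io.ObjectOutputStream calls shown in the stack trace
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")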

  • I think minSupport may be too small when the data set is very large; that could make the FPGrowth computation extremely expensive. – chenqun Jul 01 '16 at 03:26
  • I think one of the libraries is not compatible; either the version is too old or too new. – Patrik Bego Feb 08 '19 at 21:35

0 Answers