I am using FPGrowth in Spark's MLlib to find frequent patterns. Here is my code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

object FPGrowthExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("FPGrowthExample")
    val sc = new SparkContext(conf)
    // Each line is one transaction; tokens are space-separated items.
    val data = sc.textFile("/user/text").map(s => s.trim.split(" ")).cache()
    val fpg = new FPGrowth().setMinSupport(0.005).setNumPartitions(10)
    val model = fpg.run(data)
    val output = model.freqItemsets.map(itemset =>
      itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
    output.repartition(1).saveAsTextFile("/user/result")
    sc.stop()
  }
}
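This is probably unrelated to the error, but for reference: MLlib's FPGrowth rejects transactions that contain duplicate items, so the tokenization might be worth hardening anyway. A minimal variant of the input preparation I considered (the whitespace regex and the dedup are my own additions, not part of the original job):

// Variant preprocessing: split on runs of whitespace, drop empty tokens,
// and deduplicate within a line, since FPGrowth requires unique items
// per transaction.
val data = sc.textFile("/user/text")
  .map(s => s.trim.split("\\s+").filter(_.nonEmpty).distinct)
  .cache()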
When the text has 800,000 lines and each line is treated as one document, Spark throws a StackOverflowError. Here is the error:
java.lang.StackOverflowError
at java.lang.Exception.<init>(Exception.java:102)
at java.lang.ReflectiveOperationException.<init>(ReflectiveOperationException.java:89)
at java.lang.reflect.InvocationTargetException.<init>(InvocationTargetException.java:72)
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:137)
at scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:135)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashTable$class.serializeTo(HashTable.scala:124)
at scala.collection.mutable.HashMap.serializeTo(HashMap.scala:39)
at scala.collection.mutable.HashMap.writeObject(HashMap.scala:135)
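From the trace it looks like the overflow happens while Java serialization recurses through a deeply nested structure (a mutable HashMap, presumably part of the FP-tree). If deep recursion is really the cause, I assume raising the JVM thread stack size for the driver and the executors might help; something like the following extra spark-submit options (the -Xss value of 32m is just a guess, not a tuned number):

--conf "spark.driver.extraJavaOptions=-Xss32m" \
--conf "spark.executor.extraJavaOptions=-Xss32m" \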
Here is my submit script:
/usr/local/webserver/spark-1.5.1-bin-2.6.0/bin/spark-submit --master yarn --deploy-mode cluster \
  --num-executors 30 --driver-memory 30g \
  --executor-memory 30g --executor-cores 10 \
  --conf spark.driver.maxResultSize=10g --class FPGrowthExample project.jar
I don't know how to fix it, and the same job runs fine when the input has only 1,000 lines.
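One more thing I'm unsure about: with 800,000 transactions, a minSupport of 0.005 may yield an enormous number of frequent itemsets, and perhaps that is what blows the serialization stack. As a sanity check I could rerun with a higher threshold (0.05 below is an arbitrary test value, ten times the original):

// Sanity check: a higher minSupport shrinks the FP-tree and the result set.
val fpg = new FPGrowth()
  .setMinSupport(0.05) // arbitrary test value, not the intended threshold
  .setNumPartitions(10)
val model = fpg.run(data)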