
I am running a Spark master and slaves in standalone mode, with no Hadoop cluster. Using spark-shell, I can quickly build an FPGrowthModel from my data. Once the model is built, I try to look at the patterns and frequencies captured in it, but Spark hangs at the collect() step (according to the Spark UI) on a larger dataset (roughly a 200000 x 2000 matrix). Here is the code I run in spark-shell:

import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}
import org.apache.spark.rdd.RDD

// Each line of the input file is one transaction with space-separated items.
val textFile = sc.textFile("/path/to/txt/file")
val data = textFile.map(_.split(" ")).cache()

val fpg = new FPGrowth().setMinSupport(0.9).setNumPartitions(8)
val model = fpg.run(data)

// This is where the job hangs on the large dataset:
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}

I tried increasing the spark-shell memory from 512 MB to 2 GB, but that didn't seem to alleviate the hanging problem. I am not sure whether Hadoop is needed to perform this task, whether I need to increase the spark-shell memory even more, or whether it is something else.

Edit: after removing collect() (as suggested in the first answer below), these errors showed up in the console about six hours into the job:

15/08/10 22:19:40 ERROR TaskSchedulerImpl: Lost executor 0 on 142.103.22.23: remote Rpc client disassociated
15/08/10 22:19:40 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@142.103.22.23:43440] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/08/10 22:19:40 INFO AppClient$ClientActor: Executor updated: app-20150810163957-0001/0 is now EXITED (Command exited with code 137)
15/08/10 22:19:40 INFO TaskSetManager: Re-queueing tasks for 0 from TaskSet 4.0
15/08/10 22:19:40 INFO SparkDeploySchedulerBackend: Executor app-20150810163957-0001/0 removed: Command exited with code 137
15/08/10 22:19:40 WARN TaskSetManager: Lost task 3.0 in stage 4.0 (TID 59, 142.103.22.23): ExecutorLostFailure (executor 0 lost)
15/08/10 22:19:40 WARN TaskSetManager: Lost task 6.0 in stage 4.0 (TID 62, 142.103.22.23): ExecutorLostFailure (executor 0 lost)
15/08/10 22:19:40 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 56, 142.103.22.23): ExecutorLostFailure (executor 0 lost)
15/08/10 22:19:40 WARN TaskSetManager: Lost task 2.0 in stage 4.0 (TID 58, 142.103.22.23): ExecutorLostFailure (executor 0 lost)
15/08/10 22:19:40 WARN TaskSetManager: Lost task 5.0 in stage 4.0 (TID 61, 142.103.22.23): ExecutorLostFailure (executor 0 lost)
15/08/10 22:19:40 WARN TaskSetManager: Lost task 4.0 in stage 4.0 (TID 60, 142.103.22.23): ExecutorLostFailure (executor 0 lost)
15/08/10 22:19:40 WARN TaskSetManager: Lost task 7.0 in stage 4.0 (TID 63, 142.103.22.23): ExecutorLostFailure (executor 0 lost)
15/08/10 22:19:40 WARN TaskSetManager: Lost task 1.0 in stage 4.0 (TID 57, 142.103.22.23): ExecutorLostFailure (executor 0 lost)
15/08/10 22:19:40 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/08/10 22:19:40 INFO AppClient$ClientActor: Executor added: app-20150810163957-0001/1 on worker-20150810163259-142.103.22.23-48853 (142.103.22.23:48853) with 8 cores
15/08/10 22:19:40 INFO SparkDeploySchedulerBackend: Granted executor ID app-20150810163957-0001/1 on hostPort 142.103.22.23:48853 with 8 cores, 15.0 GB RAM
15/08/10 22:19:40 INFO AppClient$ClientActor: Executor updated: app-20150810163957-0001/1 is now LOADING
15/08/10 22:19:40 INFO DAGScheduler: Executor lost: 0 (epoch 2)
15/08/10 22:19:40 INFO AppClient$ClientActor: Executor updated: app-20150810163957-0001/1 is now RUNNING
15/08/10 22:19:40 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
15/08/10 22:19:40 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(0, 142.103.22.23, 37411)
15/08/10 22:19:40 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
15/08/10 22:19:40 INFO ShuffleMapStage: ShuffleMapStage 3 is now unavailable on executor 0 (0/16, false)
emily

3 Answers


You should not call .collect() if the dataset is big (several GB or more): collect() pulls the entire RDD back to the driver, and is only worthwhile when the result is small enough to be reused for several evaluations. Run the foreach loop without collecting.
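
A minimal sketch of the same loop without collect(), assuming the model value from the question. Since foreach runs on the executors, the printed output shows up in each executor's stdout (visible in the Spark UI) rather than in the spark-shell console:

// Iterate over the RDD of frequent itemsets on the executors,
// without pulling the whole result back to the driver.
model.freqItemsets.foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}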

Dr VComas
  • Hi thanks for your help, I removed the collect() method from the code and found printed output in Spark UI stdout section, that's one step forward! However, 6 hours later, there are some error messages in console, I cannot tell if the model has finished iterating through all the items. I have attached them in the original question. Thank you! – emily Aug 11 '15 at 17:25
  • You can try it with a smaller data set first, see if it works, then you will figure out if it's a resources problem or what. – Dr VComas Aug 11 '15 at 18:27
  • thanks! I think my model was too big; I filtered the items and the situation was handled – emily Aug 11 '15 at 18:38

Kryo is a faster serializer than org.apache.spark.serializer.JavaSerializer, but a possible workaround is to tell Spark not to use Kryo and fall back to the Java serializer:

val conf = new org.apache.spark.SparkConf()
  .setAppName("APP_NAME")
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")

Then try running your code from the question again.
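
For the setting to take effect, the conf has to be in place before the SparkContext is created. A rough sketch for a standalone application (in spark-shell, where sc already exists, the same property is normally supplied when the shell is launched):

import org.apache.spark.{SparkConf, SparkContext}

// spark.serializer must be set before the context is constructed;
// changing it on an already running SparkContext has no effect.
val conf = new SparkConf()
  .setAppName("APP_NAME")   // placeholder name from the snippet above
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
val sc = new SparkContext(conf)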

See this link for reference:

FPGrowth Algorithm in Spark

Luis
  • Please don't post the exact same answer to multiple questions: it's either not a good fit for all or the questions are duplicates which should be flagged/closed as such. – kleopatra Sep 28 '15 at 10:46
  • Thanks for both comments... I will edit the answer to include the essential parts and provide the link for reference. – Luis Sep 29 '15 at 11:01

Try replacing collect() with a local iterator (RDD.toLocalIterator), which brings the results to the driver one partition at a time instead of all at once. Ultimately, you may be running into a limitation of the FPGrowth implementation. See my posting here and the corresponding Spark JIRA issue.
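
A minimal sketch of that replacement, assuming the model value from the question:

// toLocalIterator streams the frequent itemsets to the driver one
// partition at a time instead of materializing the whole RDD at once,
// so driver memory is bounded by the largest partition, not the full result.
model.freqItemsets.toLocalIterator.foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}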

Raj