
I'm new to Spark.

I want to parallelize my computations using Spark and a map-reduce approach. But these computations, which I put into a PairFunction implementation for the map stage, require some context to be initialized. This context includes several singleton objects from a 3rd-party jar, and these objects are not serializable, so I cannot ship them to the worker nodes and cannot use them in my PairFunction.

So my question is: can I somehow parallelize a job that requires a non-serializable context using Apache Spark? Are there any other solutions? Maybe I can somehow tell Spark to initialize the required context on every worker node?
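For illustration, here is a rough sketch of the kind of code that fails for me (written in Scala for brevity, while my real code uses a Java PairFunction; class and method names below are made up, the real singletons come from the 3rd-party jar):

import org.apache.spark.{SparkConf, SparkContext}

// Stand-in for the non-serializable 3rd-party context (hypothetical)
class HeavyContext {
  def process(line: String): (String, Int) = (line, line.length)
}

object Example {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("example"))
    val ctx = new HeavyContext() // created on the driver

    // Fails with "Task not serializable" / NotSerializableException:
    // the closure captures ctx, and Spark cannot ship it to the workers.
    sc.textFile("input.txt")
      .map(line => ctx.process(line))
      .saveAsTextFile("output")
  }
}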

pikkvile
  • Your question is a bit ambiguous to me, so I will try to answer based on my understanding of it. Spark has two main execution environments: the driver, where code runs in a normal (non-distributed) way and where you initialize your context and open the SparkContext, and the workers, where the distributed code is executed. – Ahmed Kamal Jan 25 '16 at 14:31
  • My question is about the distributed code that should be executed on the workers. The problem is that this code has to use non-serializable third-party objects, so I cannot instantiate them once on the master and then pass them to the workers over the network. I'm wondering whether there are any workarounds. – pikkvile Jan 25 '16 at 14:49
  • If your code is shipped to the workers, it has to be serializable; there is no workaround for that. If you don't need these objects inside the workers, you can declare them as transient. – Ahmed Kamal Jan 25 '16 at 15:06
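As a minimal illustration of the transient idea from the last comment, a common variant is the @transient lazy val pattern: the field is skipped during serialization and rebuilt lazily on each worker the first time it is accessed there. This is only a sketch with hypothetical names, not code from the thread:

import org.apache.spark.rdd.RDD

// Stand-in for the non-serializable 3rd-party context (hypothetical)
class HeavyContext { def process(line: String): (String, Int) = (line, line.length) }

class Job extends Serializable {
  // Not serialized with the closure; re-created lazily in each executor JVM on first use
  @transient lazy val heavy: HeavyContext = new HeavyContext()

  def run(lines: RDD[String]): RDD[(String, Int)] =
    lines.map(line => heavy.process(line)) // heavy is rebuilt on the worker side
}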

1 Answer


You can try initializing your 3rd-party objects on the executor by using mapPartitions or foreachPartition.

rdd.foreachPartition { iter =>
  // initialize the non-serializable context here, once per partition, on the executor
  val context = new XXX()
  iter.foreach { p =>
    // then you can use context here
  }
}
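If you need to produce a result RDD rather than just run side effects, the same per-partition initialization idea works with mapPartitions (a sketch along the same lines; process is a hypothetical method on XXX):

val result = rdd.mapPartitions { iter =>
  // one instance per partition, created on the executor
  val context = new XXX()
  iter.map(p => context.process(p))
}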
Wilson Liao
  • Thank you. Could you please explain a bit what these RDD methods are doing, exactly? I opened the Spark javadocs, and there aren't many details: "foreachPartition - Applies a function f to each partition of this RDD", and that's it. – pikkvile Jan 25 '16 at 19:12
  • `foreachPartition` executes a function once for each partition; access to the data items contained in that partition is provided via the iterator argument. By contrast, if you initialize a variable on the driver (master), Spark has to serialize that object to send it over to the workers, and it will fail if the object is not serializable. – Wilson Liao Jan 26 '16 at 05:43
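Another common way to get the context "initialized on every worker node", as the question asks, is to keep it in a singleton that is created lazily inside each executor JVM. This is a sketch with hypothetical names, not part of the answer above:

// Lives in each JVM; the closure only references it statically,
// so nothing non-serializable is captured or shipped.
object ContextHolder {
  lazy val context: XXX = new XXX() // hypothetical 3rd-party class
}

rdd.map { p =>
  ContextHolder.context.process(p) // created once per executor JVM, on first access
}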