
I want to serialize a Scalding TypedPipe[MyClass] and deserialize it in Spark 1.5.1.

I am able to serialize/deserialize a "simple" case class containing only "primitives" such as Booleans and Maps, using Kryo and Twitter's Chill for Scala:

//In Scalding
case class MyClass(foo: Boolean) // case class params are vals, and case classes are already Serializable

val data = ... //TypedPipe[MyClass]

import java.io.ByteArrayOutputStream

import com.esotericsoftware.kryo.io.Output
import com.twitter.chill.ScalaKryoInstantiator

def serialize[A](data: A): Array[Byte] = {
  val instantiator = new ScalaKryoInstantiator
  instantiator.setRegistrationRequired(false)
  val kryo = instantiator.newKryo()
  val bao = new ByteArrayOutputStream
  val output = new Output(bao)
  kryo.writeObject(output, data)
  output.close()
  bao.toByteArray
}

data.map(t => (NullWritable.get, new BytesWritable(serialize(t))))
  .write(WritableSequenceFile(outPath))

//In Spark:
import java.io.ByteArrayInputStream

import com.esotericsoftware.kryo.io.Input
import com.twitter.chill.ScalaKryoInstantiator

def deserialize[A](ser: Array[Byte], clazz: Class[A]): A = {
  val instantiator = new ScalaKryoInstantiator
  instantiator.setRegistrationRequired(false)
  val kryo = instantiator.newKryo()
  val input = new Input(new ByteArrayInputStream(ser))
  kryo.readObject(input, clazz)
}

sc.sequenceFile(inPath, classOf[NullWritable], classOf[BytesWritable]).map(_._2)
  .map(t => deserialize(t.get, classOf[MyClass])) //where 'sc' is SparkContext
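As an aside (not something I have wired into the jobs above), Chill also ships a KryoPool that bundles the instantiator/Output/Input plumbing, which could shorten both helpers; a minimal sketch, assuming com.twitter.chill.KryoPool's withByteArrayOutputStream/toBytesWithClass/fromBytes API and an arbitrary pool size:

```scala
import com.twitter.chill.{KryoPool, ScalaKryoInstantiator}

// Pool of reusable Kryo instances; 10 is an arbitrary pool size.
val kryoPool = KryoPool.withByteArrayOutputStream(10, new ScalaKryoInstantiator)

// toBytesWithClass records the class in the stream, so no Class[A]
// argument is needed on the read side.
def serialize(data: AnyRef): Array[Byte] = kryoPool.toBytesWithClass(data)

def deserialize[A](ser: Array[Byte]): A = kryoPool.fromBytes(ser).asInstanceOf[A]
```

The trade-off is a slightly larger payload, since each record carries its class name.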

I am also able to serialize/deserialize a "complex" class whose members are themselves instances of custom classes, either my own or third-party ones (e.g. org.joda.time.LocalDate). I register the classes during serialization and deserialization in the same order, as the Kryo documentation requires, using Kryo's default Serializer:

//In Scalding
class MyClass2(val bar: MyClass, val someDate: LocalDate) extends Serializable

import java.io.ByteArrayOutputStream

import com.esotericsoftware.kryo.io.Output
import com.twitter.chill.ScalaKryoInstantiator
import org.joda.time.LocalDate
import org.joda.time.chrono.{GregorianChronology, ISOChronology}

def serialize[A](data: A): Array[Byte] = {
  val instantiator = new ScalaKryoInstantiator
  instantiator.setRegistrationRequired(false)
  val kryo = instantiator.newKryo()
  kryo.register(classOf[MyClass2])
  kryo.register(classOf[MyClass])
  kryo.register(classOf[LocalDate])
  kryo.register(classOf[ISOChronology])
  kryo.register(classOf[GregorianChronology])
  val bao = new ByteArrayOutputStream
  val output = new Output(bao)
  kryo.writeObject(output, data)
  output.close()
  bao.toByteArray
}

//In Spark
import java.io.ByteArrayInputStream

import com.esotericsoftware.kryo.io.Input
import com.twitter.chill.ScalaKryoInstantiator
import org.joda.time.LocalDate
import org.joda.time.chrono.{GregorianChronology, ISOChronology}

def deserialize[A](ser: Array[Byte], clazz: Class[A]): A = {
  val instantiator = new ScalaKryoInstantiator
  instantiator.setRegistrationRequired(false)
  val kryo = instantiator.newKryo()
  kryo.register(classOf[MyClass2])
  kryo.register(classOf[MyClass])
  kryo.register(classOf[LocalDate])
  kryo.register(classOf[ISOChronology])
  kryo.register(classOf[GregorianChronology])
  val input = new Input(new ByteArrayInputStream(ser))
  kryo.readObject(input, clazz)
}
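One thing worth noting about the order-based matching: Kryo's register method also takes an explicit integer ID, so the write and read side only have to agree on the IDs rather than on call order; a sketch (registerAll is a hypothetical helper of mine, and the IDs are arbitrary — low IDs are taken by Kryo's default primitive registrations, so start higher):

```scala
import com.esotericsoftware.kryo.Kryo
import org.joda.time.LocalDate
import org.joda.time.chrono.{GregorianChronology, ISOChronology}

// Hypothetical helper: call it on both the Scalding (write) side and the
// Spark (read) side so the fixed IDs match.
def registerAll(kryo: Kryo): Unit = {
  kryo.register(classOf[MyClass2], 100)
  kryo.register(classOf[MyClass], 101)
  kryo.register(classOf[LocalDate], 102)
  kryo.register(classOf[ISOChronology], 103)
  kryo.register(classOf[GregorianChronology], 104)
}
```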

a) As noted, this works, but it seems too verbose. Am I missing a simpler way of doing this?

b) When I registered only LocalDate, Spark complained that it didn't "know" ISOChronology. When I registered ISOChronology, it complained that it didn't know GregorianChronology. Once I registered GregorianChronology, Spark stopped complaining and everything worked. Is there a way to register LocalDate "and everything in it"?
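(One direction I have seen suggested for this, though I have not verified it here: register LocalDate with a dedicated serializer so Kryo never walks its chronology internals at all — the third-party de.javakaffee:kryo-serializers library ships a JodaLocalDateSerializer for exactly this.)

```scala
import com.esotericsoftware.kryo.Kryo
// Assumed third-party dependency: de.javakaffee:kryo-serializers
import de.javakaffee.kryoserializers.jodatime.JodaLocalDateSerializer
import org.joda.time.LocalDate

val kryo = new Kryo()
// Writes a LocalDate as a few fields rather than its object graph, so the
// chronology classes should no longer need separate registration.
kryo.register(classOf[LocalDate], new JodaLocalDateSerializer())
```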

Giora Simchoni
  • Why are you writing your own serializers/deserializers? To use the Kryo serializer in Spark, you simply configure it on the SparkContext. Like `val conf = new SparkConf().setName("MyName").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2], ...))`. See https://spark.apache.org/docs/1.5.1/tuning.html – Glennie Helles Sindholt Oct 21 '15 at 09:16
  • That does look like it could help with (a), but only in Spark. Thank you! – Giora Simchoni Oct 21 '15 at 09:43

0 Answers