I want to serialize a Scalding TypedPipe[MyClass] and deserialize it in Spark 1.5.1.
I am able to serialize/deserialize a "simple" case class containing only "primitives" such as Booleans and Maps, using Kryo and Twitter's Chill for Scala:
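//In Scalding
import java.io.ByteArrayOutputStream
import com.esotericsoftware.kryo.io.Output
import com.twitter.chill.ScalaKryoInstantiator
import org.apache.hadoop.io.{BytesWritable, NullWritable}

case class MyClass(val foo: Boolean) extends Serializable {}
val data = ... //TypedPipe[MyClass]

def serialize[A](data: A): Array[Byte] = {
  val instantiator = new ScalaKryoInstantiator
  val kryo = instantiator.newKryo()
  // Set the flag on the Kryo instance itself, so it applies regardless of
  // whether the instantiator's setter mutates or returns a copy
  kryo.setRegistrationRequired(false)
  val bao = new ByteArrayOutputStream
  val output = new Output(bao)
  kryo.writeObject(output, data)
  output.close() // flushes the buffered bytes into bao
  bao.toByteArray()
}

data.map(t => (NullWritable.get, new BytesWritable(serialize(t))))
  .write(WritableSequenceFile(outPath))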
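//In Spark:
import java.io.ByteArrayInputStream
import com.esotericsoftware.kryo.io.Input
import com.twitter.chill.ScalaKryoInstantiator
import org.apache.hadoop.io.{BytesWritable, NullWritable}

def deserialize[A](ser: Array[Byte], clazz: Class[A]): A = {
  val instantiator = new ScalaKryoInstantiator
  val kryo = instantiator.newKryo()
  // Set the flag on the Kryo instance itself, so it applies regardless of
  // whether the instantiator's setter mutates or returns a copy
  kryo.setRegistrationRequired(false)
  val input = new Input(new ByteArrayInputStream(ser))
  kryo.readObject(input, clazz)
}

//where 'sc' is the SparkContext
sc.sequenceFile(inPath, classOf[NullWritable], classOf[BytesWritable]).map(_._2)
  .map(t => deserialize(t.get, classOf[MyClass]))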
I am also able to serialize/deserialize a "complex" class whose members are instances of other classes, either my own or third-party (e.g. org.joda.time.LocalDate). I register the classes in the same order during serialization and deserialization, as the Kryo documentation says to, using Kryo's default serializer:
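//In Scalding
import org.joda.time.LocalDate
import org.joda.time.chrono.{GregorianChronology, ISOChronology}

class MyClass2(val bar: MyClass, val someDate: LocalDate) extends Serializable {}

def serialize[A](data: A): Array[Byte] = {
  val instantiator = new ScalaKryoInstantiator
  val kryo = instantiator.newKryo()
  // Set the flag on the Kryo instance itself, so it applies regardless of
  // whether the instantiator's setter mutates or returns a copy
  kryo.setRegistrationRequired(false)
  // The registration order must match the one used in deserialize
  kryo.register(classOf[MyClass2])
  kryo.register(classOf[MyClass])
  kryo.register(classOf[LocalDate])
  kryo.register(classOf[ISOChronology])
  kryo.register(classOf[GregorianChronology])
  val bao = new ByteArrayOutputStream
  val output = new Output(bao)
  kryo.writeObject(output, data)
  output.close() // flushes the buffered bytes into bao
  bao.toByteArray()
}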
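//In Spark
import org.joda.time.LocalDate
import org.joda.time.chrono.{GregorianChronology, ISOChronology}

def deserialize[A](ser: Array[Byte], clazz: Class[A]): A = {
  val instantiator = new ScalaKryoInstantiator
  val kryo = instantiator.newKryo()
  // Set the flag on the Kryo instance itself, so it applies regardless of
  // whether the instantiator's setter mutates or returns a copy
  kryo.setRegistrationRequired(false)
  // The registration order must match the one used in serialize
  kryo.register(classOf[MyClass2])
  kryo.register(classOf[MyClass])
  kryo.register(classOf[LocalDate])
  kryo.register(classOf[ISOChronology])
  kryo.register(classOf[GregorianChronology])
  val input = new Input(new ByteArrayInputStream(ser))
  kryo.readObject(input, clazz)
}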
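The registration calls repeated in the two blocks above could be kept in one place. This is only a minimal sketch, assuming both the Scalding job and the Spark job can depend on a shared module. `KryoCodec` is a hypothetical name, not something provided by Chill or Kryo, and like the code above it builds a new Kryo instance for every record:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}
import com.twitter.chill.ScalaKryoInstantiator

// Hypothetical helper (not part of Chill or Kryo). It holds the list of
// classes once, so both jobs register the same classes in the same order.
// Serializable so that an instance can be captured in a Spark closure.
class KryoCodec(classes: Seq[Class[_]]) extends Serializable {

  // Builds a Kryo instance with the classes registered in list order.
  private def newKryo(): Kryo = {
    val kryo = (new ScalaKryoInstantiator).newKryo()
    kryo.setRegistrationRequired(false)
    classes.foreach(c => kryo.register(c))
    kryo
  }

  def serialize[A](data: A): Array[Byte] = {
    val bao = new ByteArrayOutputStream
    val output = new Output(bao)
    newKryo().writeObject(output, data)
    output.close() // flushes the buffered bytes into bao
    bao.toByteArray
  }

  def deserialize[A](ser: Array[Byte], clazz: Class[A]): A = {
    val input = new Input(new ByteArrayInputStream(ser))
    newKryo().readObject(input, clazz)
  }
}

// Intended usage with the classes from this question:
//   val codec = new KryoCodec(Seq(classOf[MyClass2], classOf[MyClass],
//     classOf[LocalDate], classOf[ISOChronology], classOf[GregorianChronology]))
//   codec.serialize(myObject)                       // in Scalding
//   codec.deserialize(bytes, classOf[MyClass2])     // in Spark
```

This only removes the duplication; it does not avoid listing every class by hand.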
a) This works, but it seems too verbose. Am I missing a simpler way of doing it?
b) When I registered only LocalDate, Spark complained it didn't "know" ISOChronology. When I registered ISOChronology it complained it didn't know GregorianChronology. I registered GregorianChronology and Spark stopped complaining and everything works. Is there a way to register LocalDate "and everything in it"?