
I use Apache Commons Lang3's SerializationUtils in my code:

SerializationUtils.serialize()

to store instances of a custom class as files on disk, and

SerializationUtils.deserialize(byte[])

to restore them.
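
For context, the write side is roughly the following (a minimal sketch of what I do; the BlockWriter class name and the output path are placeholders):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.commons.lang3.SerializationUtils;

    // Rough sketch of the write side: Block implements java.io.Serializable,
    // and each object is written to its own file on local disk.
    public class BlockWriter {
        public static void serializeTo(Block block, String path) throws IOException {
            byte[] bytes = SerializationUtils.serialize(block);
            Files.write(Paths.get(path), bytes);
        }
    }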

In my local environment (macOS), all serialized files can be deserialized normally and no error occurs. But when I copy these serialized files into HDFS and read them back from HDFS using Spark/Scala, a SerializationException is thrown.

The Apache Commons Lang3 version is:

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.9</version>
    </dependency>

The deserialization code looks like this:

    import org.apache.commons.lang3.SerializationException;
    import org.apache.commons.lang3.SerializationUtils;

    public static Block deserializeFrom(byte[] bytes) {
        try {
            Block b = SerializationUtils.deserialize(bytes);
            System.out.println("b=" + b);
            return b;
        } catch (ClassCastException e) {
            System.out.println("ClassCastException");
            e.printStackTrace();
        } catch (IllegalArgumentException e) {
            System.out.println("IllegalArgumentException");
            e.printStackTrace();
        } catch (SerializationException e) {
            System.out.println("SerializationException");
            e.printStackTrace();
        }
        return null;
    }

The Spark code is:

    val fis = spark.sparkContext.binaryFiles("/folder/abc*.file")
    val RDD = fis.map(x => {
      val content = x._2.toArray()
      val b = Block.deserializeFrom(content)
      ...
    })

All of the code above runs successfully in Spark local mode, but when I run it in YARN cluster mode, an error occurs. The stack trace is below:

org.apache.commons.lang3.SerializationException: java.lang.ClassNotFoundException: com.XXXX.XXXX
    at org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:227)
    at org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:265)
    at com.com.XXXX.XXXX.deserializeFrom(XXX.java:81)
    at com.XXX.FFFF$$anonfun$3.apply(BXXXX.scala:157)
    at com.XXX.FFFF$$anonfun$3.apply(BXXXX.scala:153)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:945)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.XXXX.XXXX
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:686)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1868)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1751)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
    at org.apache.commons.lang3.SerializationUtils.deserialize(SerializationUtils.java:223)

I've checked the length of the loaded byte[]: it is the same whether loaded from local disk or from HDFS. So why can't it be deserialized when read from HDFS?

brucenan
  • It's not reproducible. Moreover, I strongly believe the object you are serializing doesn't support serialization properly; it's not able to resolve the class. – Ram Ghadiyaram Jun 26 '19 at 03:49
  • I've serialized objects to files (stored on local disk) and deserialized them successfully. The code is the same locally and on HDFS, and I even get the same byte[] length. But the results are different. – brucenan Jun 26 '19 at 04:05
  • Interesting... I have used that API many times but never faced any issue. I serialized into HBase and read the same data back. I believe something mysterious is going on in your HDFS serialization. – Ram Ghadiyaram Jun 26 '19 at 04:13
  • Actually, the serialized files in HDFS were copied from local disk, not serialized to HDFS directly. I think the key point is the java.lang.ClassNotFoundException; maybe there's something wrong with the Spark job. – brucenan Jun 26 '19 at 04:19
  • Try serializing directly to HDFS; since the file system semantics are different, it should work then... – Ram Ghadiyaram Jun 26 '19 at 04:27
  • I've compared the local file and the HDFS file; the two files are exactly the same, bit by bit. – brucenan Jun 26 '19 at 08:32
  • @RamGhadiyaram I've solved this problem. It seems to be a Java generics issue in Spark YARN mode. I copied the lang3 source code and changed T to my class. It runs OK now. – brucenan Jun 27 '19 at 06:01

1 Answer


This may be a classloader issue. Suppose your application is deployed to a Java server. The server will have loaded its own classes, including library code it may need, for example SerializationUtils from Apache commons-lang3. When your application is deployed, the server may give it a separate classloader that inherits from the server's classloader. Let's call the server's classloader Cl-S and the deployed application's classloader Cl-A.

At some point the application wants to deserialize an object from a byte[], so it uses org.apache.commons.lang3.SerializationUtils. Cl-A is asked to provide that class. The first time around, Cl-A won't have it, so it has to load it. But a classloader commonly asks its parent for a class before trying to load it itself. Cl-A asks Cl-S whether it happens to have SerializationUtils; if it does, Cl-S returns the class and the application can use it.
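
A quick way to check this is to print the classloaders involved from inside the failing code path (a diagnostic sketch of mine, not code from the question; Block is the application class from the post):

    import org.apache.commons.lang3.SerializationUtils;

    // Diagnostic sketch: print which classloader loaded each class. Calling this
    // from inside deserializeFrom (or any executor-side code) shows whether
    // SerializationUtils and Block were loaded by different classloaders.
    public class ClassLoaderCheck {
        public static void dump() {
            System.out.println("SerializationUtils loaded by: "
                    + SerializationUtils.class.getClassLoader());
            System.out.println("Block loaded by:              "
                    + Block.class.getClassLoader());
            System.out.println("Context classloader:          "
                    + Thread.currentThread().getContextClassLoader());
        }
    }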

Things go wrong when you then perform the deserialization. The deserialize method is generic. This line

    Block b = SerializationUtils.deserialize(bytes);

has the type Block inferred. To reconstruct the object (and eventually cast it to Block), the method must resolve the class Block. When Java looks up that class, it queries the classloader of the code doing the reading, i.e. the classloader that loaded SerializationUtils. That is Cl-S, the server's classloader, which has no knowledge of your application's Block class, so you get a ClassNotFoundException.

The classloader assigned to the application has access to your application's classes and to its parent classloader's classes, but it doesn't work the other way around: the server's classloader can't get classes from your application. Application servers, such as Java EE ones (WildFly, GlassFish, etc.), typically use this to let multiple applications run in the same server while remaining separated, or to implement a module system so certain modules can be shared across applications to reduce their size and memory footprint.
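
If you want to keep binary Java serialization, one common workaround (a sketch of mine, not something from the question's setup) is to read the bytes with an ObjectInputStream whose resolveClass consults the thread context classloader, which on a Spark executor typically does know about the classes in your application jar:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectStreamClass;

    // Sketch of a classloader-aware deserializer: resolveClass() tries the thread
    // context classloader first and falls back to the default lookup.
    public final class ContextAwareDeserializer {

        private static final class ContextAwareObjectInputStream extends ObjectInputStream {
            ContextAwareObjectInputStream(InputStream in) throws IOException {
                super(in);
            }

            @Override
            protected Class<?> resolveClass(ObjectStreamClass desc)
                    throws IOException, ClassNotFoundException {
                ClassLoader ctx = Thread.currentThread().getContextClassLoader();
                try {
                    return Class.forName(desc.getName(), false, ctx);
                } catch (ClassNotFoundException e) {
                    return super.resolveClass(desc); // fall back to the default lookup
                }
            }
        }

        @SuppressWarnings("unchecked")
        public static <T> T deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
            try (ObjectInputStream ois =
                         new ContextAwareObjectInputStream(new ByteArrayInputStream(bytes))) {
                return (T) ois.readObject();
            }
        }
    }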

Serializing and deserializing objects in Java is simple. Just do it yourself, or write a couple of methods for it, rather than dragging in a library that opens you up to opaque issues like this, version conflicts, and bloat.
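
For example, a minimal hand-rolled pair of helpers needs only the JDK (a sketch, assuming Block implements Serializable; the SerDe class name is a placeholder):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    // Minimal JDK-only (de)serialization helpers; no third-party library needed.
    public final class SerDe {

        public static byte[] toBytes(Serializable obj) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(obj);
            }
            return bos.toByteArray();
        }

        @SuppressWarnings("unchecked")
        public static <T> T fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
            try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return (T) ois.readObject();
            }
        }
    }

Because these helpers live in your application, the ObjectInputStream resolves classes via the classloader of the calling code, which avoids the lookup problem described above.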

G_H