
I am using the Caffe library for image detection with the PySpark framework. I can run the Spark program in local mode, where the model lives on the local file system.

But when I want to deploy it in cluster mode, I don't know the correct way to do it. I have tried the following approach:

  1. Adding the files to HDFS and using `sc.addFile()` or `--files` when submitting the job:

    sc.addFile("hdfs:///caffe-public/dataset/test.caffemodel")

  2. Reading the model in each worker node using

    model_weight = SparkFiles.get('test.caffemodel')
    net = caffe.Net(model_define, model_weight, caffe.TEST)
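
Put together, the two steps look roughly like this (a simplified sketch of my job; `deploy.prototxt`, the input RDD `image_paths`, and the per-image scoring call are placeholders for my actual code):

    import caffe
    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="caffe_detection")

    # Step 1: ship the model files to every executor
    sc.addFile("hdfs:///caffe-public/dataset/test.caffemodel")
    sc.addFile("hdfs:///caffe-public/dataset/deploy.prototxt")  # stands in for my model_define file

    def detect(partition):
        # Step 2: rebuild the net on the worker from the local copies
        model_define = SparkFiles.get("deploy.prototxt")
        model_weight = SparkFiles.get("test.caffemodel")
        net = caffe.Net(model_define, model_weight, caffe.TEST)
        for image in partition:
            yield net.forward()  # placeholder: the real code feeds `image` into the net first

    results = sc.parallelize(image_paths).mapPartitions(detect).collect()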

SparkFiles.get() returns the local path of the file on the worker node (not the HDFS one), so I can reconstruct my model from the path it returns. This approach also works fine in local mode; however, in distributed mode it results in the following error:

ERROR server.TransportRequestHandler: Error sending result StreamResponse{streamId=/files/xxx, byteCount=xxx, body=FileSegmentManagedBuffer{file=xxx, offset=0,length=xxxx}} to /192.168.100.40:37690; closing connection
io.netty.handler.codec.EncoderException: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.<init>(Ljava/io/File;JJ)V

It seems like the data is too large to transfer, as discussed in Apache Spark: network errors between executors. However, the model is only around 1 MB.

Update:

I found that if the path passed to sc.addFile(path) is on HDFS, the error does not appear. However, when the path is on the local file system, the error does appear.

My questions are:

  1. Is there anything other than the size of the file that could cause the above exception? (Spark is running on YARN, and I use the default shuffle service, not the external shuffle service.)

  2. If I do not add the file when submitting, how do I read the model file from HDFS using PySpark, so that I can reconstruct the model with the Caffe API? Or is there any way to get the path other than SparkFiles.get()? (A rough sketch of what I mean is after this list.)
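
For question 2, this is the kind of thing I have in mind: pull the model from HDFS onto each worker myself and build the net from that local copy. A rough sketch (it assumes the `hdfs` command-line client is available on every worker node; the paths and the scoring call are just placeholders):

    import os
    import subprocess
    import tempfile

    import caffe

    def load_net_from_hdfs(prototxt_hdfs, weights_hdfs):
        # Copy the model files from HDFS to a local temp dir on this worker,
        # then build the caffe.Net from the local copies.
        tmpdir = tempfile.mkdtemp()
        local_prototxt = os.path.join(tmpdir, os.path.basename(prototxt_hdfs))
        local_weights = os.path.join(tmpdir, os.path.basename(weights_hdfs))
        subprocess.check_call(["hdfs", "dfs", "-get", prototxt_hdfs, local_prototxt])
        subprocess.check_call(["hdfs", "dfs", "-get", weights_hdfs, local_weights])
        return caffe.Net(local_prototxt, local_weights, caffe.TEST)

    def detect(partition):
        net = load_net_from_hdfs("hdfs:///caffe-public/dataset/deploy.prototxt",
                                 "hdfs:///caffe-public/dataset/test.caffemodel")
        for image in partition:
            yield net.forward()  # placeholder for the real scoring code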

Any suggestions will be appreciated!!

  • The reason you're getting this exception is that at runtime Spark picks up an older version of the `netty` library than it expects. Check the `CLASSPATH` of the job. – nonsleepr Apr 12 '17 at 19:44
  • Yes! The problem is that I added two `netty` jars to my `CLASSPATH`. Thanks a lot! – steve Apr 12 '17 at 20:43
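
For anyone who hits the same error: a quick way to check which `netty` jar the driver JVM actually loaded is to ask it through PySpark's py4j gateway (this uses the internal `sc._jvm` handle, so treat it as an unofficial trick rather than a supported API):

    # Ask the driver JVM where the conflicting netty class was loaded from.
    # sc._jvm is PySpark's py4j gateway into the driver JVM (internal API).
    netty_cls = sc._jvm.java.lang.Class.forName("io.netty.channel.DefaultFileRegion")
    print(netty_cls.getProtectionDomain().getCodeSource().getLocation())
    # An old or duplicate netty jar on the CLASSPATH shows up here.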

0 Answers