I am using the Caffe library for image detection with the PySpark framework. I can run the Spark program in local mode, where the model is present in the local file system.
But when I want to deploy it in cluster mode, I don't know the correct way to do it. I have tried the following approach:
Adding the files to HDFS, and using `sc.addFile()` or `--files` when submitting jobs:

```python
sc.addFile("hdfs:///caffe-public/dataset/test.caffemodel")
```
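For reference, the equivalent at submission time looks roughly like this (the script name is a placeholder for my actual job):

```shell
spark-submit --master yarn --deploy-mode cluster \
  --files hdfs:///caffe-public/dataset/test.caffemodel \
  my_detection_job.py
```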
Reading the model in each worker node using:

```python
model_weight = SparkFiles.get('test.caffemodel')
net = caffe.Net(model_define, model_weight, caffe.TEST)
```
`SparkFiles.get()` returns the local file location on the worker node (not the HDFS one), so I can reconstruct my model from the path it returns. This approach also works fine in local mode; however, in distributed mode it results in the following error:
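To give the full picture, this is roughly how my worker-side code is structured (a sketch; `model_define` is a placeholder for the prototxt path available on my workers, and the forward pass is simplified):

```python
def detect_partition(image_paths):
    # Runs on each worker: imports stay inside the function so they are
    # resolved on the executor, where caffe is installed.
    import caffe
    from pyspark import SparkFiles

    # SparkFiles.get() maps the added file name to its local path on
    # this worker node.
    model_weight = SparkFiles.get('test.caffemodel')
    net = caffe.Net(model_define, model_weight, caffe.TEST)
    for path in image_paths:
        yield (path, net.forward())  # placeholder for the real forward pass

# usage (driver side): images_rdd.mapPartitions(detect_partition)
```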
```
ERROR server.TransportRequestHandler: Error sending result StreamResponse{streamId=/files/xxx, byteCount=xxx, body=FileSegmentManagedBuffer{file=xxx, offset=0,length=xxxx}} to /192.168.100.40:37690; closing connection
io.netty.handler.codec.EncoderException: java.lang.NoSuchMethodError: io.netty.channel.DefaultFileRegion.<init>(Ljava/io/File;JJ)V
```
It seems the data is too large to shuffle, as discussed in Apache Spark: network errors between executors. However, the size of the model is only around 1 MB.
Update:
I found that if the path in `sc.addFile(path)` is on HDFS, the error does not appear. However, when the path is in the local file system, the error does appear.
My questions are:

1. Is there any other possibility that could cause the above exception, other than the size of the file? (Spark is running on YARN, and I use the default shuffle service, not the external shuffle service.)
2. If I do not add the file when submitting, how do I read the model file from HDFS using PySpark, so that I can reconstruct the model using the Caffe API? Or is there any way to get the path other than `SparkFiles.get()`?
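For question 2, the only alternative I can think of is reading the raw model bytes with `sc.binaryFiles()` on the driver and broadcasting them, then spilling them to a local temp file on each worker for the Caffe loader. A sketch of that idea (untested on my cluster; names are placeholders):

```python
def load_net_from_hdfs(sc, hdfs_path, model_define):
    # Driver side: read the raw .caffemodel bytes from HDFS and ship
    # them to every worker through a broadcast variable.
    weights = sc.binaryFiles(hdfs_path).values().first()
    bc_weights = sc.broadcast(weights)

    def build_net():
        # Worker side: caffe's loader needs a normal filesystem path,
        # so write the broadcast bytes into a local temp file first.
        import caffe
        import tempfile
        with tempfile.NamedTemporaryFile(suffix='.caffemodel',
                                         delete=False) as f:
            f.write(bc_weights.value)
        return caffe.Net(model_define, f.name, caffe.TEST)

    return build_net
```

Each task would call `build_net()` inside `mapPartitions`, so the model is materialized once per partition rather than once per record. I don't know whether this is more robust than `sc.addFile()`, though.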
Any suggestions would be appreciated!