We use TensorFlow Serving to load the model and have implemented a Java gRPC client.
It normally works for small payloads, but when we send requests with a larger batch size, where the data is roughly 1~2 MB, the server quickly closes the connection and throws an internal error.
We have also opened an issue to track this: https://github.com/tensorflow/serving/issues/284. The failure we see on the Spark side is:
Job aborted due to stage failure: Task 47 in stage 7.0 failed 4 times, most recent failure: Lost task 47.3 in stage 7.0 (TID 5349, xxx)
io.grpc.StatusRuntimeException: INTERNAL: HTTP/2 error code: INTERNAL_ERROR
Received Rst Stream
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:230)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:211)
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:144)
at tensorflow.serving.PredictionServiceGrpc$PredictionServiceBlockingStub.predict(PredictionServiceGrpc.java:160)
......
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:189)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:91)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
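For context, below is a hedged sketch of how the blocking Predict call is structured on our side. The host/port, model name, signature name, input tensor name, and dimensions are all placeholders, and the maxInboundMessageSize setting is only a commonly suggested knob for large gRPC messages, not a confirmed fix for this RST_STREAM failure.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import org.tensorflow.framework.DataType;
import org.tensorflow.framework.TensorProto;
import org.tensorflow.framework.TensorShapeProto;
import tensorflow.serving.Model;
import tensorflow.serving.Predict;
import tensorflow.serving.PredictionServiceGrpc;

public class PredictClientSketch {
    public static void main(String[] args) {
        // Hypothetical endpoint; replace with the real TensorFlow Serving host/port.
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("localhost", 9000)
                .usePlaintext()
                // Raising the client-side inbound message limit is one commonly
                // suggested knob for large responses; whether it helps with the
                // RST_STREAM failure above is not confirmed.
                .maxInboundMessageSize(16 * 1024 * 1024)
                .build();

        PredictionServiceGrpc.PredictionServiceBlockingStub stub =
                PredictionServiceGrpc.newBlockingStub(channel);

        // Build a batched input tensor of shape [batchSize, featureDim];
        // the sizes and the tensor name "inputs" are illustrative only.
        int batchSize = 1024;
        int featureDim = 256;
        TensorProto.Builder tensor = TensorProto.newBuilder()
                .setDtype(DataType.DT_FLOAT)
                .setTensorShape(TensorShapeProto.newBuilder()
                        .addDim(TensorShapeProto.Dim.newBuilder().setSize(batchSize))
                        .addDim(TensorShapeProto.Dim.newBuilder().setSize(featureDim)));
        for (int i = 0; i < batchSize * featureDim; i++) {
            tensor.addFloatVal(0.0f);
        }

        Predict.PredictRequest request = Predict.PredictRequest.newBuilder()
                .setModelSpec(Model.ModelSpec.newBuilder()
                        .setName("my_model")                 // placeholder model name
                        .setSignatureName("serving_default")) // placeholder signature
                .putInputs("inputs", tensor.build())
                .build();

        // The blocking unary call that fails with INTERNAL / Received Rst Stream
        // once the serialized request grows to roughly 1~2 MB.
        Predict.PredictResponse response = stub.predict(request);
        System.out.println(response.getOutputsMap().keySet());
    }
}
```

With small batch sizes this call pattern returns normally; only the larger requests trigger the error above.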