
I have a service that transfers messages at quite a high rate.

Currently it is served by akka-tcp and it achieves 3.5M messages per minute. I decided to give gRPC a try. Unfortunately it resulted in much lower throughput: ~500k messages per minute, or even less.

Could you please recommend how to optimize it?

My setup

Hardware: 32 cores, 24 GB heap.

gRPC version: 1.25.0

Message format and endpoint

A message is basically a binary blob. The client streams 100K–1M (and more) messages into the same request (asynchronously), the server doesn't respond with anything, and the client uses a no-op observer:

service MyService {
    rpc send (stream MyMessage) returns (stream DummyResponse);
}

message MyMessage {
    int64 someField = 1;
    bytes payload = 2;  //not huge
}

message DummyResponse {
}
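
For context, the client is driven roughly like the sketch below (not the actual code; MyServiceGrpc is the grpc-java generated stub, and the endpoint, payload and message count are illustrative):

import com.google.protobuf.ByteString;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.stub.StreamObserver;

public class SenderSketch {
    public static void main(String[] args) {
        // Single channel to the destination (illustrative host/port).
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("server-host", 8080)
                .usePlaintext()
                .build();

        MyServiceGrpc.MyServiceStub stub = MyServiceGrpc.newStub(channel);

        // Server responses are ignored: the no-op observer mentioned above.
        StreamObserver<DummyResponse> noOp = new StreamObserver<DummyResponse>() {
            @Override public void onNext(DummyResponse value) {}
            @Override public void onError(Throwable t) {}
            @Override public void onCompleted() {}
        };

        // One long-lived call; all messages go into the same stream.
        StreamObserver<MyMessage> requestStream = stub.send(noOp);
        for (long i = 0; i < 1_000_000; i++) {
            requestStream.onNext(MyMessage.newBuilder()
                    .setSomeField(i)
                    .setPayload(ByteString.copyFromUtf8("small binary blob"))
                    .build());
        }
        requestStream.onCompleted();
    }
}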

Problems: The message rate is low compared to the akka implementation. I observe low CPU usage, so I suspect that the gRPC call is actually blocking internally despite claiming otherwise. Calling onNext() indeed doesn't return immediately, but GC is also a factor.

I tried to spawn more senders to mitigate this issue but didn't get much improvement.
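
For reference, a flow-control-aware sender in grpc-java looks roughly like the sketch below (based on grpc-java's manual flow control pattern; hasMoreMessages() and nextMessage() are hypothetical helpers standing in for the real message source):

import io.grpc.stub.ClientCallStreamObserver;
import io.grpc.stub.ClientResponseObserver;

// Sketch only: throttle the sender using gRPC's outbound readiness signal instead of
// pushing messages regardless of how much the framer has to buffer.
abstract class FlowControlledSender {
    // Hypothetical helpers: supply the next message / signal the end of the batch.
    abstract boolean hasMoreMessages();
    abstract MyMessage nextMessage();

    void send(MyServiceGrpc.MyServiceStub stub) {
        stub.send(new ClientResponseObserver<MyMessage, DummyResponse>() {
            private boolean completed = false;

            @Override
            public void beforeStart(ClientCallStreamObserver<MyMessage> requestStream) {
                // Runs every time the transport can accept more data without extra buffering.
                requestStream.setOnReadyHandler(() -> {
                    while (requestStream.isReady() && hasMoreMessages()) {
                        requestStream.onNext(nextMessage());
                    }
                    if (!hasMoreMessages() && !completed) {
                        completed = true;
                        requestStream.onCompleted();
                    }
                });
            }

            @Override public void onNext(DummyResponse value) {}
            @Override public void onError(Throwable t) {}
            @Override public void onCompleted() {}
        });
    }
}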

My findings: gRPC actually allocates an 8KB byte buffer on each message when it serializes it. See the stack trace:

java.lang.Thread.State: BLOCKED (on object monitor)
    at com.google.common.io.ByteStreams.createBuffer(ByteStreams.java:58)
    at com.google.common.io.ByteStreams.copy(ByteStreams.java:105)
    at io.grpc.internal.MessageFramer.writeToOutputStream(MessageFramer.java:274)
    at io.grpc.internal.MessageFramer.writeKnownLengthUncompressed(MessageFramer.java:230)
    at io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:168)
    at io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:141)
    at io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:53)
    at io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
    at io.grpc.internal.DelayedStream.writeMessage(DelayedStream.java:252)
    at io.grpc.internal.ClientCallImpl.sendMessageInternal(ClientCallImpl.java:473)
    at io.grpc.internal.ClientCallImpl.sendMessage(ClientCallImpl.java:457)
    at io.grpc.ForwardingClientCall.sendMessage(ForwardingClientCall.java:37)
    at io.grpc.ForwardingClientCall.sendMessage(ForwardingClientCall.java:37)
    at io.grpc.stub.ClientCalls$CallToStreamObserverAdapter.onNext(ClientCalls.java:346)

Any help with best practices for building high-throughput gRPC clients is appreciated.

simpadjo
  • Are you using Protobuf? This code path should only be taken if the InputStream returned by MethodDescriptor.Marshaller.stream() does not implement Drainable. The Protobuf Marshaller does support Drainable. If you are using Protobuf, is it possible a ClientInterceptor is changing the MethodDescriptor? – Eric Anderson Nov 08 '19 at 17:08
  • @EricAnderson thank you for your response. I tried the standard protobuf with gradle (com.google.protobuf:protoc:3.10.1, io.grpc:protoc-gen-grpc-java:1.25.0) and also `scalapb`. Probably this stack trace was indeed from scalapb-generated code. I removed everything related to scalapb but it didn't help much wrt performance. – simpadjo Nov 08 '19 at 17:29
  • @EricAnderson I solved my problem. Pinging you as a developer of grpc. Does my answer make sense? – simpadjo Nov 13 '19 at 12:32

3 Answers


I solved the issue by creating several ManagedChannel instances per destination. Although articles say that a ManagedChannel can spawn enough connections itself, so one instance is enough, that wasn't true in my case.
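
Roughly, the approach looks like the sketch below (the channel count, endpoint and round-robin picker are illustrative rather than my exact code):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: open N independent channels (each maintaining its own HTTP/2 connection) to the
// same destination and spread the senders across them round-robin.
final class ChannelPool {
    private final List<ManagedChannel> channels = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    ChannelPool(String host, int port, int size) {
        for (int i = 0; i < size; i++) {
            channels.add(ManagedChannelBuilder.forAddress(host, port)
                    .usePlaintext()
                    .build());
        }
    }

    ManagedChannel pick() {
        // Simple round-robin over the pooled channels.
        return channels.get(Math.floorMod(next.getAndIncrement(), channels.size()));
    }
}

// Usage: each sender builds its stub on a different channel, e.g.
// MyServiceGrpc.MyServiceStub stub = MyServiceGrpc.newStub(pool.pick());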

Performance is on par with the akka-tcp implementation.

simpadjo
  • ManagedChannel (with built-in LB policies) does not use more than one connection per backend. So if you are high-throughput with few backends it is possible to saturate the connections to all the backends. Using multiple channels can increase performance in those cases. – Eric Anderson Nov 14 '19 at 00:04
  • @EricAnderson thanks. In my case spawning several channels even to a single backend node has helped – simpadjo Nov 14 '19 at 09:35
  • The fewer the backends and the higher the bandwidth, the more likely you need multiple channels. So "single backend" would make it more likely more channels is helpful. – Eric Anderson Nov 15 '19 at 20:02
  • @simpadjo I am interested in understanding how you are creating & managing multiple channels for the same backend. Thanks! – sonam Jul 11 '23 at 11:59

Interesting question. Computer network packets are encoded using a stack of protocols, and each protocol is built on top of the specification of the previous one. Hence the performance (throughput) of a protocol is bounded by the performance of the one used to build it, since you are adding extra encoding/decoding steps on top of the underlying one.

For instance, gRPC is built on top of HTTP/2, which is a protocol at the application layer, or L7, and as such its performance is bound by the performance of HTTP. Now HTTP itself is built on top of TCP, which is at the transport layer, or L4, so we can deduce that gRPC throughput cannot be higher than that of equivalent code served at the TCP layer.

In other words: if your server is able to handle raw TCP packets, how would adding new layers of complexity (gRPC) improve performance?

Batato
  • For exactly that reason I use a streaming approach: I pay once for establishing an HTTP connection and send ~300M messages using it. It uses websockets under the hood, which I expect to have relatively low overhead. – simpadjo Nov 09 '19 at 12:38
  • For `gRPC` you also pay once for establishing a connection, but you have added the extra burden of parsing protobuf. Anyway it's hard to make guesses without more information, but I would bet that, in general, since you are adding extra encoding/decoding steps to your pipeline, the `gRPC` implementation would be slower than the equivalent web socket one. – Batato Nov 10 '19 at 15:26
  • Akka adds some overhead as well. Anyway, a 5x slowdown looks like too much. – simpadjo Nov 10 '19 at 16:51
  • I think you may find this interesting: https://github.com/REASY/akka-http-vs-akka-grpc, in his case (and I think this extends to yours), the bottleneck may be due to high memory usage in protobuf (de)serialization, which in turn triggers more calls to the garbage collector. – Batato Dec 24 '19 at 11:23
  • Just out of curiosity, what was your issue after all? – Batato Dec 25 '19 at 16:21
  • see my answer to this question – simpadjo Dec 25 '19 at 20:58

I'm quite impressed with how well Akka TCP has performed here :D

Our experience was slightly different. We were working on much smaller instances using Akka Cluster. For Akka remoting, we switched from Akka TCP to UDP using Artery and achieved a much higher rate plus lower and more stable response times. There is even a config option in Artery that helps balance CPU consumption against response time from a cold start.

My suggestion is to use some UDP-based framework which also takes care of transmission reliability for you (e.g. Artery UDP), and just serialize using Protobuf, instead of using full-fledged gRPC. The HTTP/2 transmission channel is not really meant for high-throughput, low-latency purposes.