
I'm calling a gRPC service from Python that responds with a stream of about a million messages. At the moment I'm using a list comprehension to access the one attribute I need from each message:

stub = QueryStub(grpc_channel)
return [response.attribute_i_need for response in stub.ResponseMethod]

Accessing around a million attributes this way takes a while (around 2-3 minutes). Is there a way I can speed this up? I'm interested in how people process such scenarios faster. I have also tried using list(stub.ResponseMethod) and [*stub.ResponseMethod] to unpack or retrieve the objects faster; however, these approaches take even longer, since the messages carry a lot of other metadata I don't need and it's all being stored.

PS: I don't necessarily need to store the attributes in memory; accessing them faster is what I'm trying to achieve.

asleniovas
  • I am not an expert in the grpc library, but usually the bottleneck in such scenarios is waiting for responses. This is typically solved by an asynchronous implementation. So maybe search for "async grpc" to get your answer. – Carlos Horn Feb 23 '22 at 18:33
  • @ClémentJean the iterator objects returned in my loop are `<_MultiThreadedRendezvous object>` with repeated fields (11 fields total; I need only 1, as outlined). This answer suggests regular looping - https://stackoverflow.com/questions/63413200/what-is-multithreadedrendezvous-in-grpc-and-how-to-parse-it - which takes a while when you have a million of these objects, but I would expect it to be faster given I'm just accessing one attribute/field and not performing any computation in the loop. – asleniovas Feb 24 '22 at 10:57
  • @asleniovas the thing is, protobufs need to be deserialized (I'm not sure about their deserialization algorithm), and it's well known that the Python implementation is much slower at that than other languages. I'm going to tinker with it and let you know – Clément Jean Feb 24 '22 at 11:32

2 Answers

1

According to this documentation, I would say you need to try two things:

  • working with the asyncio API (if that's not already done) by doing something like:
async def run(stub: QueryStub) -> None:
    async for response in stub.ResponseMethod(empty_pb2.Empty()):
        print(response.attribute_i_need)

Note that Empty() is used only because I do not know your API definition.

  • second would be to try the experimental feature SingleThreadedUnaryStream (if applicable to your case) by doing:
with grpc.insecure_channel(target='localhost:50051', options=[(grpc.experimental.ChannelOptions.SingleThreadedUnaryStream, 1)]) as channel:
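As an aside, the `async for` extraction pattern itself can be demonstrated with a self-contained toy, where `FakeResponse` and `fake_stream` stand in for the real response messages and `stub.ResponseMethod(...)` (these names are illustrative, not part of gRPC):

```python
import asyncio

class FakeResponse:
    """Stand-in for a gRPC response message with many fields."""
    def __init__(self, value):
        self.attribute_i_need = value
        self.other_metadata = "ignored"

async def fake_stream(n):
    # Plays the role of stub.ResponseMethod(...): an async iterator of messages.
    for i in range(n):
        yield FakeResponse(i)

async def collect(n):
    # The async-for extraction pattern: pull only the attribute you need.
    return [msg.attribute_i_need async for msg in fake_stream(n)]

result = asyncio.run(collect(5))
print(result)  # [0, 1, 2, 3, 4]
```

Swapping `fake_stream(n)` for the real streaming call keeps the client code identical; only the channel and stub setup changes.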

What I tried

I don't really know if it covers your use case (you can give me more info on that and I'll update), but here is what I tried:

I have a schema like:

service TestService {
  rpc AMethod(google.protobuf.Empty) returns (stream Test) {} // stream is optional, I tried with both
}

message Test {
  repeated string message = 1;
  repeated string message2 = 2;
  repeated string message3 = 3;
  repeated string message4 = 4;
  repeated string message5 = 5;
  repeated string message6 = 6;
  repeated string message7 = 7;
  repeated string message8 = 8;
  repeated string message9 = 9;
  repeated string message10 = 10;
  repeated string message11 = 11;
}

on the server side (with asyncio) I have

async def AMethod(self, request: empty_pb2.Empty, unused_context) -> AsyncIterable[Test]:
    test = Test()
    for _ in range(10):
        test.message.append(randStr())
    # optionally repeat the append for the other fields
    for _ in range(1000000):
        yield test

where randStr creates a random string of length 10000 (totally arbitrary).
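The answer doesn't show `randStr`; a plausible sketch of such a helper (an assumption, not the author's actual code) would be:

```python
import random
import string

def randStr(length: int = 10000) -> str:
    # One possible implementation of the randStr helper mentioned above;
    # any random-string generator of the given length would do.
    return "".join(random.choices(string.ascii_letters, k=length))

print(len(randStr(10)))  # 10
```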

and on the client side (with SingleThreadedUnaryStream and asyncio)

async def run(stub: TesterStub) -> None:
    tests = stub.AMethod(empty_pb2.Empty())

    async for test in tests:
        print(test.message)

Benchmark

Note: This might vary depending on your machine

For the example with only one repeated field filled, I get an average (ran it 3 times) of 77 sec.

And with all the fields filled it takes really long, so I tried providing smaller strings (length 10) and it still took too long. I think mixing repeated fields and stream is not a good idea. I also tried without stream and got an average (ran 3 times) of 45 sec.

My conclusion

This is really slow when all the repeated fields are filled with data, and ok-ish when only one is filled. But overall I think asyncio helps.

Furthermore, this documentation explains that Protocol Buffers are not designed to handle large messages; however, they are great for handling individual messages within a large data set.

I would suggest that, if I got your schema right, you rethink the API design, because it seems suboptimal.

But once again, I might not have understood the schema properly.

Clément Jean
  • Please be aware that just putting synchronous code into an `async` function doesn't magically make the workload itself async, as done in the first `run` function. There would definitely have to be an `async for` loop/comprehension to benefit from asynchronous requests and potentially an `async with` to asynchronously manage the channel. – MisterMiyagi Feb 24 '22 at 16:42
  • @MisterMiyagi that is exactly why, in the code I show after, I have an `async for`. Updated the first code snippet. – Clément Jean Feb 24 '22 at 16:44
-1

I would advise you to loop through the object using a for loop if you haven't already done so. But something needs to be said about that: everything you put in a loop gets executed on every iteration. The key to optimizing loops is to minimize what they do; even operations that appear to be very fast take a long time if repeated many times. Executing an operation that takes 1 microsecond a million times will take 1 second to complete.

Don't execute things like len(list) inside a loop or even in its loop condition.

Example:

a = [i for i in range(1000000)]
length = len(a)
for i in a:
    print(i - length)

is faster than

a = [i for i in range(1000000)]
for i in a:
    print(i - len(a))

You can also use techniques like loop unrolling (https://en.wikipedia.org/wiki/Loop_unrolling), a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, an approach known as a space-time tradeoff.
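In Python, an unrolled loop can be sketched as below; note that CPython rarely benefits much from manual unrolling, so measure before relying on it:

```python
def sum_unrolled(values):
    # Manually unrolled by a factor of four; the second loop handles
    # any leftover elements when len(values) is not a multiple of four.
    total = 0
    n = len(values)
    i = 0
    while i + 4 <= n:
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
        i += 4
    while i < n:
        total += values[i]
        i += 1
    return total

print(sum_unrolled(list(range(10))))  # 45
```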

Using functions like map, filter, etc. instead of explicit for loops can also provide some performance improvements.
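As one sketch of this idea applied to the question's scenario (the message objects here are toy stand-ins, and `attribute_i_need` is the attribute name from the question):

```python
from operator import attrgetter
from types import SimpleNamespace

# Toy objects standing in for the gRPC response messages in the question.
messages = [SimpleNamespace(attribute_i_need=i, other=None) for i in range(5)]

# map + attrgetter pushes the per-item attribute lookup into C code,
# which can shave a little time off a plain comprehension on large streams.
values = list(map(attrgetter("attribute_i_need"), messages))
print(values)  # [0, 1, 2, 3, 4]
```

The gains are usually modest compared to the deserialization cost, so it is worth profiling before assuming this is the bottleneck.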

PluggeRo