
I'd like to understand whether there's a mechanism to control the size of the batches sent from the server to the client.

I've implemented the Python server from the GitHub repo and a basic F# client.

As a test, I've added a flight containing 1 million rows which I'd like to send back to the client. Initially, the client fails with the following gRPC exception:

One or more errors occurred. (Status(StatusCode="ResourceExhausted", Detail="Received message exceeds the maximum configured message size."))

As the message says, the maximum message size has been exceeded. As a workaround, I can set the maximum allowed gRPC message size to unlimited, i.e.

let ops = new GrpcChannelOptions()
// Setting MaxReceiveMessageSize to null removes the receive limit entirely
ops.MaxReceiveMessageSize <- Nullable()
let downloadChannel = GrpcChannel.ForAddress(uri, ops)
let downloadClient = new FlightClient(downloadChannel)

However, I'd like to understand whether there's a way to set the batch size on the server side, i.e. in the server's do_get method:

def do_get(self, context, ticket):
    key = ast.literal_eval(ticket.ticket.decode())
    if key not in self.flights:
        return None
    # Streams the stored table back using its existing chunking
    return pyarrow.flight.RecordBatchStream(self.flights[key])

I'd like to set the batch size when creating pyarrow.flight.RecordBatchStream, but looking at the documentation, the options specified via pyarrow.ipc.IpcWriteOptions don't appear to allow the batch size to be set.

Thanks in advance for any help :)

UPDATE: see the accepted answer below, which put me on the right path. I've updated my code as follows to fix the issue.

def do_get(self, context, ticket):
    key = ast.literal_eval(ticket.ticket.decode())
    if key not in self.flights:
        return None
    # Re-chunk the table; max_chunksize is a row count (10,000 is just an example value)
    batches = self.flights[key].to_batches(max_chunksize=10_000)
    reader = pyarrow.ipc.RecordBatchReader.from_batches(self.flights[key].schema, batches)
    return pyarrow.flight.RecordBatchStream(reader)

1 Answer


Assuming self.flights[key] is a pyarrow.Table, you can re-chunk it ahead of time with Table.to_batches. (This won't copy data, it'll just re-slice the underlying arrays.)
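
For example, a minimal sketch of the re-chunking (the 10,000-row limit is just an illustrative value):

import pyarrow as pa

table = pa.table({"x": range(1_000_000)})
# Zero-copy re-slice into batches of at most 10,000 rows each
batches = table.to_batches(max_chunksize=10_000)
print(len(batches), batches[0].num_rows)  # 100 10000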

Note that the size is in rows, which, depending on the data type, may not correspond well to bytes; this is an unfortunate mismatch. You can use get_total_buffer_size to (cheaply) estimate the byte size and split batches further as needed (though if you have something like a single 4 MB string, you're out of luck).
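
For instance, here's a rough sketch of byte-based splitting on top of the row-based chunking, assuming a hypothetical 1 MB target:

import pyarrow as pa

TARGET_BYTES = 1 << 20  # ~1 MB per batch; arbitrary example target

def split_by_bytes(batch: pa.RecordBatch):
    # get_total_buffer_size cheaply estimates the batch's byte footprint
    size = batch.get_total_buffer_size()
    if size <= TARGET_BYTES or batch.num_rows <= 1:
        yield batch
        return
    # Estimate how many rows fit under the target and slice accordingly
    rows = max(1, batch.num_rows * TARGET_BYTES // size)
    for start in range(0, batch.num_rows, rows):
        yield batch.slice(start, rows)  # slicing is zero-copy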

li.davidm