My application uses protocol buffers and handles a large number (100 million) of simple messages. Based on callgrind analysis, one memory allocation and one deallocation are performed for each message instance.
Consider the following representative example:
// .proto
syntax = "proto2";
package testpb;

message Top {
    message Nested {
        optional int32 val1 = 1;
        optional int32 val2 = 2;
        optional int32 val3 = 3;
    }
    repeated Nested data = 1;
}
// .cpp
#include <fstream>
#include "test.pb.h"  // generated header; actual name depends on the .proto filename

void test()
{
    testpb::Top top;
    for (int i = 0; i < 100'000; ++i) {
        auto* data = top.add_data();
        data->set_val1(i);
        data->set_val2(i * 2);
        data->set_val3(i * 3);
    }
    std::ofstream ofs{"file.out", std::ios::out | std::ios::trunc | std::ios::binary};
    top.SerializeToOstream(&ofs);
}
What is the most effective way to change the implementation so that the number of memory allocations does not grow linearly with the number of Nested instances?
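For reference, the option I have come across so far is protobuf arena allocation. Below is a rough, untested sketch of how I understand it would apply here; I'm assuming the generated header is named test.pb.h, and older protoc releases may additionally require `option cc_enable_arenas = true;` in the .proto.

// .cpp (arena variant, untested sketch)
#include <fstream>
#include <google/protobuf/arena.h>
#include "test.pb.h"  // generated header; actual name depends on the .proto filename

void test_arena()
{
    google::protobuf::Arena arena;
    // Top and every Nested child are carved out of the arena's large
    // blocks instead of being individually heap-allocated.
    // (Arena::Create<testpb::Top>(&arena) in newer protobuf releases.)
    auto* top = google::protobuf::Arena::CreateMessage<testpb::Top>(&arena);
    for (int i = 0; i < 100'000; ++i) {
        auto* data = top->add_data();
        data->set_val1(i);
        data->set_val2(i * 2);
        data->set_val3(i * 3);
    }
    std::ofstream ofs{"file.out", std::ios::out | std::ios::trunc | std::ios::binary};
    top->SerializeToOstream(&ofs);
    // Everything owned by the arena is released in bulk when it goes out of scope.
}

My understanding is that the arena serves many messages from a few large blocks, so the allocation count should scale with total bytes rather than with the number of Nested instances, but I'd welcome confirmation or a more effective alternative.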