
I've got two simple code blocks that have vastly different performance:

void testProto() {
  demo::Person* person = new demo::Person();
  person->set_data(data, BUFFER_LEN);
}

void testMemcpy() {
  demo::Person* person = new demo::Person();
  memcpy(memcpy_dest, data, BUFFER_LEN);
}

The proto file looks like this:

message Person {
  bytes data = 1;
}

According to the Protobuf encoding docs, setting length-delimited data seems as simple as copying data with a few header bytes. Why is it that the first function takes 5-10x more time than the second?

I made a full, easy to run example here.

Additional notes/context:

  • Flatbuffers, an alternative to protobufs, does not have this problem
  • Here's my attempt at using a debugger. I can't step below the Set method.
  • The reason this performance matters to me is that I’m converting some high throughput/low latency networking code over to protobufs. Since I’m running code like the above multiple times per packet, protobufs significantly hurt performance.
  • I’m running at -O3, but even at -O0, there’s still a huge performance difference
  • Function call overhead is not the problem, because the poor performance scales with the size of the data. Function calls would only add a constant overhead.
  • I’ve tried a variety of ways to ensure that the memcpy is not optimized away (-O0, using the array). I’m pretty confident that the memcpy is not optimized away.
  • I tried malloc inside of testMemcpy. That slowed things down a bit, but it's still at least 5x worse.
  • I tried this on a MacBook M1 and an Ubuntu Intel machine
theicfire
    Are you using optimised code? – Alan Birtles Dec 20 '21 at 22:39
  • Yes, you can see that in the example I've linked. I'm running -O3. – theicfire Dec 20 '21 at 22:43
  • memcpy is generally implemented as a compiler intrinsic, while set_data is layers of out-of-line function calls. Does either one have *problematically* slow performance? – Sneftel Dec 20 '21 at 22:44
  • An inlined memcpy of known size and correct alignment can copy perhaps 8 or 16 bytes at a time. – BoP Dec 20 '21 at 22:45
  • 3
    All relevant details need to be in the question itself not in external links – Alan Birtles Dec 20 '21 at 22:45
  • Problematically slow: Yes, I'm replacing some high-throughput networking with protobufs and I'm finding that this is a bottleneck. – theicfire Dec 20 '21 at 22:48
  • 3
    What is the `memcpy` copying and to where? It's clearly not a `demo::Person`. Perhaps the `memcpy` is just optimized away because noone looks at the result? – Ted Lyngmo Dec 20 '21 at 22:48
  • Function calls: that likely is not the problem, because the poor performance scales with the size of the data; function calls would only add a constant overhead. I'll update the description to clear that up. – theicfire Dec 20 '21 at 22:48
  • The code is not optimized away. Here's how I know: running with -O0 still has the same 5-10x difference in performance. In addition, regardless of what I do with the memcpy to make it not optimized away, it's still fast. – theicfire Dec 20 '21 at 22:49
  • I've taken many of these guesses myself :). No, swapping the order does not change the outcome. – theicfire Dec 20 '21 at 22:51
  • It seems like there's something in the *implementation* of protobufs that is not simply a memcpy. I just don't know what, though. I tried attaching a debugger but it didn't go deep enough to give me a helpful answer. I may try again now that I have a simple example though. – theicfire Dec 20 '21 at 22:52
  • Also responding to @BoP - why would protobufs not be able to have the same benefits of 8 or 16 bytes at a time? – theicfire Dec 20 '21 at 22:53
  • Looking inside the generated protobuf header it looks like it does quite a lot. I think the answer to why it's slower is in there somewhere. – Ted Lyngmo Dec 20 '21 at 22:57
  • @TedLyngmo yeah I've tried digging into the protobuf code but it's pretty challenging. Ultimately there's a `SetBytes` call that just calls `Set`, and I lose the trail there because there are so many `Set` functions. – theicfire Dec 20 '21 at 23:02
  • @theicfire - I am just guessing here, but adding a header to the protobuf would make the rest of the copying unaligned. And not being inlined would lose the advantage of having a known constant size. `testMemcpy` has all the advantages. – BoP Dec 20 '21 at 23:06
  • 4
    Protobuf is a lot more than just byte copying. For example, your memcpy already has a buffer (static!) to copy into. If I modify your benchmark so memcpy also has to allocate a place to store the bytes, like protobuf does, then the difference becomes much smaller. Still, proto is ~2x slower than memcpy - or in concrete terms, about 127 microseconds slower per iteration. Given that it also manages an allocator arena, varint encodes length, and tracks other message headers, this seems somewhat reasonable. If your bottleneck is copying single byte buffers around, protobuf is not the fastest. – GManNickG Dec 20 '21 at 23:08
  • @GManNickG thanks for the idea! I thought about this too but wasn't able to see what you're seeing. I tried `std::vector dest(BUFFER_LEN)` and copying to `dest.data()`, and still see a 5-10x difference. What changes did you make? – theicfire Dec 20 '21 at 23:13
  • fwiw protobuf's arena allocation doesn't help: https://developers.google.com/protocol-buffers/docs/reference/arenas – theicfire Dec 20 '21 at 23:15
  • @BoP humm, thanks for the insight. I added `+1`, and `+2` to the memcpy to try to prevent that (I think that would?). It's still 5-10x faster than protobuf. – theicfire Dec 20 '21 at 23:28
  • @theicfire Let's move to a chat and I can give you some profiling tips. Wish this thing would pop up with the "want to start a chat?" thing, though... – GManNickG Dec 20 '21 at 23:53
  • @GManNickG how do I start a chat? I've never used chat.stackoverflow.com before! – theicfire Dec 20 '21 at 23:59
  • Well, usually if you reply back and forth enough it just gives you a link, but in this case I guess not. https://chat.stackoverflow.com/rooms/240296/why-is-protobuf-5-10x-slower-than-memcpy-for-a-list-of-bytes – GManNickG Dec 21 '21 at 00:00
  • 1
    With the help of GManNickG, we found out that the main issue with my demo is that memory is not being freed in `testProto`. Freeing that memory results in a 4x speedup. It's still unclear to me why protobufs is slower than memcpy (and gets even slower the bigger the message), but the 4x improvement is enough for me at the moment! – theicfire Dec 21 '21 at 01:15

1 Answer


The code in your benchmark is invalid. The program is ill-formed.
If it wasn't, the as-if rule would apply.
There is no observable difference in behavior between calling your testMemcpy() function and doing nothing at all. (Besides allocating memory that cannot be deallocated; that can be ignored, it's undefined behavior.)

viraltaco_
  • We discussed this in the comments. This is true, but even if it's correctly formed such that we ensure `testMemcpy` is not optimized out, the performance difference persists. There's something else going on. – theicfire Dec 20 '21 at 23:41
  • I also mentioned this in the question: "I’ve tried a variety of ways to ensure that the memcpy is not optimized away (-O0, using the array). I’m pretty confident that the memcpy is not optimized away." – theicfire Dec 20 '21 at 23:42
  • 1
    @ViralTaco_ Where is the undefined behavior in the given problem? Posting an answer that says, without demonstration, "the code is invalid" is not useful. – GManNickG Dec 20 '21 at 23:45
  • -O0 doesn't change the standard. There is NO OBSERVABLE difference in behavior. Even if the compiler generates assembly, and it does get executed, and that assembly is x86-64, the processor will be able to optimize useless instructions away. (however, that's, massively, off-topic). My point was: you cannot use that code as a benchmark. Please: 0) Either deallocate the memory or don't dynamically allocate it. 1) Implement observable behavior, so you can measure something. @GManNickG example of what? I explained the reason, in parentheses, at the end. It's unrelated to the question. – viraltaco_ Dec 20 '21 at 23:56
  • 1
    "The program is ill-formed." and "it's undefined behavior" have actual meaning. The problem appears well-formed to me (and my compiler), and contains no undefined behavior that I see. – GManNickG Dec 20 '21 at 23:59
  • Is it well-formed? What's the behavior of a program with variables for which the lifetime ends after program execution does? This is a genuine question. I don't remember, but I believe (if my memory serves me right) the standard mentions it, somewhere (might be in program execution [termination?], rather than object lifetime). The point is: why is it dynamically allocated, and why is that memory then left to rot? It's (likely) a programming error. – viraltaco_ Dec 21 '21 at 01:37
  • @ViralTaco_ You're allowed to leak whatever you want, even things with destructors, permitted you do not rely on those destructors to induce well-defined behavior (e.g. a destructor that sets some index that some other destructor uses to go into an array). Either way, we live in the Real World and this doesn't explain the latency discrepancy of the produced binary in any concrete terms. – GManNickG Dec 21 '21 at 01:44
  • I did explain the latency difference. First, in `testProto()` you have a heap allocation, then a construction, then a dereference through `this` to call a (non-const) method, then a copy taking place. In `testMemcpy` you have nothing whatsoever; there is no observable behavior produced by running that function, so it is (can be) a no-op. Fine, you win, I'll write a benchmark. – viraltaco_ Dec 21 '21 at 06:12