5

In the internals of snappy, there is a conditionally compiled section that selects dereferencing a reinterpret_cast'ed pointer as the best implementation for reads and writes of potentially unaligned 16, 32, and 64 bit integers on architectures that are known to support such operations (like x86). The fallback for other architectures is to use a memcpy based implementation.

My understanding is that the reinterpret_cast implementation exhibits undefined behavior, and clang's undefined behavior sanitizer does flag it.

What is puzzling me though is: why not just use the memcpy based implementation? I would expect all but the most broken of compilers to use intrinsics to implement these memcpy calls, since the size is known at compile time. In fact I would expect identical codegen from both implementations on any modern toolchain.

However, I also recognize that snappy was written by folks who know what they are about. So this leaves me wondering whether there is still some advantage to using the reinterpret_cast mechanism that outweighs its being undefined behavior. Not wanting performance to depend on compiler quality of implementation? Something else I haven't considered?

acm
  • 12,183
  • 5
  • 39
  • 68
  • 1
    Sounds like a minefield. There isn't a language-intrinsic, correct way to even *obtain* an object at an unaligned address in the first place, so the only use case for this that I can see is I/O, so this code appears to be wrapping things like "deserialize an int". – Kerrek SB Feb 09 '14 at 18:53
  • @KerrekSB Yes, this is definitely IO or protocol/format cracking code. However, see 3.9.2 [basic types] where it explicitly states that you can round trip valid instances of trivially copyable types (which the explicit width integral types certainly are) through char arrays, without any reference to alignment of the char array bytes. So the memcpy implementation is definitely defined behavior. And the reinterpret_cast is definitely undefined. – acm Feb 09 '14 at 19:08
  • The comments accompanying [this commit](https://code.google.com/p/snappy/source/detail?r=59&path=/trunk/snappy-stubs-internal.h) are interesting. Apparently it does make a difference on performance. – T.C. Aug 22 '14 at 20:36
  • @T.C. fixed link after move: https://github.com/google/snappy/commit/f8829ea39d51432ba4e6a26ddaec57acea779f4c if there were any comments other than the commit msg, they must've been lost in the migration. – underscore_d Jul 10 '16 at 22:04

2 Answers2

4

Without knowing the programmer(s) who wrote that code in the first place, I doubt you can get a truly authoritative answer.

Here's my best guess: the authors didn't want to rely on a possible memcpy optimization (which is in no way guaranteed by the spec, even if it is implemented by many compilers). On the flip side, writing a reinterpret_cast is very, very likely to produce simply the unaligned access instruction that the authors were expecting, on practically any compiler.

While smart, modern compilers will optimize the memcpy, older ones may not. Consistent performance can be quite critical to this library, so they seem to have sacrificed some correctness (since the reinterpret_cast appears to be potentially UB) in favour of obtaining more consistent results across a wider set of compilers.

nneonneo
  • 171,345
  • 36
  • 312
  • 383
-1

The reason is that it's faster (on x86) to load an int from an unaligned address than to copy it and then load it.

The overhead of an unaligned load is about a factor 2. The memcpy boils down to 4 byte reads, 4 byte writes (or one 32 bit write, depending on compiler), and then you still need the load. In the best case, the optimizer may spot that the write-after-read is redundant.

Personally, I'd implement the safe method as 4 byte loads with shifts.

MSalters
  • 173,980
  • 10
  • 155
  • 350
  • 1
    I find that in many cases that I have examined, the two strategies produce identical code for x86, using GCC 4.8.1 on Linux targeting x86_64. – acm Feb 10 '14 at 13:06
  • 1
    In fact, using the memcpy strategy seems to let you generate better code, because with the reinterpret_cast approach you need to use -fno-strict-aliasing, whereas you don't with memcpy. – acm Feb 10 '14 at 13:34