3

What would be the most efficient way to read a UInt32 value from an arbitrary memory address in C++? (Assuming Windows x86 or Windows x64 architecture.)

For example, consider having a byte pointer that points somewhere in memory to block that contains a combination of ints, string data, etc., all mixed together. The following sample shows reading the various fields from this block in a loop.

typedef unsigned char* BytePtr;
typedef unsigned int UInt32;

...

BytePtr pCurrent = ...;

while ( *pCurrent != 0 )
{
    ...

    if ( *pCurrent == ... )
    {
        UInt32 nValue = *( (UInt32*) ( pCurrent + 1 ) );    // line A

        ...
    }

    pCurrent += ...;
}

If at line A, pPtr happens to contain a 4-byte-aligned address, reading the UInt32 should be a single memory read. If pPtr contains a non-aligned address, more than one memory cycles my be needed which slows the code down. Is there a faster way to read the value from non-aligned addresses?

Cody Gray - on strike
  • 239,200
  • 50
  • 490
  • 574
xxbbcc
  • 16,930
  • 5
  • 50
  • 83

3 Answers3

4

I'd recommend memcpy into a temporary of type UInt32 within your loop.

This takes advantage of the fact that a four byte memcpy will be inlined by the compiler when building with optimization enabled, and has a few other benefits:

  • If you are on a platform where alignment matters (hpux, solaris sparc, ...) your code isn't going to trap.
  • On a platform where alignment matters there it may be worthwhile to do an address check for alignment then one of a regular aligned load or a set of 4 byte loads and bit ors. Your compiler's memcpy very likely will do this the optimal way.
  • If you are on a platform where an unaligned access is allowed and doesn't hurt performance (x86, x64, powerpc, ...), you are pretty much guarenteed that such a memcpy is then going to be the cheapest way to do this access.
  • If your memory was initially a pointer to some other data structure, your code may be undefined because of aliasing problems, because you are casting to another type and dereferencing that cast. Run time problems due to aliasing related optimization issues are very hard to track down! Presuming that you can figure them out, fixing can also be very hard in established code and you may have to use obscure compilation options like -fno-strict-aliasing or -qansialias, which can limit the compiler's optimization ability significantly.
Peeter Joot
  • 7,848
  • 7
  • 48
  • 82
  • Thanks - I must admit, I didn't know about the strict aliasing rule. That also explains the other answers, too. I'm using VC++ & only x86/x64 so portability is no concern in this case at all. Much of the code is tied to the hardware for other reasons. – xxbbcc Jan 26 '12 at 08:52
3

Your code is undefined behaviour.

Pretty much the only "correct" solution is to only read something as a type T if it is a type T, as follows:

uint32_t n;
char * p = point_me_to_random_memory();

std::copy(p, p + 4, reinterpret_cast<char*>(&n));

std::cout << "The value is: " << n << std::endl;

In this example, you want to read an integer, and the only way to do that is to have an integer. If you want it to contain a certain binary representation, you need to copy that data to the address starting at the beginning of the variable.

Kerrek SB
  • 464,522
  • 92
  • 875
  • 1,084
  • Exactly what is undefined about my code? It reads a stream of bytes and based on certain bytes, it knows that an integer follows. Unless I'm overlooking something, your example has nothing to do with my question. – xxbbcc Jan 25 '12 at 05:35
  • It's undefined behavior according to the C++ standard minimum requirements for implementations, true, but on Windows x86 and x64 it works. – Ben Voigt Jan 25 '12 at 05:36
  • @xxbbcc: The left-most `*` on your Line A triggers undefined behaviour, because you are dereferencing something which isn't actually a pointer to the advertised type. – Kerrek SB Jan 25 '12 at 05:37
  • @KerrekSB Maybe I'm misunderstanding it but as far as I'm aware, that left-most `*` simply grabs sizeof(UInt32) bytes from the pointer. Why do you say this is wrong? – xxbbcc Jan 25 '12 at 05:41
  • @xxbbcc - You are reading an int from a pointer to char. On some hardware this will trap if the pointer isn't properly aligned, so the language can't guarantee anything (UB). At best it is implementation defined and working for x86 hardware. Like you say, an x86 will read two words and pick the bits you need. On some other systems you will get a bus error. – Bo Persson Jan 25 '12 at 09:52
  • @BoPersson Thanks for the explanation - I'm aware of that, that's why I added x86 & x64 to my post. The code I'm working on is specifically for these architectures and any optimization is acceptable - it should probably be written in assembly but time doesn't permit. – xxbbcc Jan 25 '12 at 13:45
  • Too much ado about nothing. Technically it's undefined, but I dare you to find an architecture in common use today where this doesn't work as expected (as long as the alignment is OK). – Mark Ransom Jan 25 '12 at 23:17
  • Still you get a +1 because your solution is better than the original. – Mark Ransom Jan 25 '12 at 23:22
  • 1
    One more thing, rather than `p + 4` I'd use `p + sizeof(n)`. – Mark Ransom Jan 25 '12 at 23:26
  • @MarkRansom: Yeah, `sizeof n` might be an idea, though in practice you should probably have something that combines a check for the size of the target **and** the source data... but I trust the OP can figure that out! – Kerrek SB Jan 26 '12 at 00:47
  • @KerrekSB Thanks for the answers - at first I didn't understand why you called my code undefined but after some reading I do now. As stated, I don't care about other hardware - as long as my code runs correctly & fast on x86/x64, it's fine. I'm looking for absolute fastest, not absolute portable in this case. (Much of the app is portable but this core needs to be fast.) – xxbbcc Jan 27 '12 at 18:20
  • @xxbbcc: I'm afraid I don't know those platforms well enough to make a conclusive statement. What you need is a guarantee of some form that `*reinterpret_cast(p + i)` does the correct thing on your platform predictably. I wouldn't know where to look this up, I'm afraid. – Kerrek SB Jan 27 '12 at 19:04
0

Let the compiler do the optimizing!

UInt32 ReadU32(unsigned char *ptr)
{
    return  static_cast<UInt32>(ptr[0]) |
           (static_cast<UInt32>(ptr[1])<<8) |
           (static_cast<UInt32>(ptr[2])<<16) |
           (static_cast<UInt32>(ptr[3])<<24);
}
David Schwartz
  • 179,497
  • 17
  • 214
  • 278
  • I don't think you need the casts there. The shift operation already promotes all types to something suitably large I believe. Also, your answer is really for the question "what does this byte stream mean when interpreted as a little-endian integer". :-) – Kerrek SB Jan 25 '12 at 05:40
  • Why do you think this is faster? The address in `pCurrent` is dynamic and it changes as the data is read. It seems to me that your example performs 4 memory reads and then ORs the values together with shifts - this seems to be slower than the extra memory fetch that would happen for a non-aligned read. – xxbbcc Jan 25 '12 at 05:44
  • @xxbbcc: You're assuming the compiler doesn't optimize the code. – David Schwartz Jan 25 '12 at 06:09
  • @DavidSchwartz You may be right - I'll have to check the compiler's assembly output to see exactly what gets generated. – xxbbcc Jan 25 '12 at 13:47