6

I want to extract individual bytes from an 8-byte type, with something like char func(long long number, size_t offset), so that for offset n I get the nth byte (0 <= n <= 7). While doing so I realized I have no idea how an 8-byte variable is actually represented in memory. I hope you can help me figure it out. I first wrote a short Python script that prints numbers whose low bytes are all As (ASCII value 65):

sumx = 0
for x in range(8):
    sumx += (ord('A')*256**x)
    print('x {} sumx {}'.format(x,sumx))

The output is

x 0 sumx 65
x 1 sumx 16705
x 2 sumx 4276545
x 3 sumx 1094795585
x 4 sumx 280267669825
x 5 sumx 71748523475265
x 6 sumx 18367622009667905
x 7 sumx 4702111234474983745

In my mind each number is a run of As followed by 0s. Next I wrote a short C++ program to extract the nth byte:

#include <iostream>
#include <array>

char func0(long long number, size_t offset)
{
  offset <<= 3;
  return (number & (0x00000000000000FF << offset)) >> offset;
}

char func1(long long unsigned number, size_t offset)
{
  char* ptr = (char*)&number;
  return ptr[offset];
}

int main()
{
  std::array<long long,8> arr{65,16705,4276545,1094795585,280267669825,71748523475265,18367622009667905,4702111234474983745};
  for (int i = 0; i < arr.size(); i++)
    for (int j = 0; j < sizeof(long long unsigned); j++)
      std::cout << "char " << j << " in number " << i << " (" << arr[i] << ") func0 " << func0(arr[i], j) << " func1 " << func1(arr[i], j) << std::endl;
  return 0;
}

Here is the program output (notice the difference starting at the 5th byte):

~ # g++ -std=c++11 prog.cpp -o prog; ./prog
char 0 in number 0 (65) func0 A func1 A
char 1 in number 0 (65) func0  func1
char 2 in number 0 (65) func0  func1
char 3 in number 0 (65) func0  func1
char 4 in number 0 (65) func0  func1
char 5 in number 0 (65) func0  func1
char 6 in number 0 (65) func0  func1
char 7 in number 0 (65) func0  func1
char 0 in number 1 (16705) func0 A func1 A
char 1 in number 1 (16705) func0 A func1 A
char 2 in number 1 (16705) func0  func1
char 3 in number 1 (16705) func0  func1
char 4 in number 1 (16705) func0  func1
char 5 in number 1 (16705) func0  func1
char 6 in number 1 (16705) func0  func1
char 7 in number 1 (16705) func0  func1
char 0 in number 2 (4276545) func0 A func1 A
char 1 in number 2 (4276545) func0 A func1 A
char 2 in number 2 (4276545) func0 A func1 A
char 3 in number 2 (4276545) func0  func1
char 4 in number 2 (4276545) func0  func1
char 5 in number 2 (4276545) func0  func1
char 6 in number 2 (4276545) func0  func1
char 7 in number 2 (4276545) func0  func1
char 0 in number 3 (1094795585) func0 A func1 A
char 1 in number 3 (1094795585) func0 A func1 A
char 2 in number 3 (1094795585) func0 A func1 A
char 3 in number 3 (1094795585) func0 A func1 A
char 4 in number 3 (1094795585) func0  func1
char 5 in number 3 (1094795585) func0  func1
char 6 in number 3 (1094795585) func0  func1
char 7 in number 3 (1094795585) func0  func1
char 0 in number 4 (280267669825) func0 A func1 A
char 1 in number 4 (280267669825) func0 A func1 A
char 2 in number 4 (280267669825) func0 A func1 A
char 3 in number 4 (280267669825) func0 A func1 A
char 4 in number 4 (280267669825) func0  func1 A
char 5 in number 4 (280267669825) func0  func1
char 6 in number 4 (280267669825) func0  func1
char 7 in number 4 (280267669825) func0  func1
char 0 in number 5 (71748523475265) func0 A func1 A
char 1 in number 5 (71748523475265) func0 A func1 A
char 2 in number 5 (71748523475265) func0 A func1 A
char 3 in number 5 (71748523475265) func0 A func1 A
char 4 in number 5 (71748523475265) func0  func1 A
char 5 in number 5 (71748523475265) func0  func1 A
char 6 in number 5 (71748523475265) func0  func1
char 7 in number 5 (71748523475265) func0  func1
char 0 in number 6 (18367622009667905) func0 A func1 A
char 1 in number 6 (18367622009667905) func0 A func1 A
char 2 in number 6 (18367622009667905) func0 A func1 A
char 3 in number 6 (18367622009667905) func0 A func1 A
char 4 in number 6 (18367622009667905) func0  func1 A
char 5 in number 6 (18367622009667905) func0  func1 A
char 6 in number 6 (18367622009667905) func0  func1 A
char 7 in number 6 (18367622009667905) func0  func1
char 0 in number 7 (4702111234474983745) func0 A func1 A
char 1 in number 7 (4702111234474983745) func0 A func1 A
char 2 in number 7 (4702111234474983745) func0 A func1 A
char 3 in number 7 (4702111234474983745) func0 A func1 A
char 4 in number 7 (4702111234474983745) func0  func1 A
char 5 in number 7 (4702111234474983745) func0  func1 A
char 6 in number 7 (4702111234474983745) func0  func1 A
char 7 in number 7 (4702111234474983745) func0 A func1 A

This code has two functions: func1, which returns the expected values, and func0, which I assumed would return the same values as func1, but it doesn't, and I'm not sure why. Basically, I think of 8-byte types as an array of 8 bytes, and func1 shows this is the case in some sense. I don't see why using bit shifts to reach the nth byte doesn't work, and I'm not sure I completely understand how 8-byte variables are arranged in memory.

0xsegfault
e271p314
  • 1
    This has nothing to do with the question; it's just a gratuitous tip. "0 <= n <= 7" represents a fully closed range. Get in the habit of thinking in terms of half-open ranges, i.e., "0 <= n < 8", since that's how just about everything in programming works. Especially `for` loops. – Pete Becker Jan 01 '20 at 16:46
  • Did You consider [Endianness](https://en.wikipedia.org/wiki/Endianness)? – Robert Andrzejuk Jan 01 '20 at 21:12
  • There is no problem with endianness, I just missed that integer literals are interpreted as plain 32-bit ints by default unless they carry an LL/ULL suffix; adding the suffix solved my problem. However, when I'm clueless, unfortunately, I tend to come up with theories that don't make too much sense; today, for example, I actually thought 8-byte integers are somehow not 8 consecutive bytes in memory :-( – e271p314 Jan 01 '20 at 21:41
  • What has been implemented here will give the result of a representation on little-endian systems. If that is what You wanted, then ok. – Robert Andrzejuk Jan 01 '20 at 23:10

4 Answers

8

This is an extremely overcomplicated way to do something very simple. You don't need to even consider endian issues, because you don't need to access the memory representation of a long long just to get a byte.

Getting the n-th byte is simply a matter of masking away all other bytes and doing a conversion of that value to an unsigned char. So like this:

unsigned char nth_byte(unsigned long long int value, int n)
{
  // Precondition: n is in the range [0, 8).
  value = value >> (8 * n);   // Move the desired byte into the lowest byte.
  value = value & 0xFF;       // Mask away everything except the lowest byte.
  return static_cast<unsigned char>(value); // Return the lowest byte.
}
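As a usage sketch of the shift-then-mask approach, the block below walks over all bytes of a value from least to most significant. The names byte_at and bytes_low_to_high are hypothetical and exist only for this illustration; the code assumes 8-bit bytes and an 8-byte unsigned long long.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Same shift-then-mask idea, with the range check made explicit.
// byte_at is a hypothetical name used only for this sketch.
unsigned char byte_at(unsigned long long value, std::size_t n)
{
    assert(n < sizeof value);   // n must be in [0, 8) for an 8-byte type
    return static_cast<unsigned char>((value >> (8 * n)) & 0xFF);
}

// Collects the bytes of value, least significant byte first.
// The order depends only on the value, not on how it is stored in memory.
std::string bytes_low_to_high(unsigned long long value)
{
    std::string out;
    for (std::size_t n = 0; n < sizeof value; ++n)
        out.push_back(static_cast<char>(byte_at(value, n)));
    return out;
}
```

For the largest number in the question, byte_at(4702111234474983745ULL, n) yields 'A' for every n in [0, 8).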
Nicol Bolas
  • Thanks for helping me understand the problem and find a working solution using bit shifts. I didn't understand the compiler is not clever enough to consider 0x00000000000000FF as an 8 byte number but I guess I wasn't clever enough to understand it myself so that makes us even :-) – e271p314 Jan 01 '20 at 17:07
5

The problem is that in the code

 0x00000000000000FF << offset

the literal 0xFF on the left has type int, no matter how many leading zeros you write, so left-shifting it produces an int as well. That only works for shift counts smaller than the width of int (typically 32 bits); shifting by the width or more is undefined behavior.

Using instead:

 0xFFull << offset

solves the issue, because the ull suffix makes the literal an unsigned long long.

Of course, as said in another answer, (number >> (offset * 8)) & 0xFF is simpler and works.
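
The difference between the two literals can be checked at compile time. The following is a minimal sketch, not a definitive fix: func0_fixed is a hypothetical name for the question's func0 with the widened mask, and the code assumes 8-bit bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Leading zeros do not widen a literal: its type is chosen from the value
// and the suffix only.
static_assert(std::is_same<decltype(0x00000000000000FF), int>::value,
              "unsuffixed 0xFF is an int");
static_assert(std::is_same<decltype(0xFFull), unsigned long long>::value,
              "the ull suffix makes it unsigned long long");

// The question's func0 with the mask widened before shifting.
// func0_fixed is a hypothetical name used only for this sketch.
char func0_fixed(unsigned long long number, std::size_t offset)
{
    assert(offset < sizeof number);   // offset counts bytes, 0 is the lowest
    offset <<= 3;                     // bytes to bits, assuming 8-bit bytes
    return static_cast<char>((number & (0xFFull << offset)) >> offset);
}
```

With the widened mask, func0_fixed(280267669825ULL, 4) returns 'A', which is the case where the original func0 first went wrong.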

6502
2

The problem in func0 is that your hex literal, although written with 16 digits, is an int: leading zeros do not affect its type, and it has no suffix. Using 0xffULL (0xff as an unsigned long long) instead of 0x00000000000000ff should get you what you want.

The clue was that it worked perfectly for the first 32 bits and fell down after that. I'm at a loss to explain where that A in byte 7 came from, though.
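
One plausible explanation for the stray A, offered as an assumption rather than a guarantee: the out-of-range shift is undefined behavior, but on hardware whose 32-bit shift uses only the low 5 bits of the count, a shift by 56 behaves like a shift by 24. The resulting int mask 0xFF000000 is negative, so it is sign-extended when combined with the long long, and the final shift by 56 then picks up byte 7. The sketch below emulates that sequence without relying on undefined behavior; emulated_func0 is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Emulates what func0 may end up doing on a target whose 32-bit shift
// reduces the count modulo 32. That masking is an assumption about the
// hardware, not something the C++ standard promises.
char emulated_func0(long long number, std::size_t offset)
{
    assert(offset < sizeof number);
    const unsigned bits = static_cast<unsigned>(offset) * 8;

    // 32-bit shift with the count reduced modulo 32.
    const std::uint32_t mask32 = std::uint32_t{0xFF} << (bits % 32);

    // The original mask is an int, so a set top bit is sign-extended
    // when the mask is converted to 64 bits.
    unsigned long long mask64 = mask32;
    if (mask32 & 0x80000000u)
        mask64 |= 0xFFFFFFFF00000000ull;

    const unsigned long long masked =
        static_cast<unsigned long long>(number) & mask64;

    // Truncate to one byte, as the conversion to char does in func0.
    return static_cast<char>(static_cast<unsigned char>(masked >> bits));
}
```

Under these assumptions the emulation gives 'A' for bytes 0 to 3 and byte 7 of 4702111234474983745, and 0 for bytes 4 to 6, which matches the func0 column in the question's output.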

  • Thank you very much for your answer, it is correct. I thought it was some sort of voodoo and the fact I got A for the last byte confused me even further – e271p314 Jan 01 '20 at 17:17
2

The correct way to analyze the underlying memory representation of a variable is to use memcpy and copy to a char array (ref: C aliasing rules and memcpy):

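#include <cstddef>   // size_t
#include <cstring>   // std::memcpy
#include <iostream>  // std::cout, used by main below

// Returns the byte at position offs of num's object representation.
char get_char(long long num, size_t offs)
{
    char array[sizeof(long long)];

    std::memcpy(array, &num, sizeof(long long));

    return array[offs];
}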

Then for the following example:

int main()
{
    long long var = 0x7766554433221100;

    for (size_t idx = 0; idx < sizeof(long long); ++idx)
        std::cout << '[' << idx << ']' << '=' << std::hex << static_cast<int>(get_char(var, idx)) << '\n';
}

On little-endian systems we get:

[0]=0
[1]=11
[2]=22
[3]=33
[4]=44
[5]=55
[6]=66
[7]=77

On big-endian systems we get:

[0]=77
[1]=66
[2]=55
[3]=44
[4]=33
[5]=22
[6]=11
[7]=0

(https://en.wikipedia.org/wiki/Endianness)

(https://godbolt.org/z/xrPMVw)
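
To make the contrast between memory order and value order concrete, here is a minimal sketch that puts both views side by side. The names memory_byte, value_byte and is_little_endian are hypothetical; the code assumes 8-bit bytes and ignores mixed-endian layouts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Byte at position offs in the object representation (memory order).
unsigned char memory_byte(unsigned long long num, std::size_t offs)
{
    assert(offs < sizeof num);
    unsigned char bytes[sizeof num];
    std::memcpy(bytes, &num, sizeof num);
    return bytes[offs];
}

// Byte number n of the value, counted from the least significant byte.
// This does not depend on how the value is laid out in memory.
unsigned char value_byte(unsigned long long num, std::size_t n)
{
    assert(n < sizeof num);
    return static_cast<unsigned char>((num >> (8 * n)) & 0xFF);
}

// True when the lowest-order byte is stored first, i.e. little-endian.
bool is_little_endian()
{
    return memory_byte(1, 0) == 1;
}
```

On a little-endian machine memory_byte(v, i) equals value_byte(v, i) for every i; on a big-endian machine it equals value_byte(v, sizeof v - 1 - i).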

Robert Andrzejuk
  • Note that the result depends on endianness. – Jarod42 Jan 01 '20 at 18:58
  • @Jarod42 exactly. If You want to see how the variable is stored in memory, this is the correct way. The other answers do not show the memory representation. With different endianness, those answers will show an incorrect result. – Robert Andrzejuk Jan 01 '20 at 19:10
  • Unsure what OP actually wants (even if title mentions "memory", OP's code mismatches). As your answer goes in another direction, IMO, that fact should be pinpointed. – Jarod42 Jan 01 '20 at 19:36
  • Technically the shift and AND does exactly the same thing without the need of a function call. For instance `int nbytes = sizeof var; while (nbytes--) printf ("(var >> (%d * CHAR_BIT)) & 0xff : 0x%02x\n", nbytes, (unsigned)((var >> (nbytes * CHAR_BIT)) & 0xff));` would effectively provide the same output. – David C. Rankin Jan 02 '20 at 01:44
  • @DavidC.Rankin https://stackoverflow.com/questions/7184789/does-bit-shift-depend-on-endianness#7184905 On PowerPC the vector shifts and rotates are endian sensitive. You can have a value in a vector register and a shift will produce different results on little-endian and big-endian. – Robert Andrzejuk Jan 02 '20 at 08:50
  • @DavidC.Rankin also : bitshift, like any other part of C, is defined in terms of values, not representations. Left-shift by 1 is multiplication by 2, right-shift is division. (As always when using bitwise operations, beware of signedness. Everything is most well-defined for unsigned integral types.) – Robert Andrzejuk Jan 02 '20 at 08:52
  • @DavidC.Rankin i prefer refactoring towards single purpose functions, the code then is easier to understand. If the compiler deems it worthwhile, it will inline the code. – Robert Andrzejuk Jan 02 '20 at 08:56
  • @RobertAndrzejuk thanks. I don't mess with PowerPC often and had no idea the shifts would be endian-sensitive there. It would be an unwise move to left-shift a two's complement value, that I agree with. – David C. Rankin Jan 02 '20 at 08:57