
For this question, I'm going to assume every character is single-byte ASCII. If my understanding is correct, endianness applies to the byte ordering of multi-byte words. Because such strings have only one byte per character, there is no endianness.

But this becomes a bit confusing to me, as strings are often stored with a nul character at the 'end' of the string, and wouldn't that matter with respect to endianness? As an example,

.data
string: .asciz "Save"

Now in gdb to print the memory locations of S a v e:

>>> x/cb &string
0x4000b9:   'S'
>>> x/cb (char *) &string+1
0x4000ba:   'a'
>>> x/cb (char *) &string+2
0x4000bb:   'v'
>>> x/cb (char *) &string+3
0x4000bc:   'e'          # LSB at highest memory address (big endian??)

Isn't the string here essentially 'big endian' because the least significant byte (e) is stored at the highest memory address (string+3)?

What part am I missing with how endianness does or does not apply to strings? I think perhaps I may be mistaking char-array indexing for endianness, but an answer that clearly points that out would be great.

carl.hiass
  • No character of a string is more or less significant than the other. – Ross Ridge Sep 20 '20 at 01:50
  • @RossRidge would you care to explain that a bit more, perhaps in an answer, to explain where my understanding may be wrong with the above? Maybe what I'm thinking of endianness is actually array-indexing where it always increases in memory address with the array offset? – carl.hiass Sep 20 '20 at 01:53
  • Think about it like this: `'c'` `'o'` `'w'` `0`. Those are the 4 bytes that make up the string `"cow"`. Each ASCII character, as you correctly indicate, is a single byte. The final byte, the *nul-character*, is simply ASCII `0` (zero). Made up of characters, there is no endianness effect at all (that is how you serialize data to make it portable across machines -- store it as a sequence of characters). Where you may be confused is that multibyte types are subject to the endianness of the hardware. If those same 4 bytes were read as a 32-bit `int` on a little-endian machine, the value would be `0x00776f63`, i.e. `0` `'w'` `'o'` `'c'` from most to least significant byte. – David C. Rankin Sep 20 '20 at 02:05
  • Processors cannot read a string with a single load instruction; you must access it one character at a time. In other words, strings behave like an array. The first letter is at array index 0, the next at 1, etc.; everybody agrees that is the natural way. There may well be an endianness concern when loading a character. Not for an .asciz string, since it takes only a single byte per char, but it matters for encodings like UTF-16 and UTF-32. – Hans Passant Sep 20 '20 at 02:06
  • You've labelled one of the characters as being the least significant in the string without explaining why. Your choice seems arbitrary. Why couldn't any of the other characters, `S`, `a`, or `v`, have been the least significant? Any arbitrary significance you put on the characters is equally valid. – Ross Ridge Sep 20 '20 at 02:06
  • @RossRidge I see what you're saying, I usually see the 'right-most' byte/digit as least-significant, like '5' in the decimal '45' or '1' in the binary 0b001 (though I know I'm not speaking of bytes with those two examples, just the concept). – carl.hiass Sep 20 '20 at 02:11
  • With integers, the least significant byte contributes least to the value of the number. When we write out numbers in English, the rightmost digit makes the least significant contribution to the value. When we store integers in memory, the least significant byte makes the smallest contribution to the value of the integer. Whether this least significant byte is stored first or last in memory determines endianness. – Ross Ridge Sep 20 '20 at 02:18
  • Processors generally have to handle two kinds of data: fixed length and variable length. But they only do fixed length really well. Variable-length items are usually converted to fixed length by using a pointer to memory. Thus, the processor works with fixed-length pointers and individual elements when processing variable-length items like strings. – Erik Eidt Sep 20 '20 at 02:48
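
The UTF-16 point raised in the comments can be checked with a short sketch. Python is used here only for convenience, and the variable names are illustrative rather than taken from the question. It encodes the same text as ASCII and as UTF-16 in both byte orders:

```python
text = "Save"

# ASCII: one byte per character, so there is only one possible byte sequence.
ascii_bytes = text.encode("ascii")

# UTF-16: each code unit is two bytes, so the order of those two bytes
# must be chosen explicitly (or signalled with a byte-order mark).
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

print(ascii_bytes.hex(" "))  # 53 61 76 65
print(utf16_le.hex(" "))     # 53 00 61 00 76 00 65 00
print(utf16_be.hex(" "))     # 00 53 00 61 00 76 00 65
```

In all three cases the characters appear in string order at increasing offsets. Only the two bytes inside each UTF-16 code unit change places.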

2 Answers


Endianness can only exist for data that you access both as individual bytes and with some larger access size. If you were doing DWORD loads of string data, endianness would determine whether `register &= 0xFF` isolates the first or the last char of the string. (On x86, with the DWORD loaded into `eax`, `movzx edx, al` would isolate the first byte because x86 is little-endian: lower-address bytes end up in register positions that are closer to being right-shifted out.)

If you're not doing that, and are just looking at the address of each byte, the entire concept of endianness does not apply. The bytes are in the order dictated by their addresses, including the ASCII NUL (`'\0'`, i.e. a 0 byte) at the end; it's not special in this respect.
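
As a rough illustration of the masking idea, here is a small Python sketch rather than real x86 code. `int.from_bytes` stands in for a 4-byte load under each byte order, and the variable names are made up for this example:

```python
data = b"Save"  # bytes at increasing addresses: 'S', 'a', 'v', 'e'

# Interpret the same four bytes as one 32-bit integer in each byte order.
as_little = int.from_bytes(data, "little")
as_big = int.from_bytes(data, "big")

# Masking with 0xFF keeps the least significant byte of the integer.
low_byte_little = chr(as_little & 0xFF)  # first char of the string
low_byte_big = chr(as_big & 0xFF)        # last char of the string

print(low_byte_little, low_byte_big)  # S e
```

Indexing `data[0]` gives `'S'` either way; the difference only shows up once the four bytes are treated as a single wider value.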

Peter Cordes

The address space in this case is based on bytes: each individual address points at a byte. You cannot have endianness within a single byte quantity; it takes multiple bytes.

If you have

0x1000 'S'  (0x53)
0x1001 'a'  (0x61)
0x1002 'v'  (0x76)
0x1003 'e'  (0x65)

There is no endianness there. A string is individual bytes that represent characters in memory linearly in sequential addresses.

If you were to examine those BYTES no longer as characters but as bytes, with, say, a 32-bit WORD view, then:

0x1000: 0x53617665 is a typical big endian view
0x1000: 0x65766153 is a typical little endian view

Both are the same data at address 0x1000, seen through a 32-bit read. At this point it is not a string; it is bytes being viewed 32 bits at a time at some address. Endianness is an AND thing: it only comes up if you are trying to view or use the data as bytes AND as a larger quantity, two views of the same data for some reason. An ASCII string is not something we view like that.
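
The two views can be reproduced with a minimal Python sketch using the standard `struct` module. The `">I"` and `"<I"` format strings select a big-endian and a little-endian unsigned 32-bit read; the variable names are only illustrative:

```python
import struct

memory = b"Save"  # 0x53 0x61 0x76 0x65 at consecutive addresses

# Byte view: indexing follows addresses; no byte order is involved.
byte_view = [hex(b) for b in memory]

# 32-bit word view: the value depends on which byte order is assumed.
(big_view,) = struct.unpack(">I", memory)
(little_view,) = struct.unpack("<I", memory)

print(byte_view)         # ['0x53', '0x61', '0x76', '0x65']
print(hex(big_view))     # 0x53617665
print(hex(little_view))  # 0x65766153
```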

Note that strings, integers, floats, booleans, addresses, and all other data types are irrelevant to the processor; bits is bits. They only mean something to the processor, and to the user, when they are used; otherwise they are just bits with no meaning. For example, you can "copy" a(n ASCII) "string" by doing word reads and writes, as memcpy() might. To you it is a string, but it is just bytes being copied. Big or little endian does not matter: all of the bytes are picked up and put down in groups, and the result still looks like a string when viewed as a linear sequence of bytes by that processor and its addressing.
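
A sketch of that word-wise copy, assuming a hypothetical helper `copy_by_words` that is not part of any library and only models 32-bit loads and stores with Python's `struct` module:

```python
import struct

def copy_by_words(src: bytes, word_format: str) -> bytes:
    """Copy src four bytes at a time via 32-bit integer reads and writes.

    Hypothetical helper for illustration; len(src) must be a multiple of 4.
    """
    out = bytearray()
    for offset in range(0, len(src), 4):
        (word,) = struct.unpack_from(word_format, src, offset)  # "load"
        out += struct.pack(word_format, word)                   # "store"
    return bytes(out)

original = b"Save me!"  # 8 bytes, i.e. two 32-bit words

copied_le = copy_by_words(original, "<I")
copied_be = copy_by_words(original, ">I")

print(copied_le == original, copied_be == original)  # True True
```

The intermediate integer values differ between the two byte orders, but because the load and the store use the same order, the bytes land back in their original positions.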

There are exceptions to these general statements, based on processors that have different endian modes and various other atypical situations that I have certainly experienced, but they don't need to confuse things here. The general understanding is that the low-address byte is either the most significant (big endian) or the least significant (little endian) byte in an access that is sized in multiple bytes (16 bit, 32 bit, 64 bit, etc.). This assumes a byte is 8 bits on your system; 9-bit and other byte sizes would not change this, they would just change the size of the accesses.

The biggest problem with endianness is that folks try to overcomplicate it. "OMG this is an X endian processor, I am used to a Y endian processor, it is going to make my life difficult, I am going to have to play games with addressing and do all this extra work." Nope, in general you just created a problem that was not there, and now you have bugs you have to fix.

The right answer is to understand the system first and not think of that e-word. Then, when you see the busses or the peripherals and their interfaces, or the data objects you need to move around from the network or filesystems, etc., compare them to the e-word of your computer and decide from a system engineering perspective: does this already fit into the e-word of this system if I do this access to this thing, or do I need to shift or byte swap or otherwise convert the data so that when I perform operation X on that data it is oriented right? If you do not have to perform an actual operation (addition of some numbers, etc.), do you even care? If you are simply transferring data from point A to point B, and the system engineering shows that no data manipulation is required (reading a file from a hard drive and transmitting it over the network), then you do not need to think about or talk about the e-word.
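
One way to picture that distinction is the following Python sketch. The wire format is invented purely for illustration (a 2-byte big-endian length field followed by ASCII payload) and is not taken from any real protocol:

```python
# Hypothetical wire format for illustration: a 2-byte big-endian length
# field followed by that many bytes of ASCII payload.
message = b"\x00\x04Save"

# Forwarding the message needs no byte-order handling at all:
# the bytes are passed along exactly as received.
forwarded = bytes(message)

# Using the length as a number is an actual operation on multi-byte data,
# so the byte order defined by the format has to be stated.
length = int.from_bytes(message[:2], "big")
payload = message[2:2 + length]

print(forwarded == message)  # True
print(length, payload)       # 4 b'Save'
```

Only the length field, which is interpreted as a number, needs a byte order; the payload and the forwarded copy are handled purely as bytes.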

halfer
old_timer