
Suppose we use this 32-bit number: 66 DD FA EB.

Little Endian

0x100 ---> 66 (MSB)

0x99 ---> DD

0x98 ---> FA

0x97 ---> EB (LSB)

Big Endian

0x100 ---> EB (LSB)

0x99 ---> FA

0x98 ---> DD

0x97 ---> 66 (MSB)
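
To check this on a real machine, here is a minimal C sketch (assuming standard C and nothing more than storing the value into memory) that prints the byte found at each offset, offset 0 being the lowest address:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        uint32_t value = 0x66DDFAEB;
        unsigned char bytes[4];

        /* Copy the 32-bit value into a byte array so the byte at each
         * offset (offset 0 = lowest address) can be inspected. */
        memcpy(bytes, &value, sizeof value);

        for (int i = 0; i < 4; i++)
            printf("offset %d: %02X\n", i, bytes[i]);

        /* A little-endian machine prints EB FA DD 66 (lowest address
         * holds the LSB); a big-endian machine prints 66 DD FA EB. */
        return 0;
    }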

I believe I have understood the order in which the bytes are stored.

However, my question is: which byte is read first when this 4-byte word is accessed?

Is it always the byte with the lowest address that is read first, and then positioned as the most significant or least significant byte depending on the endianness of the computer?

Let me try to explain myself better. Take the little-endian example above.

The order I want to end up with is 66 DD FA EB.

To achieve this, I could read the byte with the lowest address and place it on the right of the page, then continue to the left for the following bytes:

         EB 

      FA EB

   DD FA EB

66 DD FA EB

The result is 66 DD FA EB, which is correct.

Or I could start from the byte with the highest address, place it on the left of the page, and continue to the right for the following bytes:

66

66 DD

66 DD FA

66 DD FA EB

The result is again 66 DD FA EB, which is also correct.
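
To convince myself that both modes give the same value, here is a minimal C sketch (assuming standard C) that assembles the word both ways from the little-endian byte layout above:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Bytes as stored by a little-endian machine, lowest address first. */
        unsigned char mem[4] = { 0xEB, 0xFA, 0xDD, 0x66 };

        /* Mode 1: start at the lowest address and shift each byte into a
         * higher position (building the value from the right). */
        uint32_t low_first = 0;
        for (int i = 0; i < 4; i++)
            low_first |= (uint32_t)mem[i] << (8 * i);

        /* Mode 2: start at the highest address and shift earlier bytes up
         * (building the value from the left). */
        uint32_t high_first = 0;
        for (int i = 3; i >= 0; i--)
            high_first = (high_first << 8) | mem[i];

        printf("%08X %08X\n", (unsigned)low_first, (unsigned)high_first);
        /* Both print 66DDFAEB: the iteration order doesn't matter,
         * only which address maps to which significance. */
        return 0;
    }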

Since the memory address of the word is the address of its first byte (the lowest address), independently of endianness, I assume that it is always this first byte (lowest address) that is read first, and then positioned in the correct order.

So my final question is:

which of the two modes shown above does a little-endian computer use to read the 4-byte word in this example?

Peter Cordes
Antonio
  • It's a theoretical question; that's not how CPUs work. They read words or even bigger units (cache lines). But the first version makes more sense. As you said, you start at the low address. – Jester May 06 '19 at 15:36
  • Thank you for your answer. Since a word is a set of bytes, even if it is read in one go, I think there is always a starting point from which reading begins, right? – Antonio May 06 '19 at 16:09
  • The address after 0x99 is not 0x100; it is 0x9a. 32-bit values are typically placed starting at an address that is a multiple of 4, not an odd address such as 0x97. It would be clearer (to me) for the example to use byte addresses 0x100, 0x101, 0x102, and 0x103. – prl May 06 '19 at 16:32
  • In a modern CPU, the address to start reading from memory into cache would be 0xc0. It would read all bytes 0xc0 through 0xff at one time. Once the line is in the cache, the bytes needed by the instruction would be transferred all at one time. (Unless the 4-byte value spans two cache lines, in which case it may take more than one cycle.) – prl May 06 '19 at 16:35
  • Thanks. So using 0x100, 0x101, 0x102, 0x103, which of these bytes is read first? – Antonio May 06 '19 at 16:43
  • None are read first. They are all read at the same time. Unless you're talking about a really old CPU with a small (16- or 8-bit) data bus, in which case it depends on the CPU. Usually the low bytes are read first, but that is not necessarily the case. – 1201ProgramAlarm May 06 '19 at 17:02
  • OK, I understand they are read as a single unit, but my question refers to the order of the bytes. – Antonio May 06 '19 at 17:11
  • Hi again :) No, your understanding is wrong: on a 32-bit/64-bit CPU, if the CPU reads a word, it reads the whole word as a single unit. There is no "starting from which byte" concept. When you take a box of matches, you take the whole box; you don't start with an individual match to do that. – tum_ May 06 '19 at 17:25

2 Answers


Modern CPUs typically read all 4 bytes at once, from cache or a wide-enough external data bus.

When they read a whole cache line, there is a related concept of "critical word first", i.e. stream data from memory starting with the 8-byte chunk that includes the data you're actually waiting for (the demand load that missed in cache), so the load can complete and forward data to the next instruction waiting for it ASAP.

But this is still assuming that the bus width / transfer chunk size is at least as wide as a load, so we're not getting into details of which half of a load is loaded first if you have a narrow bus (like on a 386 SX for example).

Modern CPUs have internal busses up to 64 bytes wide (e.g. between L2 and L1d cache in Skylake, and between SIMD load/store units and L1d cache in Skylake-avx512 and later Intel CPUs). DDR SDRAM transfers 64 bytes in 8x 8-byte transfers over the 64-bit external bus.
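
As a software-level sketch of that point (assuming a little-endian machine and standard C), a 4-byte value comes out of memory as one load; the only "ordering" involved is the fixed mapping from byte address to significance, which you can reproduce by hand with shifts:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Bytes as they sit in memory, lowest address first. */
        unsigned char mem[4] = { 0xEB, 0xFA, 0xDD, 0x66 };

        /* Native load: the hardware fetches all four bytes as one unit. */
        uint32_t native;
        memcpy(&native, mem, sizeof native);

        /* Explicit little-endian interpretation: the lowest address is
         * the least significant byte. */
        uint32_t le = (uint32_t)mem[0]
                    | (uint32_t)mem[1] << 8
                    | (uint32_t)mem[2] << 16
                    | (uint32_t)mem[3] << 24;

        /* On a little-endian machine both print 66DDFAEB. */
        printf("native = %08" PRIX32 ", assembled = %08" PRIX32 "\n", native, le);
        return 0;
    }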


Fun fact: little-endian was originally chosen by Intel for the 8008, which was going to be (or to replace) a CPU with a bit-serial ALU and address calculation. Since carry propagates from low to high when doing addition / subtraction, a bit-serial machine needs to start with the low bit first. For instructions that contain an absolute address (along with an opcode?), that meant putting the low bits in the first byte and the high bits in the next byte, so ALU / address-calc cycles could overlap with the cycles spent reading one bit at a time.

The 8008 ended up not being a bit-serial design, but the ISA was designed to leave that option open. See Why is x86 little endian? for some quotes.

But anyway, a bit-serial machine wants to load the low bit first. (Over a 1-bit bus, using shift-registers for storage instead of normal RAM).
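
A minimal C sketch of why the low bits have to come first (an illustration of the idea, not real 8008 hardware): a bit-serial add consumes bit 0 of both operands before it can produce any higher bit, because the carry ripples upward:

    #include <stdint.h>
    #include <stdio.h>

    /* Add two 16-bit values one bit at a time, starting from bit 0,
     * the way a bit-serial ALU would.  The carry out of each position
     * feeds the next, so the low bits must be available first. */
    static uint16_t bit_serial_add(uint16_t a, uint16_t b)
    {
        uint16_t sum = 0;
        unsigned carry = 0;
        for (int i = 0; i < 16; i++) {
            unsigned abit = (a >> i) & 1;
            unsigned bbit = (b >> i) & 1;
            unsigned s = abit + bbit + carry;
            sum |= (uint16_t)((s & 1) << i);
            carry = s >> 1;
        }
        return sum;
    }

    int main(void)
    {
        printf("%04X\n", bit_serial_add(0x00FF, 0x0001)); /* prints 0100 */
        return 0;
    }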

Peter Cordes

One thing has nothing to do with the other, as mentioned a few times now in the comments. Main memory busses are at least 32 bits wide, but more commonly 64, and all transactions are those sizes. Well, that's not entirely true either, and that is more to the point: there are many busses. There is one (or several) at the edge of the processor core, then you go through MMUs and caches into chip-level busses, and then off-chip busses. On chip you are at least 32 bits wide, but more likely 64 or a power-of-two multiple of that. When you go off chip into system memory, usually DRAM, it depends on the technology; the controllers and the various busses between the cache and the external interface manage the number of bus transactions at each layer.

On a server or a desktop you likely have a 64- or 72-bit interface (the latter with ECC). For a phone, DDR4 is a 16-bit bus, so your cache line is broken into many 16-bit transactions, or 32, since it's DDR (data transferred on both halves of the clock cycle, not just once per clock, thus the name), so you can count that as 16 or 32. Your laptop might have a 32/40-bit memory stick, but I have not counted pins in a while.

Per your question though: if you go back in your wayback machine, it still depends. A 32-bit transaction would likely have been handled by a 32-bit bus and 32 bits of memory, be it four 8-bit busses, two 16s, or one 32. But even if the widths were mismatched, it was up to the system designer; we didn't have a lot of system-on-chip back then, it was a CPU, and peripherals and such were on other chips. So the memory controller designer could choose which byte transaction to do first, and it would generally be arbitrary.

The same goes for little endian. There is also the problem of what your definition of big vs. little is. The way you opened your question makes no sense whatsoever to me... For a 32-bit value 0x11223344, 0x44 is the least significant byte, and that is always true; endianness has nothing to do with it. Its ADDRESS, when considered in a byte-addressable fashion, might end with 2b00 or 2b11: that's your endianness. But we have processors that are byte-invariant (BE-8) and ones that are word-invariant (BE-32), which is what you normally think of; ARM has switched to byte-invariant, so words are swapped but bytes are at the same address (depending on the transaction size). Crazy, right? Then there is of course bit ordering, where 0x44 might be bits 7..0, while some CPUs consider the 0x44 bits to be 24..31.
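
As a concrete check of that point, here is a small sketch (assuming standard C) that reports which byte offset within a 32-bit word holds 0x44, the byte that is least significant no matter what the endianness is:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        uint32_t v = 0x11223344;   /* 0x44 is always the least significant byte */
        unsigned char b[4];
        memcpy(b, &v, sizeof v);

        for (int i = 0; i < 4; i++)
            if (b[i] == 0x44)
                printf("0x44 sits at byte offset %d within the word\n", i);

        /* Offset 0 on a little-endian machine (address bits ..00),
         * offset 3 on a big-endian machine (address bits ..11). */
        return 0;
    }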

The bottom line is that there is no fixed answer; it is purely up to the system-level designers, and that doesn't mean the CPU folks alone. You could take a 68K, hook it up to an 8-bit memory, and pull a word-sized transaction out in any order you like: A[1:0] = 2b11, 2b10, 2b01, 2b00, or 2b00, 2b01, 2b10, 2b11. It's up to you; it doesn't matter and doesn't affect the processor itself, since you feed it the full word once you have gathered the bytes.

Systems that were designed with, or in the era of, 8-bit busses didn't necessarily have 32-bit transactions. Maybe 16-bit ones, but you would be more likely to put two 8-bit parts out there and control the output and write enables rather than add wait states, especially in that era. Thus the 8088 vs. the 8086: one built to be used with an 8-bit wide memory, perhaps a single part, the other for 16-bit wide systems, initially two 8-bit wide parts evolving into 16-bit wide parts as demand increased.

This question has no single answer.

halfer
old_timer