2

I've read about endianness on Wikipedia and tried to search for my question. I found the post Does the endianness affect how structure members are stored into the memory, which explains that endianness does not affect the order of structure members in memory (from lower to higher addresses) in C.

Also from Wikipedia:

The little-endian system has the property that the same value can be read from memory at different lengths without using different addresses

but that only holds if we read from the smaller address toward the larger one.

I'm wondering where (on what architectures / in what languages) memory is traversed from higher (to be clear: larger) addresses to lower ones. Big-endianness on such a machine would have the same beneficial property mentioned in the Wikipedia quote above.

It could also mean that, in a language similar to C, malloc would return the largest address and the program would fill memory by doing received_address-- rather than ++ (I hope I've made myself clear).

I could not find by web search why computer development did not go that route, i.e. reading memory from large to small addresses (and if that Wikipedia sentence is correct, it indeed did not).

Marisha
  • First I thought humans are used to reading books from the 1st page. Then I thought maybe it would be good to have books start with the biggest page number; that way we know right from the start how thick it is! – Alex Martian Sep 29 '19 at 02:40
  • No, because `malloc` can allocate a region larger than a CPU word. C doesn't even expose the machine's endianness directly. (Although there's no UB in accessing the bytes of an object representation using `unsigned char`, this isn't something you'd normally do in C so there's no reason that the `malloc` API would be optimize for it for 1-byte allocations.) – Peter Cordes Sep 29 '19 at 03:16
  • @Peter Cordes, "No, because malloc can allocate a region larger than a CPU word." - of course it can. Maybe you misunderstood the question: I thought (in C terms) of getting a pointer and then filling the memory going address-- rather than address++. – Marisha Sep 29 '19 at 04:16
  • I thought you were arguing this had some kind of connection with endianness, and little-endian machines being able to access the low byte of a small-valued integer and get the same value. On a big-endian machine, having a pointer to the last byte of the last word of an object would mean you could load a byte from that pointer and see the same value as the word, if it was small. Your justification for the question is really confusing but that's what I thought you were getting at. – Peter Cordes Sep 29 '19 at 04:25
  • @Peter, maybe confusing, I'm thinking of rewording. On my "proposed" machine larger address would be "first" for any length - be it bit, byte, word etc. (you: "having a pointer to the last byte of the last word"). Those words "high, first, last" are subjective IMHO for static picture, we can talk about larger/smaller or/and most/least significant only, "last" have meaning for operation in time, like reading many bytes in sequence. – Marisha Sep 29 '19 at 05:25
  • @peter, I rephrased my post, please see if it is clear that way. – Marisha Sep 29 '19 at 05:36

3 Answers

3

If I understand your question as "could CPUs and software be made to go from the highest address down instead of the lowest up?", the answer is yes. It's done the way it is based on the human convention of starting at 0, but there are exceptions.

For example, in most systems like Unix, the program stack grows from top to bottom, while in Multics it grows from bottom to top. The Multics idea was that if code wrote past the end of an array or structure, it would write off into empty stack space rather than overwrite the stack values, which would be at a lower address; in Unix, the stack values at higher addresses get overwritten, causing a crash on return or enabling a security exploit.

Starting at 0 seems reasonable for older systems without memory mapping, where you can't be sure how much memory is installed, and so what the highest valid memory address is. For systems with memory mapping, there was no reason to change that convention.

John Bayko
  • Thank you. "It's done the way it is based on human convention" - as there are big- and little-endianness, I thought maybe there are (or were) computers that address memory from large to small addresses too. – Marisha Sep 29 '19 at 04:01
3

Normally there's zero connection between endianness within a word and what order you access words in. The reasoning / benefit / etc. that motivates choices for endianness within a word doesn't apply at all to how you index arrays.

e.g. Intel invented (or at least used) little-endian to make the 8008 more like a CPU with a bit-serial ALU and shift-register storage that it wanted to be compatible with. (Why is x86 little endian? and see also https://retrocomputing.stackexchange.com/questions/2008/were-people-building-cpus-out-of-ttl-logic-prior-to-the-4004-8080-and-the-6800/8664#8664 Apparently Datapoint had wanted Intel to build a bit-serial machine, and storing the jump target in LSB-first order was partly to keep them happy even though the CPU ended up not being bit-serial.)

This obviously has no relevance when doing separate accesses to separate words.

The "advantage" cited by Wikipedia is more of a "fun fact" than something that's really worth anything. Bending an ISA out of shape to get it makes no sense when it makes anything else worse, more expensive, or even just harder for humans to work with. It would only matter if you were building a CPU that decodes instructions a byte at a time or something, and could overlap fetch with decode if decode was going to be multi-cycle anyway (because carry propagates from low bits to high bits).

Although you could have made the same argument about building the first little-endian CPU in the first place, when people considered big-endian to be "natural" at the time.


Your proposed design would make the address of a word be the address of its least-significant byte. (I think).

That's more like little-endian with everything about memory addressing reversed/flipped/negated.

Otherwise it's just a software convention to return a pointer to the one-past-the-end of an allocation, which is obviously less convenient because it requires an offset to use. But if you return a pointer to the last word of an allocation, how do you know the caller wanted to treat it as words instead of bytes? malloc returns a void*. If you return a pointer to the last byte of an allocation, you have to do math to get a pointer to the last word.

So unless you do reversed-little-endian, returning anything other than a pointer to the first (or only) byte/word/doubleword/float/whatever of the allocated buffer is obviously worse, especially given an allocator like malloc that doesn't know the element size its caller is going to use to access the memory.


C's machine model is barely compatible with a reversed-little-endian system, I think. You'd want arr[i] to mean *(arr - i) instead of arr + i, and indexed addressing modes would probably support - instead of +. Then arr[i] can work transparently with a malloc that returns a pointer to the end. But C defines x[y] in terms of *(x+y), and there is code that would notice the difference and break.

Or else you'd want to count a negative index up towards zero to loop from low to high addresses, if addressing still worked like normal?

If your "normal" use case was for(i=0; i<n ; i++) and accessing arr[-i], that could work sort of the same as on a normal machine. But then you need to modify your C source to make this work on such a machine.

Or if you wanted to write loops like for(i=0 ; i>=-n ; i--), then your largest index becomes negative while your size is still positive. This just seems much more confusing.

(@Alexei Martianov's answer raises a good point: the CPU would probably need a binary subtractor inside address-generation units and other places where normal CPUs use an adder. I think a subtractor typically requires slightly more hardware than an adder. This is outside the main ALU, which of course has to be able to do both to support efficient integer math.)

Peter Cordes
  • "That's more like little-endian with everything about memory addressing reversed/flipped/negated." - yes, that is about what I thought. Thank you for the very detailed answer and links! – Marisha Sep 29 '19 at 08:41
2

As far as I know, addition is more easily done by a CPU than subtraction, therefore it is more efficient/optimal to go from lower to higher memory addresses, not vice versa.

P.S. Subtraction is usually inversion plus addition: Does a subtraction take longer than an Add in a CPU?

Alex Martian
  • `sub` and `add` instructions always have the same performance across all CPUs; I've never heard of an exception to this rule. The necessary trickery to implement subtraction in an ALU is done inside the ALU during the same clock cycle that does the operation. Most of the complexity is in handling the carry from low to high bits, e.g. carry-lookahead or carry-select to do a wide addition without a huge number of gate-delays on the critical path for that pipeline stage. No CPU actually decodes `sub` as two separate ALU operations, like `neg` and `add`. – Peter Cordes Sep 29 '19 at 03:12
  • @Peter Cordes, not for `sub reg, reg` on Sandybridge chips! See https://randomascii.wordpress.com/2012/12/29/the-surprising-subtleties-of-zeroing-a-register/ – LegendofPedro Sep 29 '19 at 03:39
  • @LegendofPedro: Ok sure, the `sub same,same` zeroing idiom is special-cased to not even need an ALU execution unit. (4/clock front-end bottleneck instead of 3/clock ALU port back-end bottleneck on SnB/IvB; see [What is the best way to set a register to zero in x86 assembly: xor, mov or and?](//stackoverflow.com/q/33666617)). But that's not relevant for address calculation / loop vars, only for zeroing a register. Any time the `sub` uop isn't eliminated in the front end, it performs identically to an `add` uop. – Peter Cordes Sep 29 '19 at 03:49
  • @Peter, you write "necessary trickery" - that is what I thought when answering: trickery takes at least some electrical power, I assume, and therefore going from large to small addresses is suboptimal for computing inside the CPU. – Alex Martian Sep 29 '19 at 04:05
  • In an ALU that *can* do both, I wouldn't assume that `add` actually costs less power. The implementation might do the preprocessing both ways and then select with a 2:1 muxer, so the work happens anyway. Or it might avoid the preprocessing by internally propagating "borrow" signals instead of carry. IIRC, Intel uses the same ALU for multiple operations and modifies what it does with control lines that affect every bit-position. (e.g. blocking carry to make it do XOR instead of ADD). – Peter Cordes Sep 29 '19 at 04:19
  • Anyway, even if there was a power cost difference, it's negligible compared to the rest of the ALU, let alone the whole CPU pipeline tracking instructions from fetch/decode to retirement. (Especially in an out-of-order design, but this decision was made way before that, on simple in-order machines. Predating RISC even, and many CPUs that old were barely pipelined, IIRC.) But still, I don't think this was a factor. People basically didn't care about software choices consuming more or less power until recently, with mobile devices. Just HW design to handle the worst case. – Peter Cordes Sep 29 '19 at 04:22
  • Now that the OP has clarified, I think you might have a point. If they want the address of a word to be the address of its least-significant byte, they would need subtraction internally in address-generation; Addressing modes would typically support `-` instead of `+` for indexing an array. So currently in places where only an adder is needed, instead you might need subtractors (*instead* of adders). I had previously been reading the question as purely a software choice on existing big-endian machines but it's more fundamental. – Peter Cordes Sep 29 '19 at 05:54