Possible to mul r1,r1?

Question

If I have

movmr x,r1

Is it possible to do?

mul r1,r1

That is, x*x. I'm trying to do this efficiently to save bytes. This is the best solution I can think of so far, but I can't find out whether it's allowed.

The whole equation is (x+y)(x-y), so I reduced it to x^2 - y^2.

Additionally, in case you were wondering, f+d / exe is counted per byte.

OPC = 8 bits, x/y = 20 bits, reg = 3 bits. So movmr x,r1 is 4 f+d and 4 exe.

Edit: We're using a Linux-based system.

Decide if it's "mips" or "x86-64" or something else. But generally speaking, yes, it is usually allowed to multiply by itself. — Jester, Nov 03 '17 at 18:45
Choose exactly one instruction set or show me the CPU that executes both MIPS and x86-64 instructions. — fuz, Nov 03 '17 at 18:54
The coding we're doing for assignments is x86-64, but we haven't really done anything like this in the code except for learning its format and byte usage. Based on the x86-64 logic, it should be possible, but I couldn't really find any examples. So that's why I was asking; thanks again, Jester. — Lawdevo, Nov 03 '17 at 18:54
Note that x86-64 does not have a `r1` register unless you aliased it to something. It also doesn't have a `mul` instruction that takes 2 operands. — Jester, Nov 03 '17 at 18:54
Your question as written is meaningless nonsense. Neither the x86-64 nor the MIPS instruction sets have a MOVMR instruction. It might make sense to someone taking your course, but otherwise I don't think anyone is going to help you. You should try asking your teacher for help. — Ross Ridge, Nov 03 '17 at 19:05
I messaged the instructor but he never replies, and his notes are obscure and don't state this specifically. That's why I have to resort to asking online, and there haven't been any office hours I can attend in person. I'm sorry if I'm not able to provide enough information regarding my question. All I know is that movmr is just memory->register. — Lawdevo, Nov 03 '17 at 19:08
What is the teaching institution and what is the course number? — Michael Petch, Nov 03 '17 at 22:35
*"We're using a linux-based system"* is still too vague: you can have Linux on x86-64 (an ordinary desktop PC) or on ARM (Android). In a high-level language working with the Linux API, both should behave the same way; in assembly they are two completely different worlds, with everything different. There are some efforts to make it more general, like "Go Assembler", whose creator is completely confused and wrong (IMO, going by his talk) in thinking the differences between platforms are just minor and can be masked out by something like that. Better to avoid things like that. — Ped7g, Nov 05 '17 at 08:39

Peter Cordes · Answer 1 · 2017-11-03T20:57:26.863

Most ISAs don't have this kind of restriction, and any that do will document it.

Normally instructions read all their input operands before writing any of their output operands, so it's fine if they overlap. Any restrictions will always be documented in ISA manuals / instruction-set references.
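
To make the "read all inputs, then write the output" point concrete, here is a minimal C sketch of a made-up register machine. It does not model any real ISA: the `regs` array, the `mul2` function, and the two-operand `mul dst, src` form are assumptions taken from the question's notation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 8-register machine; not a model of any real ISA. */
uint32_t regs[8];

/* Model of a two-operand "mul dst, src": dst = dst * src.
   Both source values are read into temporaries before the
   destination is written, so dst == src is harmless. */
void mul2(int dst, int src)
{
    assert(dst >= 0 && dst < 8 && src >= 0 && src < 8);
    uint32_t a = regs[dst];   /* read first input  */
    uint32_t b = regs[src];   /* read second input */
    regs[dst] = a * b;        /* only now write the output */
}
```

With `regs[1]` holding x, calling `mul2(1, 1)` leaves x*x in `regs[1]`, which is what `mul r1,r1` would do on a machine that behaves this way.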

You usually only find restrictions with instructions that write more than one register, in which case unpredictable behaviour or an illegal instruction exception is normal when you give the same register for two outputs. For example, AVX512 vpgatherqq:

> The instruction will #UD fault if the destination vector zmm1 is the same as index vector VINDEX.

The AVX2 version doesn't mention this in the ISA ref manual, but I forget if there's a rule against it anywhere else.

One case where it is illegal is ARM: MUL Rd, Rm, Rs does Rd := Rm × Rs

In early ARM versions(?), the behaviour is unpredictable if Rd and Rm are the same register. (ARM wiki, and some version of official ARM docs). Perhaps early microarchitectures did some kind of multi-step micro-coded calculation and accumulated the result in the destination register.

MUL     r1,r1,r6    ; incorrect: Rd cannot be the same as Rm
MUL     r1,r6,r1    ; correct:  r1 *= r6
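
As a purely hypothetical illustration of the guess above (this is not how any documented ARM core is known to work), here is a C sketch of a multi-step shift-and-add multiplier that accumulates in the destination. The function name and the pointer-based "registers" are made up for the example. It gives the wrong answer exactly when Rd and Rm are the same, and the right answer when Rd and Rs are the same.

```c
#include <stdint.h>

/* Hypothetical multi-step multiplier that accumulates directly in the
   destination. Illustrative only: NOT a description of real ARM hardware,
   just one way such a restriction could arise. */
void mul_accumulate_in_dst(uint32_t *rd, const uint32_t *rm, const uint32_t *rs)
{
    uint32_t multiplier = *rs;   /* Rs is latched once at the start */
    *rd = 0;                     /* clears Rm as well if rd == rm */
    for (int bit = 0; bit < 32; bit++) {
        if (multiplier & (UINT32_C(1) << bit))
            *rd += *rm << bit;   /* re-reads Rm on every step */
    }
}
```

Calling it with `rd` and `rs` pointing at the same variable still produces the product, because Rs is copied before the destination is cleared. Calling it with `rd` and `rm` pointing at the same variable produces 0, because the multiplicand is destroyed by the first write.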

A later version of ARM documentation doesn't mention this restriction, so I guess it doesn't apply to later architectures? Or Google isn't finding good ISA docs; these seem to be docs for ARM's assembler. It's certainly likely that later ARM architecture versions don't have the restriction, but IDK why later docs don't mention when the restriction was removed.

davespace says that it's Rs and Rm (the two source operands) that can't be the same. That doesn't match what any other docs say, and makes less sense microarchitecturally, so I think it's wrong.

There's also a restriction on ARM's 32x32 => 64 bit full-multiply umull RdLo, RdHi, Rm, Rs: RdLo, RdHi, and Rm all have to be different registers.

UMULL  r1, r0, r0, r0     ; unpredictable, RdHi and Rm are the same.
UMULL  r2, r1, r0, r0     ; r1:r2  =  r0*r0  (r2 = low half, r1 = high half)
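
For reference, a widening 32x32 => 64-bit multiply can be sketched in C as below. The function and parameter names are made up for illustration. Since the inputs are passed by value, this sketch has no overlap hazard at all; it only shows what the two output halves contain.

```c
#include <stdint.h>

/* Full 32x32 => 64-bit unsigned multiply. The two halves of the product
   are returned through separate outputs, the way a widening multiply
   fills two registers. */
void umull_model(uint32_t *lo, uint32_t *hi, uint32_t m, uint32_t s)
{
    uint64_t product = (uint64_t)m * s;  /* widen before multiplying */
    *lo = (uint32_t)product;             /* low 32 bits  */
    *hi = (uint32_t)(product >> 32);     /* high 32 bits */
}
```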

> The whole equation is (x+y)(x-y) and so i reduced it to x^2 - y^2.

That transformation makes it more expensive, not less, in the absence of any surrounding code. add/sub are cheaper than multiply: better throughput and lower latency. On x86, given x and y in registers, you'd do

; x=eax
; y=edx

lea  ecx, [rax + rdx]     ; x+y
sub  eax, edx             ; x-y
imul ecx, eax             ; (x+y) * (x-y)

This has 4-cycle latency on Intel SnB-family (3-cycle imul, and lea/sub can run in parallel; see http://agner.org/optimize/), versus:

imul  eax, eax
imul  edx, edx
sub   eax, edx

This has 5-cycle latency if eax and edx are ready at the same time. No existing x86 CPUs have more than 1 scalar multiply execution unit, so there's a resource conflict: the 2nd imul has to wait a cycle before it can execute. Depending on the surrounding code, port1 might not be a throughput bottleneck, and maybe one or the other of the inputs is ready a cycle earlier anyway.

However, if x or y is invariant, you can compute a new (x+y) * (x-y) more cheaply with the x^2 - y^2 form, using just 2 instructions (one imul and one sub), by CSEing (computing only once) the square that doesn't change.

The imul/imul/sub version destroys both inputs, so if you need x or y after this you need a mov. The lea/sub/imul version preserves y (in edx) and leaves x-y in a register.
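
A rough C sketch of the invariant-y case, assuming a loop over an array of x values (the function names and the array setup are made up for illustration). Unsigned 32-bit arithmetic is used so wraparound is well-defined and the two forms agree for every input, as they do in registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Factored form: add, subtract, and one multiply per element. */
void factored(const uint32_t *x, uint32_t y, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (x[i] + y) * (x[i] - y);
}

/* Expanded form with y*y hoisted out of the loop: one multiply and
   one subtract per element. */
void expanded_hoisted(const uint32_t *x, uint32_t y, uint32_t *out, size_t n)
{
    uint32_t y_squared = y * y;          /* loop-invariant, computed once */
    for (size_t i = 0; i < n; i++)
        out[i] = x[i] * x[i] - y_squared;
}
```

Whether a compiler actually emits the 2-instruction loop body depends on the compiler and flags; the sketch only shows where the saving comes from.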

What do you mean by micro-architecture? ARMs are not microcoded; that is generally a CISC thing, thus the point of RISC... — old_timer, Nov 03 '17 at 22:40
@old_timer: https://en.wikipedia.org/wiki/Microarchitecture. The internal implementation of the externally-visible ISA. Two CPUs could be exactly compatible for software (implement the same architecture) but have different internals (different microarchitecture). Thanks for confirming that it was a thing in earlier ARM and gone now. I didn't know what to google to find more authoritative stuff than ARM's zillions of copies of the same documentation. — Peter Cordes, Nov 03 '17 at 22:52
@old_timer: Totally unrelated, but [some ARM instructions are micro-coded](https://superuser.com/questions/934752/do-arm-processors-like-cortex-a9-use-microcode). A classic example is LDMIA (pop up to 16 registers). [As David Kanter points out](https://www.realworldtech.com/arm64/4/), handling interrupts or faults during its execution basically requires a ucoded implementation of it, and this is why AArch64 dropped it for just load/store pair. To really avoid microcode, you need to really aggressively simplify like MIPS does. This is not necessarily a good thing. — Peter Cordes, Nov 03 '17 at 23:01
We have had this discussion; it in no way requires that. A state machine works just fine, if not better, as there is far less overhead. The instruction didn't make too much sense, so there was no reason to keep it when having a do-over. It didn't have the value one might have expected when inventing it. Cool instruction, but bad side effects. — old_timer, Nov 03 '17 at 23:08
Saying that, though, obviously with two instruction sets being decoded into the same pipe, one is being translated to the other. The original Thumb is obvious: they printed the translation in the manual. How do they do it today? Don't know, but clearly it is there. — old_timer, Nov 03 '17 at 23:15
