
So I'm trying to understand how ARM7 processors work; I can't wrap my head around the MLA instruction.

[Image: Multiply instruction]

The ARM7 register bank has only 2 outputs other than pc_read. How can it read Rm, Rs, and Rn at the same time to perform a multiply-accumulate instruction (Rd := Rm*Rs + Rn)?
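To make the port problem concrete, here is a toy Python model. It is only a sketch of the question, not a description of ARM7's actual design: `RegisterBank`, `tick`, `mla_reference`, and `mla_single_cycle` are names made up for illustration, and the two-reads-per-cycle limit is the only property taken from the text above.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide, so results wrap


class RegisterBank:
    """Hypothetical register bank that allows two reads per cycle."""

    def __init__(self):
        self.regs = [0] * 16
        self.reads_this_cycle = 0

    def tick(self):
        # Start a new cycle: both read ports become available again.
        self.reads_this_cycle = 0

    def read(self, index):
        if self.reads_this_cycle >= 2:
            raise RuntimeError("only two read ports per cycle")
        self.reads_this_cycle += 1
        return self.regs[index]


def mla_reference(rm_val, rs_val, rn_val):
    """Architectural result of MLA: Rd := Rm * Rs + Rn (low 32 bits)."""
    return (rm_val * rs_val + rn_val) & MASK32


def mla_single_cycle(bank, rm, rs, rn):
    """Try to fetch all three operands in one cycle; the third read fails."""
    bank.tick()
    return mla_reference(bank.read(rm), bank.read(rs), bank.read(rn))
```

In this model `mla_single_cycle` always raises on the third read, which is exactly the conflict I don't understand: either there is a hidden third port, or the operand reads are spread over more than one cycle.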

Could you walk me through the process of how it works step by step from fetch to writing back to Rd?

Will
  • depends on the depth of your pipe and the speed of your multiply. saving an instruction may or may not be shadowed by the operations... – old_timer Jul 01 '22 at 18:43
  • @old_timer That sounds like reading the third operand while the multiplication is taking place. Is that possible? If it's a simple MUL instruction, does it still go through the ALU and then get 0 added to it or something? – Will Jul 01 '22 at 20:18
  • what do you mean how is it possible. How deep is your pipe, what do you do in each stage, etc? – old_timer Jul 01 '22 at 20:57
  • are you implementing your multiply in one clock or multiple? – old_timer Jul 01 '22 at 20:57
  • obviously you either need to do more in one stage in the pipeline or you need to have more stages...which have you chosen to do? – old_timer Jul 01 '22 at 20:58
  • 3-input instructions presumably require an extra read port from the register file (or bypass forwarding), or would take multiple cycles in some stage. Why specifically ARM7, an old ARMv3 or ARMv4T microarchitecture (https://en.wikipedia.org/wiki/ARM7)? Just because it's older and simpler? Simpler even than a Cortex-M0? Or you wanted to implement only ARM mode. – Peter Cordes Jul 02 '22 at 01:20
  • if you were building two-door cars on your assembly line and you then add a four-door model, you have to add something. More than one way to solve it. But it will not fit into the textbook version of a pipeline. – old_timer Jul 02 '22 at 14:42
  • @old_timer I'm not trying to find a way to make my pipeline work but I'm rather trying to understand what ARM7(TDMI) is doing and kinda replicate it. The problem is that I don't understand what ARM7 is doing in this instruction. **An answer that I'm looking for would be a sample MLA instruction and an explanation of how it goes through the pipeline step by step until writing the output back into the Rd**. – Will Jul 03 '22 at 15:35
  • @PeterCordes Yes. An extra read port would solve all the problems without any headache, and it's simple to add to a Verilog project, but I'm not sure that's how an ARM7 processor would do it. The task is to understand and implement ARM7 because it's both old enough and new enough. The thing is that I'm knee-deep in ARM7 and can't find any reference that explains how this MLA instruction works :) I can just drop everything and implement something else or work my way around the problem, but I want to understand how it would actually work on an ARM7 processor (tbc in next comment) – Will Jul 03 '22 at 15:40
  • ... Like, if it is actually better than MUL + ADD we can optimize matrix multiplications on this device to a good degree, but if it's kinda the same and needs extra cycles to read the third operand it wouldn't be that much faster (unlike on processors with 3 read ports). Again, I'm not trying to solve a problem of my implementation not being able to do the MLA instruction; I'm rather trying to understand how ARM7 works by doing this project. @PeterCordes If you have any suggestions to make my question clear it would be greatly appreciated :) – Will Jul 03 '22 at 15:45
  • ARM official manuals should have performance numbers in cycles for each instruction. I'd assume on the real commercial implementation, it *is* more efficient than separate mul+add, not stalling. – Peter Cordes Jul 03 '22 at 20:27
  • @Will: We're software developers (not hardware/CPU designers). I'd be surprised if anyone here knows internal details of ARM7's implementation, and I'd expect that anyone who has access to the information you're asking for is an official ARM licensee that would prefer to talk to you about the consequences of potential patent infringement. – Brendan Jul 03 '22 at 21:59
  • @Will: For CPU designers, you might (or might not) have more luck at https://www.realworldtech.com/forum/?roomid=1 – Brendan Jul 03 '22 at 22:06
  • Found this course somewhere in the internet's lost and founds :) https://faculty.cc.gatech.edu/~hyesoon/spr10/lec_arm2.pdf (page 12) says that there's one additional cycle for doing accumulation. I'm guessing as the multiplication is happening (outputting on the second bus) the register bank will be occupied by the same instruction for an extra cycle to read the Rm. Sorry if this was a bad post and thank you all for wanting to help – Will Jul 04 '22 at 14:48
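
Following up on the "one additional cycle for accumulation" note, here is a rough cycle-count sketch for comparing MLA against a separate MUL + ADD. Only the extra accumulate cycle comes from the slides linked above. The rest is assumption and should be checked against the official ARM7TDMI cycle tables: that MUL takes 1 + m cycles, that m (1 to 4) is set by early termination on the value in Rs, and that a register-only ADD takes 1 cycle. All function names are made up for illustration.

```python
def multiplier_cycles(rs_value):
    """Assumed early-termination rule: m = 1..4 depending on Rs.

    m is the smallest k such that bits [31:8k] of Rs are all zeros
    or all ones; otherwise 4. This rule is an assumption to verify.
    """
    rs_value &= 0xFFFFFFFF
    for m, shift in ((1, 8), (2, 16), (3, 24)):
        upper = rs_value >> shift
        all_ones = (1 << (32 - shift)) - 1
        if upper == 0 or upper == all_ones:
            return m
    return 4


ADD_CYCLES = 1  # assumption: register-only data-processing instruction


def mul_cycles(rs_value):
    # Assumed: one cycle plus m multiplier cycles.
    return 1 + multiplier_cycles(rs_value)


def mla_cycles(rs_value):
    # One extra cycle for the accumulate, as stated in the linked slides.
    return mul_cycles(rs_value) + 1


def mul_then_add_cycles(rs_value):
    return mul_cycles(rs_value) + ADD_CYCLES


if __name__ == "__main__":
    for rs in (5, 0x1234, 0x123456, 0x12345678):
        print(hex(rs), mla_cycles(rs), mul_then_add_cycles(rs))
```

Under these assumptions the two counts come out equal for every Rs, so in this simplified model MLA would save an instruction rather than a cycle. Memory wait states and fetch timing are ignored here, so real numbers may differ.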

0 Answers