
I was debugging critical I/O code and ran into a dilemma:

Which of these two functions is faster?
In which one will my CPU spend less time?

A: the CPU reads a peripheral register and writes to a peripheral register

void d_toggle_pin(void)
{ 
  NRF_P1->OUT ^= 1 << Debug_Pin; 
}

B: the CPU reads a RAM variable, writes to a peripheral register, and writes a RAM variable

void d_toggle_pin(void)
{ 
  static byte pin_state = 0;
  if(pin_state) 
  { 
    NRF_P1->OUTCLR = 1U << Debug_Pin;  
    pin_state = 0;
  }
  else
  { 
    NRF_P1->OUTSET = 1U << Debug_Pin;  
    pin_state = 1;
  }
}

I am working with an nRF52840 (Cortex-M4 CPU), but perhaps the answer is the same regardless of the implementation.

GabrielT
  • Seems like a premature optimisation, likely to result in an insignificant difference, unless you intend to toggle the pin very rapidly. The code generated by the compiler will depend on the optimisation level. The performance may depend on memory wait states, caching, pipelining, branch prediction, etc. You can inspect the generated code by instructing the compiler to generate an assembly listing, or inspect it in the debugger. Generally, fewer instructions in the execution path = faster code, but there are other factors, as I said. It would be quicker to test than to ask on SO. – Clifford Mar 02 '22 at 13:56
  • So you are saying that it depends a lot on what else is going on in the microcontroller? For example, if a DMA uses the memory buses a lot, function B might be better; otherwise it's A, because it produces less assembly code? Unfortunately I can't test it at the moment, so I was wondering if there was a theoretical answer. – GabrielT Mar 02 '22 at 14:26
  • With most hardware, RAM r/w is much faster than peripheral read/write, but you need to look at your hardware specs. Cache may hide RAM write times, but peripheral read/write is usually uncached. – stark Mar 02 '22 at 14:33
  • @stark Cortex M doesn't have cache and memory-mapped registers can't be cached anyway, or something is broken. – Lundin Mar 02 '22 at 14:37
  • M7 has cache, but this appears to be M4. – stark Mar 02 '22 at 14:40
  • @stark Ah yeah. This is a M4 indeed with no cache. – Lundin Mar 02 '22 at 14:51
  • If this micro-optimisation were truly necessary, you would at least inline the function to avoid the call overhead. The subtle semantic differences are probably more important, as per @Lundin's answer. – Clifford Mar 02 '22 at 21:54

1 Answer


TL;DR: the first version performs better.


The difference in terms of performance is insignificant. Cortex M3 and beyond have simple branch prediction and pipelining, but that's not going to make a whole lot of difference for this simple little code here. Sure, the 2nd version might supposedly be a tiny bit rougher on the branch predictor since those are two separate memory-mapped registers, but the difference is negligible.

In case you insist on comparing them, here's a little benchmark for gcc ARM none-eabi at -O3, where I replaced the register names and made "debug pin" a hardcoded constant: https://godbolt.org/z/88vn1EqKj. The branch was optimized away, but the first version still performs slightly better.


Your top priorities here however should be functionality and readability. These two functions are both ok, but if I were to dissect them...

  • The advantage of the XOR version is that XOR is the idiomatic way to toggle a bit, so it is readable. You are also guaranteed that the code is always in sync with the actual register value, in case it matters.

  • The drawback of the XOR version is that read-modify-write access to hardware registers can sometimes be problematic, since it introduces side effects and could in some cases lead to re-entrancy problems too. So rather than using the register value as a placeholder to XOR with, I think your other version, which keeps track of the pin state separately and only performs a write access, is fine for that reason.
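To make the re-entrancy point concrete, here is a minimal host-side sketch. It is not nRF52840 code: `fake_out` and `fake_isr` are hypothetical stand-ins for a GPIO output register and an interrupt handler, and the "interrupt" is just a function call placed between the read and the write.

```c
#include <stdint.h>

/* Hypothetical stand-in for a GPIO output register. On real hardware this
   would be a volatile memory-mapped register such as NRF_P1->OUT. */
static uint32_t fake_out;

/* Hypothetical stand-in for an interrupt handler that sets bit 3. */
static void fake_isr(void)
{
    fake_out |= 1u << 3;
}

/* Toggle bit 0 with read-modify-write while the "interrupt" fires between
   the read and the write. Returns the final register value. */
uint32_t rmw_toggle_interrupted(void)
{
    fake_out = 0;
    uint32_t tmp = fake_out;   /* read */
    fake_isr();                /* interrupt lands here and sets bit 3 */
    tmp ^= 1u << 0;            /* modify the stale copy */
    fake_out = tmp;            /* write: the ISR's bit 3 is overwritten */
    return fake_out;
}

/* Same scenario with a set-only write, mimicking what a register like
   OUTSET does in hardware: only the bits in the mask change. */
uint32_t write_only_interrupted(void)
{
    fake_out = 0;
    fake_isr();                /* interrupt may land before or after */
    fake_out |= 1u << 0;       /* stands in for a single store to OUTSET */
    return fake_out;           /* bits 0 and 3 are both set */
}
```

In this sketch the read-modify-write path ends with only bit 0 set, so the handler's change to bit 3 is lost, while the write-only path keeps both bits.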


Other things of note:

1 << ... is nearly always wrong in C. You should almost never shift a signed int, which is the type of the integer constant 1. For example, 1 << 31 invokes undefined behavior on a target with 32-bit int, such as this one. Always use 1u.
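A small sketch of the unsigned form, assuming a 32-bit register. The helper name `bit_mask` is made up for illustration and is not part of any vendor header.

```c
#include <stdint.h>

/* Build a single-bit mask from an unsigned 32-bit constant so that
   shifting into bit 31 is well defined. The caller must pass bit < 32. */
uint32_t bit_mask(unsigned bit)
{
    return UINT32_C(1) << bit;
}
```

With this, `bit_mask(31)` is a well-defined `0x80000000`, whereas the signed `1 << 31` is not.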

Writing wrapper functions for something as fundamental as setting/clearing/toggling a GPIO pin has been done a hundred times before... and nobody has ever managed to write a function wrapper that is easier to read than this:

  • reg |= mask (set)
  • reg &= ~mask (clear)
  • reg ^= mask (toggle)
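As a quick illustration of the three idioms, here they are applied to a plain variable. `reg` and `mask` are placeholders here; on a target, `reg` would be a volatile hardware register.

```c
#include <stdint.h>

/* Apply set, clear and toggle in sequence to a local stand-in register
   and return the final value. */
uint32_t demo_set_clear_toggle(void)
{
    uint32_t reg = 0;
    const uint32_t mask = 1u << 5;

    reg |= mask;    /* set:    bit 5 goes to 1, reg is 0x20 */
    reg &= ~mask;   /* clear:  bit 5 goes to 0, reg is 0x00 */
    reg ^= mask;    /* toggle: bit 5 flips,     reg is 0x20 */

    return reg;
}
```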

This is idiomatic, super-fast, super-readable C code which can be easily understood by 100% of all C programmers. After viewing hundreds of failed, bloated HALs for GPIO, I would confidently say that abstraction of simple GPIO can and will only lead to bloat. I've written a fair amount of such myself and it was always a mistake.

(For more complex GPIO that comes with a bunch of routing registers, interrupt handling, weird status flags etc then by all means write a HAL and a driver. But not for the sake of just doing simple port I/O.)

Lundin