C performance on a PIC board global variables vs. method local

Question

All,

I have C functions that are called many times a second as they are part of a control loop on a PIC18 board. These functions have variables that only need method scope, but I was wondering what overhead, if any, there is in constantly allocating these variables vs. using a global or at least higher-scoped variable. (I thought of typedef'ing a struct to pass around from a higher scope, to avoid global variable use, if performance dictates not using method-local variables.)

There are some good threads on here that cover this topic, but I have yet to see a definitive answer. Most preach best practices, which I agree with and would follow as long as there are no performance gains to be had, since every microsecond counts.

One thread mentioned using file-scoped static variables as a substitute for global variables, but I can't help but wonder if even that is necessary.

What does everyone think?

Either do some profiling, or at least read the generated assembly code to figure out how the accesses are expressed. — unwind, Feb 07 '12 at 15:19
What you must ask yourself is: do a few CPU ticks here or there matter to my application? If they do, then why on earth did you pick one of the least code-effective CPUs in the history of mankind for the project? It's sort of like buying a cheap VW Beetle and then insisting on putting a spoiler on it. — Lundin, Feb 07 '12 at 15:48
@JohanLundberg: Profiling for 8-bit embedded systems? Are there even any tools for that? No, don't profile, use an oscilloscope. If you are writing MCU applications and don't own an oscilloscope, then program performance is the least of your problems. — Lundin, Feb 07 '12 at 15:54
@Lundin, Right - 'You should do some measurements' then. And figure out if you would actually benefit from optimizations. — Johan Lundberg, Feb 07 '12 at 16:11
@Lundin: The PIC has been surpassed over the years, but its performance is pretty reasonable for a lot of purposes. A PIC running at 40MHz can set or clear a port bit every 100ns, which is about as fast as a typical 32MHz ARM (the PIC can set or clear the port bit on any given instruction; the ARM requires setting up registers with the address of the port and the value to go there, so a direct speed comparison is difficult). — supercat, Feb 09 '12 at 02:38
@supercat So in other words, the average 8-bit MCU running at 10MHz is equally fast at port toggling as a 40MHz PIC...? However, overall _code efficiency_ is the relevant factor to look at, not some simple port toggling. I read an article a few years back when they had looked at pretty much every MCU on the market, and PIC was 2nd worst at code efficiency, only 8051 was worse. I'm sure there are uses for PIC, but if you find yourself chasing microseconds, you have to ask yourself why you are using an 8-bit, and why you are using an 8-bit PIC. — Lundin, Feb 09 '12 at 08:26
@Lundin: If one needs to rapidly test, set, and clear port bits under software control, the PIC can do that faster than a lot of architectures (oddly, the only thing I knew of that could really do much better were the Scenix parts, which were PIC clones that beat the PIC by a factor of five; I think the Scenix designs have been abandoned, though, and I have no idea why Microchip didn't buy their technology, unless they'd have wanted too much?) The 16C5x was based on a 32+-year-old design, and I liked the 14-bit upgrade. The 18xx parts were a disappointment, with lots of missed opportunities. — supercat, Feb 09 '12 at 15:40
@Lundin: Still, I think the concept of having a single default register "W" wouldn't be a bad one if there were instructions to use something else in place of "W" as the source operand for the succeeding instruction. I'd also like the FSR/IND concept, if there were addresses to dereference the first few bytes off each FSR. Even as it is, though, I'm not sure how many similarly-priced micros could do a faster job of bit-banging a protocol which is similar to SPI or I2C, but which has oddball requirements that preclude the use of a normal SPI/I2C hardware module? — supercat, Feb 09 '12 at 15:49
@supercat How many programs are there that only need to do port I/O? :) And as I said, most 8-bitters (Renesas, Freescale etc) can write/read to an 8-bit port register in one clock tick, which would then be 4 times faster than your PIC example. Even the semi-ancient Freescale/Motorola 68HC05 could do that, 20 years ago. — Lundin, Feb 09 '12 at 15:54
@Lundin: It's been awhile since I've looked at the HC05, but I don't recall it ever being able to set/clear a port bit in one clock tick (or, for that matter, less than four; my recollection is that the bset/bclr instructions took five). As for how many programs just do port I/O, there are a lot of small programs which exist primarily for that purpose; the larger PICs can work well for applications which would otherwise combine a small PIC with a separate processor to run the "main application". — supercat, Feb 09 '12 at 16:25
@Lundin: A few years ago, I was very keen on the PIC. I'm far less so now. I sense a lot of "almost greatness" there, and I'd really like to see greatness emerge, but there are a lot of little shortcomings that add up badly. BTW, I wonder if any micros have a nice instruction to compute "ProdH:ProdL = Acc*Reg + ProdH"? Such an operation would be a useful analog to "add with carry", especially if one could perform repeated operations without disturbing Acc. — supercat, Feb 09 '12 at 16:32

score 2 · Answer 1 · answered Feb 07 '12 at 15:23

Accessing a local variable requires doing something like *(SP + offset) (where SP is the stack-pointer), whereas accessing a static (which includes globals) requires something like *(address).

From what I recall, the PIC instruction set has very limited addressing modes. So it's very likely that accessing the global will be faster, at least for the first time it's accessed. Subsequent accesses may be identical if the compiler holds the computed address in a register.
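To make the comparison concrete, here is a minimal sketch of the three storage choices the question mentions: auto locals, file-scope statics, and a caller-owned struct passed by pointer. The function names, the `pid_scratch_t` type, and the gains are hypothetical and exist only for illustration. All three compute the same result; which one is cheapest depends entirely on how a given compiler lays out the variables, so the useful step is to build each variant and compare the generated assembly.

```c
#include <stdint.h>

/* Variant A: auto ("method local") temporaries.
 * Conceptually accessed as *(SP + offset) on a stack-based compiler. */
int16_t pid_step_local(int16_t error, int16_t prev_error)
{
    int16_t p = (int16_t)(error * 4);                 /* hypothetical gain */
    int16_t d = (int16_t)((error - prev_error) * 2);  /* hypothetical gain */
    return (int16_t)(p + d);
}

/* Variant B: file-scope statics.
 * Conceptually accessed as *(fixed address). */
static int16_t s_p;
static int16_t s_d;

int16_t pid_step_static(int16_t error, int16_t prev_error)
{
    s_p = (int16_t)(error * 4);
    s_d = (int16_t)((error - prev_error) * 2);
    return (int16_t)(s_p + s_d);
}

/* Variant C: scratch struct owned by a higher scope and passed in.
 * Every access goes through a pointer, i.e. indirect addressing. */
typedef struct {
    int16_t p;
    int16_t d;
} pid_scratch_t;

int16_t pid_step_ctx(pid_scratch_t *s, int16_t error, int16_t prev_error)
{
    s->p = (int16_t)(error * 4);
    s->d = (int16_t)((error - prev_error) * 2);
    return (int16_t)(s->p + s->d);
}
```

Note that variant C replaces a fixed address with a pointer dereference, so on a part with limited addressing modes it may well turn out to be the slowest of the three rather than a middle ground.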

As @unwind said in the comments, you should take a look at the compiler output, and profile to confirm. I would only sacrifice clarity/maintainability if you've proved that it's worthwhile in terms of the runtime of your program.

I have never seen a PIC compiler use a software stack for anything except on PICs which have an FSR2+offset addressing mode (compilers for such PICs use FSR2 as a stack pointer). — supercat, Feb 08 '12 at 16:58

score 1 · Accepted Answer · answered Feb 07 '12 at 16:28

While I've not used every single PIC compiler in existence, there are two styles. The style I've used allocates all local variables statically by analyzing the program's call graph. If every possible call were in fact performed, the amount of stack memory consumed by locals would match what would be required by static allocation, with a couple of caveats (describing the behavior of HiTech's PICC-18 "standard" compiler; others may vary):

Variadic functions are handled by defining local-variable storage in the scope of the caller, and passing a two-byte pointer to that storage to the function being called.
For every different signature of indirect function pointer, the compiler generates a "pseudo-function" in the call graph; everything that calls a function of that signature calls the pseudo-function, and that pseudo-function calls every function with that signature that has its address taken.

In this style of compiler, consecutive accesses to local variables will be just as fast as consecutive accesses to globals. Other than global and static variables explicitly declared as "near", however, which must total no more than 64-128 bytes (varies with different models of PIC), the global and static variables for each module are located separately from local variables, and bank-switching instructions are needed to access things in different banks.
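As a rough picture of what link-time allocation of locals amounts to, the sketch below does by hand what such a compiler is described as doing automatically. It is an illustration written for this answer, not actual compiler output; the `overlay` union and the function names are hypothetical. Two functions that can never be active at the same time keep their "locals" at fixed addresses and share the same bytes.

```c
#include <stdint.h>

/* sum_to() and swap_bytes() never call each other, so their working
 * variables can occupy the same statically allocated storage. */
static union {
    struct { uint8_t i; uint16_t sum; } a;   /* "locals" of sum_to()     */
    struct { uint8_t hi; uint8_t lo; } b;    /* "locals" of swap_bytes() */
} overlay;

/* Returns 1 + 2 + ... + n using only fixed-address storage. */
uint16_t sum_to(uint8_t n)
{
    overlay.a.sum = 0;
    for (overlay.a.i = 0; overlay.a.i < n; overlay.a.i++) {
        overlay.a.sum = (uint16_t)(overlay.a.sum + overlay.a.i + 1u);
    }
    return overlay.a.sum;
}

/* Swaps the high and low bytes of v, reusing the same storage. */
uint16_t swap_bytes(uint16_t v)
{
    overlay.b.hi = (uint8_t)(v >> 8);
    overlay.b.lo = (uint8_t)(v & 0xFFu);
    return (uint16_t)(((uint16_t)overlay.b.lo << 8) | overlay.b.hi);
}

/* The caller uses both functions one after the other, so the shared
 * storage is never needed by both at once. */
uint16_t control_step(uint8_t n, uint16_t raw)
{
    uint16_t total = sum_to(n);
    return (uint16_t)(total + swap_bytes(raw));
}
```

Because every access is to a fixed address, there is no stack-pointer arithmetic at all, which is why locals under this scheme cost the same as globals apart from any bank switching.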

Some compilers which I have not used employ the "enhanced instruction set" option. This option gobbles up 96 bytes of the "near" bank (or all of it, on PICs with fewer than 96 bytes) and uses it to access 96 bytes relative to the FSR2 register. This would be a wonderful concept if it used the first 16, or maybe 32, bytes as a stack frame. Using 96 bytes means giving up all of the "near" storage, which is a pretty severe limitation. Nonetheless, compilers which use this instruction set can access local variables on a stack just as fast as, if not faster than, global variables (no bank-switch required). I really wish Microchip had an option to only set aside 16 bytes or so for the stack frame, leaving a useful amount of 'common bank' RAM, but nonetheless some people have good luck with that mode.

Wouldn't that make it impossible to write re-entrant functions and at the same time needlessly increase RAM consumption? — Lundin, Feb 08 '12 at 08:31
It only increases RAM consumption in cases where there are paths through the call graphs that are possible in theory but not in practice. It avoids any danger, however, that an "unexpected" path through the call graph would overflow the stack. Many embedded applications don't need re-entrancy, and given how poorly the normal PIC instruction set handles address arithmetic, link-time allocation of auto variables is a useful substitute. — supercat, Feb 08 '12 at 14:35
It does sound dangerous... I don't agree that you don't need re-entrancy: while RTOS may be rare on 8-bit applications, almost every embedded system has ISRs, and if you ever call a function from an ISR that is also called from the main program, that function needs to be re-entrant. — Lundin, Feb 09 '12 at 08:35
@Lundin: Depending upon the compiler, calling the same function from an interrupt and main-line code may result in (1) a link error, (2) the compiler generating code to make a backup of variables used by the routine on entry, and restore them on exit, or (3) two auto-generated copies of the routine--one for use by the mainline and one for use by the ISR. Generally, the solution is to avoid making subroutine calls from within the ISR. — supercat, Feb 09 '12 at 16:49
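The re-entrancy hazard discussed in these comments can be simulated on a desktop compiler. The sketch below is an assumption-laden model, not PIC code: `scratch` stands in for a statically allocated local, and `fake_isr` is a hypothetical hook that plays the role of an interrupt firing in the middle of the function. It also shows option (3) above, a separate copy of the routine with its own storage for the ISR.

```c
#include <stdint.h>

static uint8_t scratch;            /* stands in for a statically allocated "local" */
static void (*fake_isr)(void);     /* hypothetical hook simulating an interrupt    */

/* Main-line routine whose working variable lives at a fixed address. */
uint8_t scale_shared(uint8_t x)
{
    scratch = x;
    if (fake_isr) {                /* simulated interrupt, fires once */
        void (*isr)(void) = fake_isr;
        fake_isr = 0;
        isr();
    }
    return (uint8_t)(scratch * 2);
}

/* Separate copy of the routine for ISR use, with its own storage. */
uint8_t scale_isr_copy(uint8_t x)
{
    static uint8_t scratch_isr;
    scratch_isr = x;
    return (uint8_t)(scratch_isr * 2);
}

static void isr_calls_shared(void) { (void)scale_shared(50); }
static void isr_calls_copy(void)   { (void)scale_isr_copy(50); }

/* The "ISR" re-enters scale_shared() and overwrites scratch. */
uint8_t demo_clobbered(void)
{
    fake_isr = isr_calls_shared;
    return scale_shared(1);
}

/* The "ISR" uses its own copy, so the main-line value survives. */
uint8_t demo_safe(void)
{
    fake_isr = isr_calls_copy;
    return scale_shared(1);
}
```

In this model `demo_clobbered()` returns 100 instead of the expected 2, because the simulated interrupt overwrote the shared variable, while `demo_safe()` returns 2.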

score 0 · Answer 3 · answered Feb 07 '12 at 15:46

I would imagine that this depends a lot on which compiler you are using. I don't know PIC, but I'm guessing some (all?) PIC compilers will optimize the code so that local variables are stored in CPU registers whenever possible. If so, then local variables will likely be as fast as globals.

Otherwise, if the local variable is allocated on the stack, the global may be a bit faster to access (see Oli's answer).
