Should a semiconductor manufacturer buying IPs from ARM meet the clock cycles for an instruction described in the reference manual?

Question

For the CC3220S manufactured by Texas Instruments, I wrote a C function that uses inline assembly to wait 1 second (not counting the instructions before and after the loop). According to the ARMv7-M reference manual, a MOV instruction that targets the PC takes 1 + P cycles, where P is between 1 and 3 depending on the pipeline refill. In the worst case the loop therefore executes in 6 clock cycles (1 for SUBS, 1 for IT, and up to 4 for MOV).

The CC3220S runs at 80 MHz. However, executing the loop 10 million times produces the desired 1-second delay (verified with a logic analyzer), which means each iteration takes 8 clock cycles (80,000,000 / 10,000,000), not 6. This makes me doubt the number of clock cycles the instruction actually uses. Hence my question: should a semiconductor manufacturer buying IPs from ARM meet the clock cycles for an instruction described in the reference manual?

void delay_1sec(void)
{
    __asm("    PUSH {r4-r5,lr}");  

    __asm("    LDR r4, [pc, #12]"); 

    __asm("    MOV r5, pc");        
    __asm("    NOP");               

    __asm("    SUBS r4, #1");   /* 1 instruction cycle */ 
    __asm("    ITE NEQ");       /* 1 instruction cycle */ 

    __asm("    MOV pc, r5");    /* 1 + P cycles (P is between 1 and 3, depending on pipeline refill) */


    __asm("    POP {r4-r5,pc}"); 
    __asm("    .word    10000000"); 
}

IDK whether to expect all ARMv7-M CPUs to have the same performance; seems unlikely. But separate from that: if you're going to write the whole body of a function in inline asm, *including a return instruction* (pop into PC), make it `__attribute__((naked))` so it can't inline into other functions and break them. Also, prefer one large `asm()` statement, although inside a `naked` function separate statements are safe. But really this is total overkill; just ask the compiler for `10000000` in a `"+r" (var)` register and another `"=r"` dummy output in a GNU C Extended asm statement. — Peter Cordes, Dec 07 '19 at 18:17
This program will not be compiled with a GNU compiler but with a TI compiler. — Xhendos, Dec 07 '19 at 18:19
Gah, don't use inline assembly like this. Just use a separate assembly source file, so you don't have to have all this `__asm("...");` nonsense and don't have to worry about the compiler inserting whatever instructions it wants. — Ross Ridge, Dec 07 '19 at 18:20
Then use whatever syntax it supports for naked functions. Or does the TI compiler not even support function inlining? Wouldn't this break if it inlines into some other function and returns from *it* (using whatever happens to be in LR at the time)? Really I'd suggest the same as Ross: separate asm file if you can't use GNU C Extended asm to make a version of this that can inline. — Peter Cordes, Dec 07 '19 at 18:21
Thanks for the suggestions, but this is going too far off-topic from the original question. This is not production code, just a proof of concept. — Xhendos, Dec 07 '19 at 18:25
@Xhendos It is hard to rule out that the compiler inserts instructions on its own the way you wrote your code. — fuz, Dec 07 '19 at 18:56
I am 100% sure the compiler does a function call to delay_1sec and returns to the caller, because I checked every single instruction and single-stepped through the instructions at run time. — Xhendos, Dec 07 '19 at 19:01
ARMv7-M is an architecture; I cannot find any cycle counting in it. The page you linked is about the Cortex-M4 (which implements ARMv7-M and is sold as an IP). I'd say that all unadulterated ARM Cortex-M cores based on ARMv7-M will share the same core implementation and thus have the same cycle counts. However, that doesn't have to be true everywhere. You also need an 80 MHz clock for your code to work (which further restricts the set of microprocessors). ARMv7-M encourages single-cycle or low-cycle-count instructions, so the variation among totally different implementations shouldn't be large, yet still noticeable. — Margaret Bloom, Dec 07 '19 at 19:08
@Xhendos: My comments about how to use inline asm aren't intended as an answer to your question; that's why I posted them as a comment not an answer. It's a separate note about how (not) to use inline asm so your code won't break with optimization enabled. — Peter Cordes, Dec 07 '19 at 19:46
Depending on the core, it won't meet ARM's own documentation, depending on the compile-time options available to the chip customer. Cycle counting died when pipelines started being used; why do you care about cycle counts? MCUs are not going to meet those numbers except in rare cases. No. And you do not use a loop like this for a delay on a pipelined or similarly designed processor; you can on a PIC and some others with fixed timings, but it is a waste of time to even attempt to tune a loop like this. — old_timer, Dec 07 '19 at 19:55
Use the SysTick timer or one of the chip vendor's timers, especially for a one-second delay. — old_timer, Dec 07 '19 at 19:56
Hmm, TI docs are usually better. The SRAM is zero wait state, so they claim; that doesn't mean you will get the cycle counts you expect. The flash has a 128-bit prefetch buffer, which can help or hurt tight loops like this depending on how it is designed; usually this can be detected with experiments. The 64-bit prefetch on the M7 can make tight loops like this 20% slower depending on alignment. I can't see where they state the wait states for the flash; I would assume many, and advertising a prefetch further reinforces that, which means you won't meet those performance numbers from ARM. — old_timer, Dec 07 '19 at 20:16
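
The timer suggestion in the comments above can be sketched roughly as follows. This is only a host-side calculation, not driver code for the CC3220S. The function names are hypothetical, and it assumes the SysTick counter is clocked from the 80 MHz core clock. The SysTick reload register is 24 bits wide, so a full second at 80 MHz does not fit in one period and has to be split into shorter ticks.

```c
#include <stdint.h>

/* The SysTick reload register is 24 bits wide. */
#define SYSTICK_MAX_RELOAD 0x00FFFFFFu

/* Reload value for one tick of tick_us microseconds.
 * The counter wraps every (reload + 1) cycles, hence the "- 1".
 * Returns 0 if the tick is too short or does not fit in 24 bits.
 * Assumption: the counter runs at clock_hz (e.g. the 80 MHz core clock). */
uint32_t systick_reload_for(uint32_t clock_hz, uint32_t tick_us)
{
    uint64_t cycles = (uint64_t)clock_hz * tick_us / 1000000u;
    if (cycles < 2u || cycles - 1u > SYSTICK_MAX_RELOAD) {
        return 0u;
    }
    return (uint32_t)(cycles - 1u);
}

/* Number of ticks of tick_us needed to cover delay_us, rounded up.
 * tick_us must be nonzero. */
uint32_t ticks_for_delay(uint32_t delay_us, uint32_t tick_us)
{
    return (uint32_t)(((uint64_t)delay_us + tick_us - 1u) / tick_us);
}
```

With these assumptions, a 10 ms tick at 80 MHz needs a reload value of 799999, and a 1-second delay is 100 such ticks. A single 1-second tick would need 79999999, which exceeds the 24-bit limit, so the first function returns 0 for it.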

artless noise · Accepted Answer · 2019-12-07T19:43:45.510

From your reference,

The cycle counts are based on a system with zero wait states.

From your source the loop is,

SUBS r4, #1   /* 1 cycle */ 
ITE NEQ       /* 1 cycle */ 
MOV pc, r5    /* 4 cycles */

Assuming the compiler inserts no additional code, your memory may add 2 wait states when refilling the instruction pipeline. Also, a vendor may modify the core and need not meet this timing. Some vendors license the 'architecture' and design their own logic to implement the instruction set. Others buy a logic block that implements the Cortex-M4. I would guess TI is the latter and that memory wait states are your issue. You didn't note which memory device your code is located in. If your system executes from the 'serial flash', an additional delay of two wait states would not be surprising at all. This would bring the cycle count to 8, which is what you observe.
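
The arithmetic behind the 8-cycle figure can be sketched in C. This is a host-side sketch under the question's numbers, not code for the CC3220S. The function names are hypothetical, and the 6-cycle figure is the zero-wait-state count from the listing above.

```c
#include <stdint.h>

/* Cycles spent per loop iteration, derived from a measured delay.
 * clock_hz:         core clock frequency (80 MHz in the question)
 * measured_seconds: delay observed on the logic analyzer
 * iterations:       loop count loaded into r4 */
double cycles_per_iteration(double clock_hz, double measured_seconds,
                            uint32_t iterations)
{
    return clock_hz * measured_seconds / (double)iterations;
}

/* Extra cycles per iteration beyond the documented zero-wait-state count.
 * A positive result hints at stalls such as memory wait states. */
double stall_cycles_per_iteration(double observed_cycles,
                                  double documented_cycles)
{
    return observed_cycles - documented_cycles;
}
```

With the question's numbers, `cycles_per_iteration(80e6, 1.0, 10000000)` gives 8, and `stall_cycles_per_iteration(8.0, 6.0)` gives 2, which matches the guess of two wait states per pipeline refill.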

Hence my question, should a semiconductor manufacturer buying IPs from ARM meet the clock cycles for an instruction described in the reference manual?

From the above, the answer is NO. If they are an architecture licensee, the cycle counts may be different. They need to be binary compatible (but even this is not always the case). However, in your case, I believe they are meeting the documented counts; the documentation just needs to be applied fully to your use case by accounting for the memory wait states. The on-board SRAM could also have wait states. Typically only TCM is zero wait state.
