
I have this simple inline assembly code:

__asm__ volatile (

    ".equ GPIOA_ODR, 0x4001080C \n\t" //GPIOA base address is 0x40010800 and ODR offset is 0x0C


    //turns on PA8
    "ldr r1, =(1 << 8)     \n\t"        
    "ldr r2, =#GPIOA_ODR   \n\t"     
    "str r1, [r2]          \n\t"   

    //turn off PA8
    "ldr r1, =0            \n\t"        
    "ldr r2, =#GPIOA_ODR   \n\t"     
    "str r1, [r2]          \n\t"          

);

PA8 only oscillates at 2.4 MHz, but I want 36 MHz. I have reached 36 MHz before using timers, but because of some limitations I want to avoid them.

I don't understand why TIMER1 Channel 1 (PA8) can be configured for 36 MHz switching, but when I try to do the same in assembly, I only reach 2.4 MHz on the same pin.

I'm also setting up the pin using pinMode(PA8, OUTPUT);

I have tried other variations of this assembly code and only reached a maximum of 2.8 MHz on PA8. My question is: is a switching speed higher than 2.4-2.8 MHz on a GPIO pin not possible on the STM32F103C8?

(This is a followup question after Need Help Manipulating Registers in Inline Assembly (STM32F103 "BluePill"))

Peter Cordes
SirSpunk
  • There is quite a bit of overhead in general when using software, and your code could be more efficient if all you want is one pulse. Based on the edit to your last question, did you actually read that article? Note that the STM32F1 and STM32F4 are different chips with different performance. If the STM32F103C8 has DMA in front of the GPIO, then you can use that as the author did. – old_timer Jan 12 '20 at 23:32
  • If you want to mimic the software experiment the author did, then, as the author did, you need to understand the system better, as was possibly pointed out in your last question. In any case, there is no reason for the overhead you have created in your posted code. Set up the registers with address and data up front, then do a burst of ons and offs using a sequence of str instructions. Run this from flash, run it from RAM, run it in a loop with one on and one off per iteration (four instructions: str, str, subs, bne, all 16-bit Thumb, not Thumb-2). Then try it with more pairs of strs, say 4, 8, 16, 32. – old_timer Jan 12 '20 at 23:35
  • Examine the output on a scope and see how it behaves the first time through the loop when running from flash. On an ST part of the STM32F103's age, does it have their flash cache on it? What about the subsequent loops? Can you see the delay at the end of the loop (you should be able to)? What about a long linear run with no loop, etc.? How does the output compare to the system clock and the peripheral clock speeds? – old_timer Jan 12 '20 at 23:38
  • What if you use ldm and str, so that you can read the data from RAM and then pump it into the GPIO port? – old_timer Jan 12 '20 at 23:39
  • Understand that when you switch to a chip that is fast enough to do what you want (which is not the one you have), you will have to repeat all of this, as the timing may change. – old_timer Jan 12 '20 at 23:39
  • Your inline asm still steps on the compiler's toes, modifying registers without telling the compiler that you're doing so. You also waste instructions getting an address into a register (twice for the same address!) – Peter Cordes Jan 13 '20 at 01:16
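
One way to address the points raised in these comments is sketched below, assuming GCC-style extended inline assembly. The function name `pulse_once` and its parameters are hypothetical and do not come from the question. The operand lists let the compiler choose the registers and load the address only once, and the plain C branch exists only so the sketch also builds on a non-ARM host:

```c
#include <stdint.h>

/* Hypothetical helper: emit one on/off pulse on the port register `odr`.
 * `on_mask` is the value that drives the pin high (e.g. 1u << 8 for PA8). */
static inline void pulse_once(volatile uint32_t *odr, uint32_t on_mask)
{
#if defined(__arm__) || defined(__thumb__)
    uint32_t off = 0;
    __asm__ volatile (
        "str %[on],  [%[port]] \n\t"   /* pin high */
        "str %[off], [%[port]] \n\t"   /* pin low  */
        :                               /* no outputs */
        : [on] "r" (on_mask), [off] "r" (off), [port] "r" (odr)
        : "memory"                      /* tell the compiler memory was written */
    );
#else
    /* Host fallback: the same two stores through a volatile pointer. */
    *odr = on_mask;
    *odr = 0;
#endif
}
```

On the target it could be called as `pulse_once((volatile uint32_t *)0x4001080C, 1u << 8)`, using the GPIOA ODR address from the question.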

3 Answers


The STM32F103C8 runs at a maximum clock speed of 72 MHz, so 36 MHz is the maximum frequency that can be generated on a GPIO, since a separate clock cycle is needed to set and to clear the pin. This frequency can only be achieved with a timer.

If you try the same with code, you will need at least three instructions: two stores and one branch. These instructions require about 6 clock cycles to execute and will therefore result in a maximum frequency of about 12 MHz.

In order to achieve this in software, your code should look something like this:

while (1) {
    GPIOA->ODR = 1 << 8;
    GPIOA->ODR = 0;
}

Assembly code shouldn't be needed, as the compiler will generate optimal code on its own. It will look something like this (the exact constants depend on the pin and the device header used):

        ldr     r3, .L3
        movs    r1, #128
        movs    r2, #0
.L2:
        str     r1, [r3]
        str     r2, [r3]
        b       .L2
.L3:
        .word   1207959572

Update

I've tested it on a real world device and I'm getting a frequency of 8 MHz. My estimate was that 6 clock cycles are needed for the three instructions but it seems to require 9 cycles.

The generated code is more or less as expected:

7a:   60d9            str     r1, [r3, #12]
7c:   60da            str     r2, [r3, #12]
7e:   e7fc            b.n     7a <main+0x7a>

The scope clearly shows that all three instructions take the same amount of time.
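
The arithmetic behind these estimates is simply the CPU clock divided by the number of cycles that one on/off period costs. A minimal sketch, where the helper name `toggle_freq_hz` is hypothetical:

```c
#include <stdint.h>

/* Toggle frequency in Hz when one full on/off period of the pin
 * costs `cycles_per_period` CPU cycles at a clock of `cpu_hz`. */
static inline uint32_t toggle_freq_hz(uint32_t cpu_hz, uint32_t cycles_per_period)
{
    if (cycles_per_period == 0) {
        return 0;  /* avoid dividing by zero on bad input */
    }
    return cpu_hz / cycles_per_period;
}
```

With a 72 MHz clock, 6 cycles per period gives the estimated 12 MHz, 9 cycles gives the measured 8 MHz, and the theoretical minimum of 2 cycles gives 36 MHz.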

Codo
  • "This frequency can only be achieved with a timer" I want to understand WHY this is the case. I understand the fundamental 36 MHz limit, but I don't understand why TIMER1 and GPIOA, while on the same bus, can't reach the same speeds. You also say 12MHz should be possible with your first example code. However with that same code I could still only achieve 2.8MHz on PA8, so what is holding me back here from 12MHz? – SirSpunk Jan 13 '20 at 00:47
  • @SirSpunk: 36MHz is obviously only possible if literally every instruction that executes is a store, because your 72MHz CPU only executes 1 instruction per cycle. There's no room for any loop overhead so you have to massively unroll, I guess? Or can it run an unconditional branch in the same cycle as a store? Anyway, the code in your question is obviously slow because you use `ldr reg, =#address` twice (where a compiler can't hoist it out of a loop), and presumably put that inside a C loop. Like people told you last question, using `volatile` will make faster code than you wrote. – Peter Cordes Jan 13 '20 at 01:21
  • @SirSpunk: Or did you mean with the code in this answer? Did you forget to enable optimization when you compiled? – Peter Cordes Jan 13 '20 at 01:21
  • @PeterCordes The problem is that I've tried "static volatile uint16 *GPIO_ODR = (volatile uint16 *)&GPIOA_BASE->ODR;" (globally defined) with "volatile uint16* odr = (GPIO_ODR); *odr = (1 <<8); *odr = 0;" in the loop function and let the compiler do its thing, still 2.8MHz. If 36MHz is not possible without using the timer, then at least suggest why I can't get any higher frequencies than 2.8MHz on PA8. – SirSpunk Jan 13 '20 at 06:27
  • @SirSpunk: The reason 36 MHz can only be achieved with a timer is explained in the second paragraph of my answer. The shortest code that sets the pin to high and to low takes about 6 clock cycles. So clock frequency of 72 MHz divided by 6 cycles is 12 MHz. Therefore: hardware (timer, possibly DMA or SPI clock line) solution: 36 MHz; software solution 12 MHz. – Codo Jan 13 '20 at 07:00
  • Ah, nevermind my earlier comment, then. I thought "with a timer" meant setting up a timer that interrupted an infinite loop of store instructions, but apparently you can configure it to toggle the pin based on a hardware clock so you don't need to run store instructions at 1/clock, and that's impossible. – Peter Cordes Jan 16 '20 at 21:01

I might be completely off here as I do not code for your platform...

Anyway, the time when GPIOs were mapped directly to memory or registers is long gone. Modern MCUs connect their GPIO module to the CPU core through an interface (usually memory-mapped registers) where you enqueue GPIO commands instead of directly manipulating the GPIO bits.

A timer bypasses this interface, hence the better speed. However, if that is the case, there are ways to improve the speed of GPIO polling by the MCU:

  1. GPIO API clock

    The API (interface) between the MCU CPU core and the GPIO module is usually driven by a separate clock. If it is set to a slow speed, the GPIO will also be slow regardless of the MCU clock or the GPIO capabilities.

    So try to find it and raise it as much as you can.

  2. GPIO groups

    The GPIO pins are usually grouped into PORTS which share the same API registers, so it's usually possible to handle all pins in the same group at once, at the same speed as handling a single pin. If you choose the pins you use carefully, you can improve the polling frequency a lot.

    So if possible, use just a single group: compute the operation for all pins, then use the GPIO API to set/clear/toggle them all at once instead of one by one.

  3. DMA

    Some MCUs allow DMA between memory and GPIO, which lets you bypass the GPIO API and obtain speeds similar to timers. Simply create a memory buffer with all the bit states precomputed at some sampling rate, then use DMA to "play" it on the GPIO, much like playing a WAV file on a sound card...

  4. not using GPIO

    Some MCUs are simply not built for GPIO speed but for more computation power or a different purpose, in which case no matter what you do you will not improve the GPIO speed by much. In such cases the MCUs are usually equipped with interfaces for connecting to other HW such as external memory, IDE, LCDs, SPI, USART, etc.

    Some of those can be used instead of GPIO. Interfaces for external memory are usually fast and DMA capable, enabling fast transfer speeds even if GPIO is too slow... For example see: VGA pixel grouping on STM32

    Just for comparison, I am used to AVR32 UC3 MCUs, which on a ~66MHz CPU clock have a ~5 MHz GPIO toggling frequency (polling)... but by using interfaces instead I can get even a 33MHz sample rate...

    The problem is that such interfaces usually do not have many pins at their disposal, and sometimes they are shared or time-mapped as buses, in which case you sometimes need to add some additional parts to your HW (like a diode+capacitor, a LATCH, or a (DE)MUX...) to avoid glitching.
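
Point 2 above can be sketched in C as follows. This is a minimal sketch, assuming the port exposes a single memory-mapped output register; the helper name `port_write_masked` and its parameters are hypothetical:

```c
#include <stdint.h>

/* Update several pins of one port with a single register write.
 * `mask` selects which pins to change, `values` holds their new states;
 * pins outside `mask` keep their current state. */
static inline void port_write_masked(volatile uint32_t *odr,
                                     uint32_t mask, uint32_t values)
{
    uint32_t current = *odr;                      /* one read of the port  */
    *odr = (current & ~mask) | (values & mask);   /* one write for all pins */
}
```

The cost is one read and one write no matter how many pins in the group change, which is the saving described in point 2.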

Spektre

I want to answer my own question because getting a solid answer on this was frustrating and the results are NOT obvious until you do real world testing. Old_timer's comments proved to be the most helpful.

void setup() {
    #define FLASH_ACR (*(volatile uint32_t *)(0x40022000))
    FLASH_ACR = 0b110010; // enable flash prefetch and wait states

    pinMode(PA8, OUTPUT);

    __asm__ volatile (
        "ldr r0, =(0x4001080C) \n\t" // GPIOA ODR
        "ldr r1, =(1<<8)       \n\t" // value that turns PA8 on
        "ldr r2, =0            \n\t" // value that turns PA8 off
        ".loop:                \n\t"
        "str r1, [r0]          \n\t" // ON and OFF commands are unrolled (repeated) about 100 times
        "str r2, [r0]          \n\t" // inside the loop
        "b .loop               \n\t"
    );
}

With the MCU running at 72 MHz, I got very close to an 18 MHz toggle speed on PA8 using the above code. It is to my understanding that using immediate values or the XOR instruction could toggle the pin faster (among other things you could possibly do), because certain instructions or coding methods use fewer clock cycles and therefore run faster.
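
The effect of unrolling can be estimated with a rough model, offered here as a sketch rather than a measured result. Each loop iteration holds some number of on/off store pairs plus one branch, and the average toggle frequency is the number of pairs divided by the time per iteration. The helper name and the per-instruction cycle counts are assumptions, to be replaced with values measured on a scope:

```c
#include <stdint.h>

/* Average toggle frequency in Hz for a loop that contains `pairs`
 * on/off store pairs followed by one branch.
 * `cycles_per_store` and `cycles_per_branch` are assumed costs,
 * not datasheet figures. */
static inline uint32_t unrolled_toggle_freq_hz(uint32_t cpu_hz,
                                               uint32_t pairs,
                                               uint32_t cycles_per_store,
                                               uint32_t cycles_per_branch)
{
    /* Two stores per pair, plus the branch once per iteration. */
    uint64_t loop_cycles = 2ull * pairs * cycles_per_store + cycles_per_branch;
    if (loop_cycles == 0) {
        return 0;  /* avoid dividing by zero on bad input */
    }
    return (uint32_t)(((uint64_t)cpu_hz * pairs) / loop_cycles);
}
```

For example, assuming 2 cycles per store and 2 per branch, 100 unrolled pairs at 72 MHz average roughly 17.9 MHz, which is in line with the result above. The pulse train is not perfectly uniform, because the branch stretches one low period per iteration.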

If you also look at the STM32F103 datasheet:

https://www.st.com/resource/en/datasheet/stm32f103c8.pdf

You will see on page 20, section 2.3.21, that it says "I/Os on APB2 with up to 18 MHz toggling speed", so I guess I am hitting a limit there if it's mentioned in the documentation. If you also glance at page 66, you'll see a table of the "I/O AC characteristics" showing that you could go up to 50MHz.

So after reaching almost 18 MHz, I decided to overclock the board to 128 MHz and achieved almost a 32 MHz toggle speed on PA8, with 1.6 VDC on the pin. Now I'm satisfied; thanks for all the comments and help, guys. I'm still a beginner at this, but I think I understand a lot of it now.

SirSpunk
  • *It is to my understanding that using immediate values or the XOR instruction could toggle the pin faster* That doesn't make sense; ARM doesn't have immediate value stores, or memory-destination XOR. Back-to-back `str` instructions (with unrolling to hide loop overhead) are the only way to come close to max CPU-driven throughput. This means setting up registers *outside* the loop, like you're doing in this answer. – Peter Cordes Jan 25 '20 at 04:44
  • you do not need assembler for that https://godbolt.org/z/C2BtZe – 0___________ Jan 25 '20 at 09:24
  • @PeterCordes You're right, I made a mistake there with this specific CPU. But on different CPUs I've heard you could use different tricks, like using XOR to toggle something faster inside a loop because it uses fewer clock cycles. Or using immediate values can save a few clock cycles. Is that true under certain situations, or am I mistaken? – SirSpunk Jan 25 '20 at 20:22