
If the access to the Flash memory is done starting from the address 0x0200 0000, it is performed automatically via the ITCM bus. The ART accelerator™ should be enabled to get the equivalent of 0-wait-state access to the Flash memory via the ITCM bus. The ART is enabled by setting bit 9 in the FLASH_ACR register, while the ART prefetch is enabled by setting bit 8 in the same register.
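
For reference, a minimal sketch of setting those two bits in C; the bit positions (9 and 8) are the ones quoted above, and the FLASH register base address (0x4002 3C00) is the STM32F7 one, so double-check it against your part's reference manual:

#include <stdint.h>

#define FLASH_ACR   (*(volatile uint32_t *)0x40023C00u)  /* flash access control register */

static void enable_art(void)
{
    FLASH_ACR |= (1u << 9)    /* bit 9: ART accelerator enable */
               | (1u << 8);   /* bit 8: ART prefetch enable    */
}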

If I place my program code starting at 0x0200 0000, what would happen if the ART accelerator is not enabled? Would it be beneficial to just use the AXIM bus instead for the startup code, then enable the ART accelerator and point execution to the program region at 0x0200 0000?

I am just a bit confused.

https://www.st.com/content/ccc/resource/technical/document/application_note/0e/53/06/68/ef/2f/4a/cd/DM00169764.pdf/files/DM00169764.pdf/jcr:content/translations/en.DM00169764.pdf

Page 12

3 Answers


So let's just try it. NUCLEO-F767ZI

Cortex-M7s in general:

Prefetch Unit
The Prefetch Unit (PFU) provides:
• 64-bit instruction fetch bandwidth.
• 4x64-bit pre-fetch queue to decouple instruction pre-fetch from DPU pipeline operation.
• A Branch Target Address Cache (BTAC) for the single-cycle turn-around of branch predictor state and target address.
• A static branch predictor when no BTAC is specified.
• Forwarding of flags for early resolution of direct branches in the decoder and first execution stages of the processor pipeline.

For this test the branch prediction gets in the way so turn that off:

Set ACTLR to 00003000 (hex, most numbers here are hex)
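
In case it helps, this is roughly what that looks like in C. It is a sketch: 0xE000E008 is the standard Cortex-M7 ACTLR address, and 0x00003000 is simply the value used above, assumed to be the two BTAC-disable bits (check the Cortex-M7 TRM for the exact bit names):

#include <stdint.h>

#define ACTLR   (*(volatile uint32_t *)0xE000E008u)  /* Cortex-M7 auxiliary control register */

static void disable_branch_prediction(void)
{
    ACTLR = 0x00003000u;                        /* the value used for this test */
    __asm volatile ("dsb\n\tisb" ::: "memory"); /* ensure the change takes effect */
}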

I don't see how to disable the PFU; I wouldn't expect to have control like that anyway.

So we expect the prefetch to read 64 bits at a time, 4 instructions on an aligned boundary.

From ST

The nDBANK option bit is set, indicating a single bank

Instruction prefetch

In case of single bank mode (nDBANK option bit is set) 256 bits representing 8 instructions of 32 bits to 16 instructions of 16 bits according to the program launched. So, in the case of sequential code, at least 8 CPU cycles are needed to execute the previous instruction line read.

So ST is going to turn that into a 256-bit read, i.e. up to 16 (16-bit) instructions

Timing uses the SysTick timer. I am running at 16 MHz, so the flash is at zero wait states.
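
The SysTick setup itself isn't shown here; a minimal sketch of the usual free-running configuration (standard Cortex-M SysTick register addresses, maximum 24-bit reload, core clock as the source) would be:

#include <stdint.h>

#define SYST_CSR   (*(volatile uint32_t *)0xE000E010u)  /* control and status */
#define SYST_RVR   (*(volatile uint32_t *)0xE000E014u)  /* reload value       */
#define SYST_CVR   (*(volatile uint32_t *)0xE000E018u)  /* current value      */

static void systick_init(void)
{
    SYST_RVR = 0x00FFFFFFu;            /* full 24-bit range */
    SYST_CVR = 0;                      /* clear the counter */
    SYST_CSR = (1u << 2) | (1u << 0);  /* CLKSOURCE = processor clock, ENABLE, no interrupt */
}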

08000140 <inner>:
 8000140:   46c0        nop         ; (mov r8, r8)
 8000142:   46c0        nop         ; (mov r8, r8)
 8000144:   46c0        nop         ; (mov r8, r8)
 8000146:   46c0        nop         ; (mov r8, r8)
 8000148:   46c0        nop         ; (mov r8, r8)
 800014a:   46c0        nop         ; (mov r8, r8)
 800014c:   3901        subs    r1, #1
 800014e:   d1f7        bne.n   8000140 <inner>

00120002

So 12 clocks per loop. Two prefetches from the ARM; the first one becomes a single ST fetch. It should be zero wait states. Note the address: this is AXIM (0x0800 0000).
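
(For what it's worth, these counts are consistent with an iteration count of 0x10000, which is an assumption on my part: 0x00120002 ticks over 0x10000 iterations is about 0x12 clocks per loop, plus a couple of ticks of sampling overhead.)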

If I reduce the number of nops it stays at 0x1200xx until here:

08000140 <inner>:
 8000140:   46c0        nop         ; (mov r8, r8)
 8000142:   46c0        nop         ; (mov r8, r8)
 8000144:   3901        subs    r1, #1
 8000146:   d1fb        bne.n   8000140 <inner>

00060003

One ARM fetch instead of two; the time is cut in half, so the prefetch is dominating our performance.

08000140 <inner>:
 8000140:   46c0        nop         ; (mov r8, r8)
 8000142:   46c0        nop         ; (mov r8, r8)
 8000144:   46c0        nop         ; (mov r8, r8)
 8000146:   46c0        nop         ; (mov r8, r8)
 8000148:   3901        subs    r1, #1
 800014a:   d1f9        bne.n   8000140 <inner>

000 (zero wait states)

00120002

001 (1 wait state)

00140002

002 (2 wait states)

00160002

202 (2 wait states, ART enabled)

0015FFF3

Why would that affect AXIM?

So each wait state adds 2 clocks per loop. There are two fetches per loop, so perhaps each fetch causes ST to do one of its 256-bit reads, which seems broken though.

Switch to ITCM

00200140 <inner>:
  200140:   46c0        nop         ; (mov r8, r8)
  200142:   46c0        nop         ; (mov r8, r8)
  200144:   46c0        nop         ; (mov r8, r8)
  200146:   46c0        nop         ; (mov r8, r8)
  200148:   3901        subs    r1, #1
  20014a:   d1f9        bne.n   200140 <inner>

000

00070004

001

00080003

002

00090003

202

00070004

RAM

00070003

So ITCM alone, zero wait states, ART off, is 7 clocks per loop for a 6-instruction loop with a branch; that seems reasonable. For this tiny test, turning on the ART with 2 wait states puts us back at 7 per loop.

Note that from RAM this code runs at 7 per loop as well. Let's try another couple of settings:

00F

00230007

20F

00070004

I didn't look for branch predictors other than the BTAC.

First thing to note: you don't ever want to run an MCU faster than you have to. It burns power, often you have to add flash wait states, and often the CPU and the peripherals have different maximum clock speeds, so there is a boundary where things become non-linear (something takes X clock cycles at a slow clock rate with peripheral clock = CPU clock; up to a point running the CPU N times faster still takes N*X clocks, but past one or more boundaries it takes more than N*X clocks). This particular part has this non-linear issue. If you are using libraries from ST to set the clock, then you are possibly getting worst-case flash wait states, whereas if you set it up yourself and read the documentation you might be able to shave off one or a few.
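
As a hedged illustration of that last point (the LATENCY field occupying the low four bits of FLASH_ACR is the STM32F7 layout; the correct value for a given clock and voltage range has to come from the reference-manual table):

#include <stdint.h>

#define FLASH_ACR   (*(volatile uint32_t *)0x40023C00u)

static void flash_set_latency(uint32_t wait_states)
{
    uint32_t acr = FLASH_ACR;
    acr &= ~0xFu;                     /* clear LATENCY[3:0] */
    acr |= (wait_states & 0xFu);      /* e.g. 0 for the 16 MHz case in this test */
    FLASH_ACR = acr;
    while ((FLASH_ACR & 0xFu) != (wait_states & 0xFu)) { /* wait until it takes effect */ }
}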

The Cortex-M7 has optional L1 caches. I didn't mess with them this time around, but ST had this ART thing before these cores came out, and I believe it defeats/disables the I-cache at least; would it make things better or worse to have both? If this part has the I-cache, that would make the first pass slow and the remaining passes possibly faster, even in AXIM space. You are welcome to try it. I seem to remember they did something tricky with a strap on the processor core, and it wasn't easy to see how it was defeated; that may not have been this chip/core, but it was definitely ST. The M4 doesn't have a cache, so it would have to have been an M7 that I messed with (this one in particular).

So the short answer is that the performance isn't that horrible if you leave off the ART and/or run out of AXIM. ST has implemented the flash such that the ITCM interface is faster than AXIM. We can see the effects of the ARM's own fetching, and if you turn branch prediction on you can see its effects as well.

It shouldn't be difficult to create a benchmark that defeats these features, just like you can make one that makes the L1 caches (or any other cache) hurt performance. The ART, like any other cache, makes performance less predictable, and as you change your code (add a line, remove a line) the performance can jump anywhere from no change to a lot as a result.

Depending on the processor, fetch sizes, and alignments, your code's performance can vary just from adding or removing code above the performance-sensitive part(s) of the project, but that depends on factors we rarely have visibility into.

It is hard to tell, but it looks like they are claiming that the ART reduces power. I would expect it to increase power, having those SRAMs on/clocked. I don't see an obvious number for how much you save if you turn off the flash and run from RAM. The M7 parts are not really meant to be the low-power parts like some STM32L parts, where you can get down to ones/tens of microamps (micro, not milli; been there, done that).

The small extra counts (0x70004 instead of 0x70000) have to do with some of the fetching overhead, be it ARM or ST or a combination of the two. To see memory/flash performance you need to disable as many features as you can: branch prediction, caches, whatever can be turned off. Otherwise it's hard to measure performance and then make assumptions about what the flash/memory/bus is doing. I suspect there are still things I didn't turn off to make a clean measurement, and/or can't turn off. And simple NOP loops (I tried other, non-NOP instructions; it didn't change anything) won't tell you everything. Using the docs as a guide, you can try to cache-thrash the ART or the other caches and see what kind of hit you take.

For performance-critical code you can run from RAM and avoid all of these issues. I didn't search for it, but I assume these parts' SRAM can run as fast as the CPU; the answer isn't jumping out at me, you can figure it out.

Note my test actually looks like this:

    ldr r2,[r0]        @ sample SysTick current value (start)
inner:
    nop
    nop
    nop
    nop
    subs r1,#1         @ decrement the loop count and set the flags
    bne inner
    ldr r3,[r0]        @ sample SysTick current value (end)
    sub r0,r2,r3       @ SysTick counts down, so start - end = elapsed ticks
    bx lr

where the sampling of SysTick is just in front of and just behind the loop, before the branch. To measure the ART you would want to sample the time before branching into a memory range that has not been read yet; it is not magically possible to read that faster, the first read into the cache should be slower. If I move the time sampling further away I can see it go from 0x7000A to 0x70027 for 0 to 15 wait states with the ART on. That is a noticeable performance hit for branches into code that has not been run/cached yet. Knowing the size of the ART fetches, it should be easy to make a test that hops around a lot so that the ART feature starts to not matter.
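
For completeness, this is roughly how such a routine might be driven from C; the function name, the 0x10000 iteration count, and the use of the SysTick current-value register are my assumptions, since only the inner routine is shown above:

#include <stdint.h>

#define SYST_CVR_ADDR  0xE000E018u   /* SysTick current value register */

/* the assembly above: r0 = address to sample, r1 = loop count,
   returns start - end (SysTick counts down, so that is the elapsed ticks) */
extern uint32_t timing_loop(volatile uint32_t *tick, uint32_t count);

uint32_t measure(void)
{
    return timing_loop((volatile uint32_t *)SYST_CVR_ADDR, 0x10000u);
}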


Short answer: the ITCM is a different bus interface on the ARM core, and ST has implemented their design such that there is a performance gain. So even without the ART enabled, using ITCM is faster than AXIM (likely an ARM bus thing, not an ST flash thing). If you are running at clock rates fast enough to have to add wait states to the flash, then the ART can mostly erase those.

old_timer
  • Do I read these numbers correctly (00F/00230007 and 20F/00070004): that disabling the ART caused a 5x performance degradation on your simple loop with 15 wait states? This would be an important result for those who use an F7 at 216 MHz. – A.K. Feb 03 '20 at 13:26
  • No; if you look at the table, at 216 MHz you need either 7, 8, or 9 wait states depending on your voltage range (something a library can tell, so I assume 9). At the same time, yes, you have to add wait states to go fast, but compared to AXIM, ITCM is still going to be noticeably faster, one would assume, based on the documentation and the above experiment. 15 was done for demonstration purposes. The processor will consume those 4 or 8 instructions at a very high rate compared to going slower with fewer wait states. And then there is maybe an L1 cache too... – old_timer Feb 03 '20 at 13:55
  • Yes, using 15 demonstrated the ART, but even with the ART I demonstrated it isn't as magic as implied: you don't get effectively 0-wait-state performance, only on code that is re-run. I would expect it to be on par with an L1 cache; the ART is basically an L2 cache. It might be faster than an L1 (if an L1 is implemented in these parts) if ST designed it in a way for that, since they control the ART and the flash geometry/interface. – old_timer Feb 03 '20 at 13:59
  • From the question it doesn't make sense (to me) to start on AXIM and then switch just because the ART is off. It seems like you can just turn on the ART whenever; it's when you play with changing nDBANK while running that there are procedures for stopping/clearing the ART. If you want to use the ART, then put the performance-sensitive code in that address range and enable it. If you run without it, ITCM still outperforms AXIM. If the flash is big enough, any code that doesn't fit within ITCM would then need to be in the 0x08000000 range and be PIC or built for that space. – old_timer Feb 03 '20 at 14:03

I THINK that the question is much simpler than what the other answers assume.

If you are thinking about doing stuff like putting your program somewhere other than simply in flash: don't. As ST says, with the ART the performance will be very close to "zero wait state". So don't worry about it; anything else you try to do is not going to be faster than that.

rew

Q. If I place my program code starting at 0x0200 0000, what would happen if the ART accelerator is not enabled?

A. Program execution (instruction fetch and constant access) will be painfully slow, with a crazy number of wait cycles (15?).

[ UPD. I have to correct that this applies more to configurations with a high clock frequency, e.g. 15 wait states are needed for 216 MHz. With lower frequencies the flash access penalty will be less significant, and minimal at 16 MHz. We do not know what frequency is used by the O.P. ]

[ UPD2. At most 9 wait states are needed at 216 MHz, sorry. ]

Q. Which bus is preferable for flash code access, AXI or ITCM?

A. The voluminous document you referred to includes some performance measurements that also compare various code placement options. The results differ somewhat between processor models because the cache sizes and bus widths are different, and your code will likely be affected differently again. My takeaway from this paper is that, unless your code is performance-critical, both options work reasonably well. However, having two parallel buses with caches enables you to do creative things like partitioning your code into pieces and allocating them to separate buses, so that critical but rarely used code is not evicted from the cache. I mean, if you really need that.
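
As a hedged sketch of that kind of partitioning (the section name, and the assumption that your linker script places .itcm_text at the ITCM flash alias 0x0200 0000, are mine, not from the paper):

/* keep the hot code on the ITCM path, leave cold code on the default AXIM path */
__attribute__((section(".itcm_text"), noinline))
void hot_path(void)
{
    /* performance-critical work */
}

void cold_setup(void)   /* stays in the default .text section */
{
    /* rarely executed initialization */
}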

A.K.