
There are lots of NOR QSPI flash chips that support XIP (eXecute In Place). In this mode the embedded CPU (or MCU) can directly execute code stored in the flash. But as we know, a QSPI flash can only output 4 bits of data per clock cycle, while many MCUs, such as the ARM Cortex-M series, need a 32-bit instruction per cycle. So the MCU has to wait at least 8 flash clock cycles to get a valid instruction, which seems very slow. Besides, the maximum frequency of a NOR QSPI flash chip is often below 150 MHz while the STM32F407 runs at 168 MHz, which means an even longer delay, counted in CPU cycles, before the CPU receives a valid instruction.
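To make that arithmetic concrete, here is a rough sketch in Python. The function names and the 84 MHz flash clock in the usage note are assumptions chosen only for illustration; the model ignores command, address, and dummy cycles unless they are passed in as `overhead_clocks`.

```python
import math


def flash_clocks_per_word(word_bits=32, lanes=4):
    """Flash clock cycles needed to shift out one word over `lanes` data lines."""
    return math.ceil(word_bits / lanes)


def cpu_cycles_per_word(cpu_hz, flash_hz, word_bits=32, lanes=4, overhead_clocks=0):
    """Approximate CPU cycles spent waiting for one word from the flash.

    `overhead_clocks` is a hypothetical knob for command/address/dummy
    cycles; it defaults to 0, which models an already-running burst read.
    """
    flash_clocks = flash_clocks_per_word(word_bits, lanes) + overhead_clocks
    return flash_clocks * cpu_hz / flash_hz
```

With these assumptions, `flash_clocks_per_word()` gives the 8 cycles mentioned above, and `cpu_cycles_per_word(168_000_000, 84_000_000)` gives 16 CPU cycles for a 168 MHz core reading from a flash clocked at half that speed.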

I don't know if my understanding is wrong, but I really couldn't find many details about XIP. The Technical Reference Manuals of the STM32Fxxx only say that the parts have embedded flash and support XIP, but they don't give any details. Besides, I guess we would also need to implement a very complicated QSPI controller in the MCU to support XIP.

Can anyone give me some guidance on this question?

HYF

1 Answer


As far as I know, the MCU uses a buffer in RAM: it reads instructions from the external flash into that buffer and then executes them from there. It reads them in chunks. The size of one chunk depends very much on each vendor's implementation (i.e. how much RAM is available; how the flash is connected: SPI, Dual SPI, Quad SPI, Octal SPI; whether Direct Memory Access (DMA) is possible; whether the flash supports Continuous Read Mode). If the chunk is small, the core stalls waiting for instructions. If the chunk is large, it uses up RAM, and on a branch the chunks that were already loaded into RAM have to be thrown away and reloaded with the new code.

So let's say the flash is connected with Dual SPI and DMA is possible. Then for XiP the controller would start by executing some bootloader code (normally from some internal ROM memory). The bootloader sets up the QSPI flash controller and the core's DMA to copy instructions from the external flash to the RAM buffer. Then it would start executing the code in that buffer. The DMA would now asynchronously copy instructions to RAM. This means the actual MCU core wastes almost no time in copying code.
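The chunk-size trade-off can be illustrated with a toy model. This is only a sketch in Python, assuming a single-chunk buffer and made-up address traces; `count_refills` is a hypothetical helper, not any vendor's actual scheme.

```python
def count_refills(addresses, chunk_size):
    """Count how often a single-chunk RAM buffer must be refilled from flash.

    `addresses` is the sequence of instruction fetch addresses (in bytes);
    a refill happens whenever the fetch falls outside the loaded chunk.
    """
    loaded_chunk = None
    refills = 0
    for addr in addresses:
        chunk = addr // chunk_size
        if chunk != loaded_chunk:
            loaded_chunk = chunk
            refills += 1
    return refills


# Straight-line code: 64 consecutive 4-byte instructions.
linear = list(range(0, 256, 4))

# A loop that keeps jumping between two routines 1 KiB apart.
branchy = [0, 4, 1024, 1028] * 8
```

In this model a bigger chunk helps straight-line code (`count_refills(linear, 16)` is 16, `count_refills(linear, 64)` is 4), but it does nothing for the branchy trace, which needs 16 refills with either chunk size because every jump evicts the chunk that was just loaded.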

You said that you could not find many details about XiP. The best source of info for me was the Application Notes of various manufacturers. The implementations are different but have a lot in common.

Here are 3 example documents:

user10607
  • Thanks, but I am still confused. I'm considering the situation where all the user application code is stored in an external QSPI flash, except the bootloader code. There is no way that the NOR flash runs faster than the MCU. Many high-end MCUs such as the STM32H7xx can run at 400 MHz. The QSPI flash can only output 4 bits every cycle. Besides, the QSPI hardware controller must send command bytes to the flash before reading data. As a result, it might cost tens of cycles for the MCU core to get a complete 32-bit instruction, which seems extremely slow. – HYF Apr 17 '19 at 16:28
  • The second doc that I added to my answer has a speed comparison, and I was surprised to see that XiP is not that far behind RAM. So they must use some pretty clever schemes to get the performance on par with RAM. If you want to really understand the low-level optimizations that make this possible, I think it's going to require waaaay more than one SO question & answer :) – user10607 Apr 18 '19 at 11:24
  • It is just the normal instruction cache of the CPU. That is really all the magic you need to make XIP fast. – Timmy Brolin Aug 19 '22 at 22:20
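
To put a rough number on the instruction-cache point from the last comment, here is a minimal sketch in Python. The hit rate and the 40-cycle miss penalty are assumed values for illustration only, not measurements from any particular MCU.

```python
def average_fetch_cycles(hit_rate, miss_penalty_cycles, hit_cycles=1):
    """Average CPU cycles per instruction fetch with an instruction cache.

    A hit costs `hit_cycles`; a miss costs `miss_penalty_cycles`, which
    stands in for the slow read from the external QSPI flash.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles
```

Under these assumptions, a 95% hit rate with a 40-cycle miss penalty averages out to roughly 3 cycles per fetch instead of 40, which is why tight loops running from external flash can get close to RAM speed once they are cached.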