The ARM ARM doesn't actually say much about how this instruction is meant to be used, but I've seen it used elsewhere and gather that it takes an address as a hint about where the next read will come from.
My question is: given a tight 256-byte copy loop of ldm/stm instructions, say r4-r11 x 8, would it be better to prefetch each cache line before the copy, prefetch in between each instruction pair, or not prefetch at all, since the memcpy in question isn't reading and writing the same area of memory? I'm pretty sure my cache line size is 64 bytes, but it may be 32 bytes; I'm awaiting confirmation on that before writing the final code.
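
To make the "in between" option concrete, here is a rough sketch in C rather than assembly. It is not the final code, just an illustration under assumptions: it relies on the GCC/Clang builtin __builtin_prefetch (which these compilers typically lower to a preload hint on ARM), it assumes the unconfirmed 64-byte line size, and the names copy256_prefetch, CACHE_LINE and BLOCK are made up for this example. The inner memcpy stands in for the ldm/stm pairs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64u  /* assumption: 64-byte lines; use 32u if that is confirmed */
#define BLOCK      256u /* size of the copy discussed above */

/* Hypothetical helper: copy one 256-byte block, issuing a read-prefetch
   hint for the following source line before copying the current one. */
void copy256_prefetch(void *dst, const void *src)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    assert(d != NULL && s != NULL);

    for (size_t off = 0; off < BLOCK; off += CACHE_LINE) {
        /* Hint only: rw = 0 (read), locality = 0 (streaming data).
           On the last pass this points one past the block, which is
           harmless because a prefetch hint does not fault. */
        __builtin_prefetch(s + off + CACHE_LINE, 0, 0);

        /* Stand-in for the ldm/stm pairs covering one cache line. */
        memcpy(d + off, s + off, CACHE_LINE);
    }
}
```

The "prefetch everything first" option would instead issue all four hints (for 64-byte lines) ahead of the loop. Which of the three is faster is exactly what I can't tell from the documentation, so it would need measuring on the actual core.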