4

The ARM ARM doesn't actually give much guidance on the proper usage of this instruction, but from seeing it used elsewhere I know that it takes an address as a hint about where the next read will occur.

My question is: given a tight 256-byte copy loop of ldm/stm instructions (say, r4-r11, eight times), would it be better to prefetch each cache line before the copy starts, to issue a prefetch between each instruction pair, or to skip prefetching entirely, since the memcpy in question reads and writes different areas of memory? I'm fairly sure my cache line size is 64 bytes, but it may be 32 bytes - I'm awaiting confirmation on that before writing the final code here.
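A sketch of the two variants I'm weighing (register choices are illustrative, and the 64-byte line size is an assumption pending confirmation):

  @ Option A: prefetch every source line up front, then run the plain loop.
  pld     [r0, #0]               @ r0 = source pointer (illustrative)
  pld     [r0, #64]
  pld     [r0, #128]
  pld     [r0, #192]

  @ Option B: interleave a PLD with each ldm/stm pair instead.
  ldmia   r0!, {r4-r11}          @ 8 registers = 32 bytes
  pld     [r0, #64]              @ hint the line we will need shortly
  stmia   r1!, {r4-r11}          @ r1 = destination pointer (illustrative)
  @ ... repeated 8 times for the full 256 bytes ...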

artless noise
Michael Dorgan

3 Answers

5

From the Cortex-A Series Programmer's Guide, chapter 17.4 (NB: some details might be different for ARM11):

Best performance for memcpy() is achieved using LDM of a whole cache line and then writing these values with an STM of a whole cache line. Alignment of the stores is more important than alignment of the loads. The PLD instruction should be used where possible. There are four PLD slots in the load/store unit. A PLD instruction takes precedence over the automatic pre-fetcher and has no cost in terms of the integer pipeline performance. The exact timing of PLD instructions for best memcpy() can vary slightly between systems, but PLD to an address three cache lines ahead of the currently copying line is a useful starting point.
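As a rough illustration of that advice (my own sketch, not code from the guide; register usage and a 64-byte line size are assumptions), an inner loop following it might look like:

  @ Assumed: r1 = source, r0 = destination, r2 = byte count (multiple of 64).
1:
  pld     [r1, #192]             @ preload three cache lines (3 x 64 bytes) ahead
  ldmia   r1!, {r3-r10}          @ first 32 bytes of the current line
  stmia   r0!, {r3-r10}
  ldmia   r1!, {r3-r10}          @ remaining 32 bytes of the line
  stmia   r0!, {r3-r10}
  subs    r2, r2, #64
  bgt     1b                     @ loop until the count is exhausted

(In real code r4-r10 are callee-saved and would need to be preserved around the loop.)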

Igor Skochinsky
  • 3 cache lines ahead... So with a copy of 256 bytes on a 64-byte cache line, it sounds like I should just go ahead and prefetch all 256 bytes. Hmm. Didn't think to look in the Cortex manuals; probably should have. Thanks for the quick answer. – Michael Dorgan Jun 20 '11 at 16:59
  • Turned out my system has a 32-byte cache line - and that our bus is SLOW. So interleaving prefetches between block instructions to hide the delay was the way to go. It also turned out that copying more than 16 bytes at a time got ahead of the cache too far and slowed things down as well. ARM can be a beautiful language, but it still requires a proper memory subsystem to be used well. – Michael Dorgan Jun 30 '11 at 18:52
  • See http://infocenter.arm.com/help/topic/com.arm.doc.faqs/ka13544.html for actual code examples and benchmarks. – FrankH. Dec 14 '12 at 11:42
  • I found this useful and interesting: http://armneon.blogspot.mx/2013/07/neon-tutorial-part-1-simple-function_13.html – Josejulio Jul 21 '15 at 14:53
3

An example of a reasonably generic copy loop that makes use of cacheline-sized LDM/STM blocks and/or PLD where available can be found in the Linux kernel, arch/arm/lib/copy_page.S. That implements what Igor mentions above, regarding the use of preloads, and illustrates the blocking.

Note that on ARMv7 (where the cache line size is usually 64 bytes) it's not possible to LDM a full cache line as a single op (there are only 14 registers you could use, since SP/PC can't be touched for this). So you might have to use two or four LDM/STM pairs per line, as sketched below.
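For instance, a sketch (my own, not the kernel routine) of handling each 64-byte line as four 16-byte LDM/STM pairs, which keeps the register list small:

  @ Assumed: r1 = source, r0 = destination, r2 = byte count (multiple of 64).
  @ r4/r5 are callee-saved and would need to be preserved in real code.
1:
  pld     [r1, #128]             @ stay a couple of lines ahead of the loads
  ldmia   r1!, {r3, r4, r5, ip}  @ 16 bytes
  stmia   r0!, {r3, r4, r5, ip}
  ldmia   r1!, {r3, r4, r5, ip}
  stmia   r0!, {r3, r4, r5, ip}
  ldmia   r1!, {r3, r4, r5, ip}
  stmia   r0!, {r3, r4, r5, ip}
  ldmia   r1!, {r3, r4, r5, ip}
  stmia   r0!, {r3, r4, r5, ip}
  subs    r2, r2, #64
  bgt     1b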

FrankH.
  • You can use SP in the register list; you will just have to store it somewhere safe and ensure that your context-switch code isn't careless enough to use the SP from your execution domain (USR/SYS). – sgupta Dec 12 '12 at 06:18
  • If you "store `SP` somewhere" so you "free" the reg for use with `LDM` / `STM`, that "somewhere" usually ends up being a global location or a `PC`-relative position. Both make your code non-threadsafe, and the latter requires writeable code (which usually isn't the case). Acceptable in some circumstances, but not in others. Also note that you need one register for the `LDM` / `STM` source/target address. Without VFP/NEON, there is no way to load an entire cache line into registers with a single instruction on ARMv6/7. – FrankH. Dec 14 '12 at 11:33
  • Could you please explain how that makes my code thread-unsafe? Why does it require writable code? I thought the `adr` instruction was made for this specific purpose, so that I don't have to deal with PC-relative loads manually. LDR with a constant is also a PC-relative instruction; does that make code unsafe? You can load/store an entire cache line with ARM registers; it depends on the CPU's cache line size. I've worked with CPUs with a 32-byte cache line and also ones with a 128-byte cache line. It all depends on the target. – sgupta Dec 15 '12 at 11:04
  • That's because an address accessed in a PC-relative way (and within the range limits of `adr`) is still _unique_ within the address space of the process (hence: a global variable), and for most values of `PC`, a location within _code_ (which is normally mapped non-writeable, at least on Linux). – FrankH. Dec 18 '12 at 22:36
  • But you can still allocate 4 bytes in the .data section and store SP there. The only difference is that the assembler will allocate another 4 bytes in the .text section to make space for the constant pointer to that data (the pattern is sketched after this thread). – sgupta Dec 22 '12 at 04:39
  • @user1075375: You don't seem to understand the concept of _global variables_ - storage locations _common to all threads_ of a process. Using them without serialization/locking is always thread-unsafe. If your code relies on _one such specific location_, then no matter how you retrieve the pointer to it, your code cannot run in a multithreaded process. – FrankH. Dec 27 '12 at 11:37
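A sketch of the pattern debated above (labels and layout are illustrative only): parking SP in a word in .data so it can appear in LDM/STM register lists, which is exactly the process-wide storage location being objected to:

  .data
sp_save:
  .word   0                      @ one process-wide slot - not per-thread

  .text
  ldr     r12, =sp_save          @ pointer comes from the literal pool in .text
  str     sp, [r12]              @ park SP; it is now free for LDM/STM lists
  @ ... copy loop using SP as an extra data register ...
  ldr     r12, =sp_save
  ldr     sp, [r12]              @ restore SP before returning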
1

To really get the "fastest" possible ARM asm code, you will need to test different approaches on your system. As far as an ldm/stm loop goes, this one seems to work best for me:

  // Use non-conflicting register r12 to avoid waiting for r6 in pld

  pld [r6, #0]
  add r12, r6, #32

1:
  ldm r6!, {r0, r1, r2, r3, r4, r5, r8, r9}
  pld   [r12, #32]
  stm r10!, {r0, r1, r2, r3, r4, r5, r8, r9}
  subs r11, r11, #16
  ldm r6!, {r0, r1, r2, r3, r4, r5, r8, r9}
  pld   [r12, #64]
  stm r10!, {r0, r1, r2, r3, r4, r5, r8, r9}
  add r12, r6, #32
  bne 1b

The block above assumes that you have already set up r6 (source), r10 (destination), and r11 (count), and that the loop counts down r11 in terms of words, not bytes. I have tested this on a Cortex-A9 (iPad 2) and it seems to give quite good results on that processor. But be careful, because on a Cortex-A8 (iPhone 4) a NEON loop seems to be faster than ldm/stm, at least for larger copies.
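For reference, a hedged sketch of the setup that block assumes (symbol names are mine, not from the answer):

  @ Copy 256 bytes from src_buf to dst_buf using the loop above.
  ldr     r6,  =src_buf          @ source pointer
  ldr     r10, =dst_buf          @ destination pointer
  mov     r11, #64               @ 256 bytes = 64 words; the loop subtracts
                                 @ 16 words (one 64-byte pass) per iteration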

MoDJ