
I'm working with a SPARC V8 processor which is connected to memory with a 32-bit data bus. From the SPARC V8 architecture manual I have learned that there are instructions to load / store a single 32-bit register (word), but also instructions to load / store a double word into / from two registers atomically. Are the double word instructions somehow faster than the single word instructions on my machine? What does it depend on, apart from the data bus width?
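
For reference, this is how the two forms look side by side (as I read the V8 manual, ldd/std need a doubleword-aligned address and an even-numbered destination register, since they operate on a register pair):

ld      [%o0], %o2          ! load one 32-bit word into %o2
st      %o2, [%o1]          ! store one 32-bit word
ldd     [%o0], %o2          ! load 64 bits into the pair %o2/%o3 (address must be 8-byte aligned, rd even)
std     %o2, [%o1]          ! store the pair %o2/%o3 as one 64-bit access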

Further, I discovered an optimized memcpy implementation in the Linux kernel sources, which copies an aligned chunk as follows:

#define MOVE_BIGALIGNCHUNK(...) \
ldd     [%src + (offset) + 0x00], %t0; \
ldd     [%src + (offset) + 0x08], %t2; \
ldd     [%src + (offset) + 0x10], %t4; \
ldd     [%src + (offset) + 0x18], %t6; \
std     %t0, [%dst + (offset) + 0x00]; \
std     %t2, [%dst + (offset) + 0x08]; \
std     %t4, [%dst + (offset) + 0x10]; \
std     %t6, [%dst + (offset) + 0x18]; 

Is there any benefit from grouping loads and stores together? Just curious. Thanks!

Update: I'm using Gaisler's LEON3 implementation and I'm on bare metal. ldd and std are implemented and do not trap. I measured that copying a big chunk of data with ldd and std is faster by a factor of ~1.5. There are indeed data and instruction caches present, and it makes sense to me that they can speed up double word operations. I also agree that the overhead must somehow be reduced when fetching two consecutive words from memory. Thanks all for your comments.
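
Roughly, a minimal pair of copy loops for this kind of comparison looks like the following (just a sketch with arbitrary register choices; it assumes the buffers are 8-byte aligned, the length in %o2 is a non-zero multiple of 8, and it leaves out the timer handling on my LEON3 setup):

word_loop:
        ld      [%o0], %o4          ! one word per iteration
        st      %o4, [%o1]
        add     %o0, 4, %o0
        subcc   %o2, 4, %o2         ! %o2 = remaining byte count
        bne     word_loop
         add    %o1, 4, %o1         ! delay slot: advance destination

dword_loop:
        ldd     [%o0], %o4          ! two words per iteration into %o4/%o5
        std     %o4, [%o1]
        add     %o0, 8, %o0
        subcc   %o2, 8, %o2
        bne     dword_loop
         add    %o1, 8, %o1         ! delay slot: advance destination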

  • The best way to know this would be to try that for a reasonably large volume of data and measure the time. – sharptooth Aug 14 '13 at 08:57
  • If `ldd` is _not implemented_ by your specific CPU, it'll be _emulated_ via the OS-provided trap handler. That'd ultimately make it slower. If you're on Solaris, `trapstat -l` will show you whether the OS set up `unimp-ldd` / `unimp-std` trap handlers; if they are present, run your code and see whether they get hit ... if so, use different code ... – FrankH. Aug 14 '13 at 13:03
  • Could you tell us which implementation you are using? Typically implementations will have instruction and data caches connected to the core, with a line size usually greater than a double word. That means a double-word load fetches 2 words at a time from the cache (as the on-chip bus connecting core to cache may be larger than 32 bits even though the external memory bus is 32 bits). Also, logically, a double word operation should be more efficient than two single word ops, as time for fetching and decoding these instructions is saved. But practically it may have little effect if the code is small. – Neha Karanjkar Sep 26 '13 at 11:54

0 Answers