
I'm programming an embedded 32-bit system with a 32 kbyte 8-way set associative L2 instruction cache. To avoid cache thrashing, we align functions so that the text of a set of functions called at a high frequency (think interrupt code) ends up in separate cache sets. We do this by inserting dummy functions as needed, e.g.

void high_freq1(void)
{
   ...
}

void dummy(void)
{
   __asm__(/* Silly opcodes to fill ~100 to ~1000 bytes of text segment */);
}

void high_freq2(void)
{
   ...
}

This strikes me as ugly and suboptimal. What I'd like to do is

  • avoid __asm__ entirely and use pure C89 (maybe C99)
  • find a way to create the needed dummy() spacer that the GCC optimizer does not touch
  • the size of the dummy() spacer should be configurable as a multiple of 4 bytes. Typical spacers are 260 to 1000 bytes.
  • should be feasible for a set of about 50 functions out of a total of 500 functions

I'm also willing to explore entirely new techniques of placing a set of selected functions in a way so they aren't mapped to the same cache lines. Can a linker script do this?
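
To make the address-to-set mapping concrete, here is a minimal sketch of the arithmetic. It assumes 32-byte cache lines and a cache indexed directly by address bits; the macro and function names are hypothetical and only serve the illustration.

```c
#include <stdint.h>

/* Assumed geometry: 32 KiB cache, 8 ways, 32-byte lines. */
#define CACHE_SIZE (32u * 1024u)
#define CACHE_WAYS 8u
#define LINE_SIZE  32u
#define NUM_SETS   (CACHE_SIZE / (CACHE_WAYS * LINE_SIZE)) /* 128 sets */
#define WAY_SIZE   (NUM_SETS * LINE_SIZE)                  /* 4096 bytes */

/* Index of the cache set that the byte at addr maps to. */
unsigned cache_set(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/* Number of distinct sets touched by size bytes of code starting at start. */
unsigned sets_touched(uintptr_t start, uintptr_t size)
{
    uintptr_t first_line, last_line, lines;

    if (size == 0)
        return 0;
    first_line = start / LINE_SIZE;
    last_line = (start + size - 1) / LINE_SIZE;
    lines = last_line - first_line + 1;
    /* Once a region spans a full way, every set is touched. */
    return (unsigned)(lines < NUM_SETS ? lines : NUM_SETS);
}
```

Under these assumptions, two addresses that differ by a multiple of 4096 bytes land in the same set, which is why the spacer sizes matter.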

Potherca
Jens
  • Why do you expect this to help? For the most part, it can reduce the number of cache lines a function is in by at most one line per function. But it comes at the cost of having useless bytes in cache lines due to the dummy functions. If `high_freq2` is executed shortly after `high_freq1`, it might benefit from having its initial bytes already in cache, in the last line of `high_freq1`. And, if you are experiencing thrashing, increasing code size may make it worse. Are you sure you are thrashing the instruction cache? – Eric Postpischil Feb 10 '14 at 15:55
  • @EricPostpischil Yes, we have proven this helps. Suppose you have several 100kHz routines mapped to the same cache line, each executing in turn. They will cause 100% cache misses. If we place the functions so they are mapped to *different* cache lines, then we have an almost 100% hit rate for those high frequency functions. Since we know exactly how the 8-way set associative cache works, i.e. which parts of the address space are mapped on which line, we can optimize by moving the "working set" to the appropriate addresses. – Jens Feb 10 '14 at 16:12
  • Can you write your own linker script? A linker script tells the linker how to pack up your program. You can specify multiple text sections, where they go, and so on. – Jimbo Feb 10 '14 at 16:15
  • @Jimbo Yes, I believe we could. It's a vxWorks 5.4 (ancient, I know) application. – Jens Feb 10 '14 at 16:22
  • @Jens: How can routines mapped to the same cache line cause any misses? If A and B are in the same cache line, and A is in cache, then B is in cache. Do you mean cache sets, not cache lines? – Eric Postpischil Feb 10 '14 at 16:33
  • Have you considered using the GCC `hot` function attribute? it should improve cache locality somewhat. – Hasturkun Feb 11 '14 at 15:59
  • @EricPostpischil Yes, cache sets. Critical functions are generally larger than a single cache line of 32 bytes. What we try to obtain is an even population of cache sets less than or equal to the associativity (<= 8). – Jens Feb 12 '14 at 13:42
  • @Jens: Simply placing functions consecutively minimizes the number of cache sets straddled. This is because consecutive use of addresses proceeds through each cache set before it repeats one. At most, you would need to align the first of them to a cache line, to avoid an unnecessary extra fragment. – Eric Postpischil Feb 12 '14 at 13:59

4 Answers


Use GCC's __attribute__(( aligned(size) )).

Or, pass -falign-functions=n on your GCC command line.
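
As a rough sketch of the attribute form, assuming a 32-byte cache line (the function names here are hypothetical):

```c
#include <stdint.h>

#define LINE_SIZE 32 /* assumed cache line size in bytes */

/* Hypothetical hot function; GCC starts it on a 32-byte boundary. */
__attribute__((aligned(LINE_SIZE)))
int high_freq1(int x)
{
    return x + 1;
}

/* Returns 1 if fn starts on a cache-line boundary.
 * Casting a function pointer to an integer is a common extension,
 * not strict ISO C; on targets where function pointers carry extra
 * bits or point to descriptors, this check does not apply as is. */
int starts_on_line(int (*fn)(int))
{
    return ((uintptr_t)fn % LINE_SIZE) == 0;
}
```

Note that this only controls where each function starts; it does not by itself control the distance between two functions.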

Jonathon Reinhart
  • I don't think this takes cache line mapping into account, only alignment of the function itself. What I need is a way to express "leave 4*N bytes between the end of function a() and the start of function b()". – Jens Feb 10 '14 at 15:58
  • Specifying an alignment will make sure that the function start address will live at an address divisible by the cache line size, which is precisely what you asked for. Other functions that have no such requirement may share the cache line, but that is not necessarily a bad thing. – Simon Richter Feb 10 '14 at 16:11
  • Then I was imprecise. Again, I don't need the address aligned divisible by N, I need it affected in a way so that different functions are properly aligned *with respect to address space <-> cache line mapping*. This is a bit of rocket science, I admit. :-) – Jens Feb 10 '14 at 16:19
  • Cache lines start on aligned addresses. Hence, aligning to a multiple of the cache line size will ensure that the function starts at the beginning of a cache line. – Simon Richter Feb 12 '14 at 14:21

Maybe linker scripts are the way to go. The GNU linker can use these, I think. I've used LD files for the AVR and on MQX, both of which were using GCC-based compilers, so this might help.

You can define your memory sections and what goes where. Each time I come to write one, it's been so long since the last that I have to read up again.

Have a search for SVR3-style command files to gen up.

DISCLAIMER: The following example is for a very specific compiler, but the SVR3-like format is pretty general. You'll have to read up for your system.

For example you can use commands like...

ApplicationStart = 0x...;
MemoryBlockSize = 0x...;
ApplicationDataSize  = 0x...;
ApplicationLength    = MemoryBlockSize - ApplicationDataSize;

MEMORY {
    RAM: ORIGIN = 0x...                LENGTH = 1M
    ROM: ORIGIN = ApplicationStart     LENGTH = ApplicationLength   
}

This defines two memory regions (RAM and ROM) for the linker. Then you can say things like

SECTIONS
{
    GROUP :
    {       
        .text :
        {
            * (.text)
            * (.init , '.init$*')
            * (.fini , '.fini$*')
        }

        .my_special_text ALIGN(32): 
        {
            * (.my_special_text)
        } 

        .initdat ALIGN(4):
        // Blah blah
    } > ROM
    // SNIP
}

The SECTIONS command tells the linker how to map input sections into output sections, and how to place the output sections in memory. Here we're saying what goes into the ROM region, which we defined in the MEMORY definition above. The bit possibly of interest to you is .my_special_text. In your code you can then do things like...

__attribute__ ((section(".my_special_text")))
void MySpecialFunction(...)
{
    ....
}

The linker will put any function preceded by the __attribute__ statement into the .my_special_text section. In the above example this is placed into ROM on the next 32-byte-aligned boundary after the .text section, but you can put it anywhere you like. So you could make a few sections, one for each of the functions you describe, and make sure the addresses won't cause clashes.

You can get the size and memory location of the section using linker-defined variables of the form

extern char _fsection_name[]; // Set to the address of the start of section_name
extern char _esection_name[]; // Set to the first byte following section_name

So for this example...

extern char _fmy_special_text[]; // Set to the address of the start of .my_special_text
extern char _emy_special_text[]; // Set to the first byte following .my_special_text
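
As a sketch of what you might do with those two symbols: the helpers below take the start and end addresses as parameters, so they do not depend on the toolchain-specific symbol names. The 4096-byte figure assumes a 32 kbyte 8-way cache, and the function names are hypothetical.

```c
#include <stddef.h>

#define WAY_SIZE 4096u /* assumed: 32 KiB cache / 8 ways */

/* Size in bytes of a section delimited by [start, end). */
size_t section_size(const char *start, const char *end)
{
    return (size_t)(end - start);
}

/* Returns 1 if the section is no larger than one cache way.
 * If it also starts on a cache-line boundary, no two of its
 * lines map to the same cache set. */
int fits_in_one_way(const char *start, const char *end)
{
    return section_size(start, end) <= WAY_SIZE;
}
```

With the symbols above, a call could look like fits_in_one_way(_fmy_special_text, _emy_special_text).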
Jens
Jimbo

If you are willing to expend some effort, you can use

__attribute__((section(".text.hotpath.a")))

to place the function into a separate section, and then in a custom linker script explicitly place the functions.

This gives you a bit more fine-grained control than simply asking for the functions to be aligned, but requires more hand-holding.

Example, assuming that you want to lock 4KiB into cache:

SECTIONS {
    .text.hotpath.one BLOCK(0x1000) : {
        *(.text.hotpath.a)
        *(.text.hotpath.b)
    }
}
ASSERT(SIZEOF(.text.hotpath.one) <= 0x1000, "Hot Path functions do not fit into 4KiB")

This will make sure the hot path functions a and b are next to each other and both fit into the same block of 4 KiB that is aligned on a 4 KiB boundary, so you can simply lock that page into the cache; if the code doesn't fit, you get an error.

You can even use

NOCROSSREFS(.text.hotpath.one .text)

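to make the linker report an error on any reference between the hot path section and the rest of .text, which forbids hot path functions from calling other functions.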
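
On the C side, this might look like the following sketch for an ELF target. The function names and bodies are hypothetical; only the section names are taken from the script above.

```c
/* Hypothetical hot-path functions, each placed in its own input
 * section so the linker script can position them explicitly. */
__attribute__((section(".text.hotpath.a")))
int hot_a(int x)
{
    return x * 2;
}

__attribute__((section(".text.hotpath.b")))
int hot_b(int x)
{
    return x + 3;
}
```

Without a custom linker script, a typical default GNU ld script simply merges these .text.* sections into .text, so the code still links and runs; the explicit placement only takes effect once the script names them.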

Simon Richter

Assuming you're using GCC and GAS, this may be a simple solution for you:

void high_freq1(void)
{
   ...
}
asm(".org .+288"); /* Advance location by 288 bytes */
void high_freq2(void)
{
   ...
}

You could, possibly, even use it to set absolute locations for the functions rather than using relative increments in address, which would insulate you from consequences due to the functions changing in size when/if you modify them.

It's not pure C89, for sure, but it may be less ugly than using dummy functions. :)

(Then again, it should be mentioned that linker scripts aren't standardized either.)

EDIT: As noted in the comments, it seems to be important to pass the -fno-toplevel-reorder flag to GCC in this case.
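
To work out the increment to use, a small helper along these lines may be useful. It is a sketch that assumes 32-byte lines and 128 sets (32 kbyte, 8-way); pad_to_set is a hypothetical name.

```c
#include <stdint.h>

#define LINE_SIZE 32u  /* assumed cache line size */
#define NUM_SETS  128u /* assumed: 32 KiB / (8 ways * 32 bytes) */
#define WAY_SIZE  (LINE_SIZE * NUM_SETS)

/* Bytes to skip from end_addr (the first free byte after the
 * previous function) so that the next function starts at the first
 * byte of target_set. The result is a multiple of 4 whenever
 * end_addr is 4-byte aligned. */
uint32_t pad_to_set(uint32_t end_addr, uint32_t target_set)
{
    uint32_t offset_in_way = end_addr % WAY_SIZE;
    uint32_t target_offset = (target_set % NUM_SETS) * LINE_SIZE;

    return (target_offset + WAY_SIZE - offset_in_way) % WAY_SIZE;
}
```

For example, if the previous function ends at 0x10E0 and the next one should start in set 16, pad_to_set(0x10E0, 16) gives 288, the increment used above.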

Dolda2000
  • A quick test with a short main() followed by the asm you suggested showed that high_freq was placed at 288 instead of .+288, as if `. = 0` everywhere it is used. – Jens Feb 11 '14 at 09:05
  • @Jens: That's strange; it works for me. Do you have some weird/unusual `as`? – Dolda2000 Feb 11 '14 at 09:48
  • No, a powerpc-gcc 4.8 toolchain with GNU assembler version 2.23.2 (powerpc-wrs-vxworks). I noticed the behavior changes with the use of `-O` optimization... – Jens Feb 11 '14 at 11:14
  • @Jens: Huh, indeed. I do find, however, that that is not because `.` or `.org` stop working, but because GCC does some top-level function reordering when `-O` is turned on. It can be fixed by using `-fno-toplevel-reorder`. Which, it seems, you should be using anyway, considering your toplevel order matters. :) – Dolda2000 Feb 11 '14 at 17:41