22

I've been tasked with generating a certain number of data-cache misses and instruction-cache misses. I've been able to handle the data-cache portion without issue.
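For reference, the data side is roughly along these lines (not my exact code; the point is just that the stride between accesses exceeds the cache-line size, which is 32 bytes in my case):

#include <stddef.h>

#define LINE_BYTES 32                          /* cache-line size on my target */
#define STRIDE (2 * LINE_BYTES / sizeof(int))  /* step of more than one line */

static int a[100];

int touch_lines(void) {
    int sum = 0;
    for (size_t i = 0; i < 100; i += STRIDE)
        sum += a[i];        /* each access lands on a different cache line */
    return sum;             /* returning keeps the loop from being optimized out */
}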

So I'm left with generating the instruction-cache misses. I do not have any idea what causes these. Can someone suggest a method of generating them?

I'm using GCC in Linux.

  • 1
    Branch prediction comes to mind: http://en.wikipedia.org/wiki/Branch_predictor – Ed S. Mar 20 '12 at 19:42
  • Would it be possible for you to share your code on how to generate data-cache misses? – Guru Prasad Aug 05 '13 at 15:05
  • Many architectures have an instruction to invalidate a specific cache line. You could use inline asm or intrinsics to execute the appropriate instruction. – Nate Eldredge Nov 04 '22 at 04:36

6 Answers

21

As people have explained, an instruction cache miss is conceptually the same as a data-cache miss - the instructions are not in the cache. This happens because the processor's program counter (PC) has jumped to a place which hasn't been loaded into the cache, or whose line has been flushed out because the cache filled up and that line was the one chosen for eviction (usually the least recently used).

It is a bit harder to generate enough code by hand to force an instruction miss than it is to force a data cache miss.

One way to get lots of code, for little effort, is to write a program which generates source code.

For example write a program to generate a function with a huge switch statement (in C) [Warning, untested]:

printf("void bigswitch(int n) {\n    switch (n) {");
for (int i=1; i<100000; ++i) {
    printf("        case %d: n += %d;\n", n, n+i/2);
}
printf("    }\n    return n;}\n");

Then you can call this from another function, and by choosing the parameter you can control how big a jump across the cache lines it takes.

A property of a switch statement is that the code can be forced to execute backwards, or in patterns, by choosing the parameter. So you can work with the pre-fetching and prediction mechanisms, or try to work against them.

The same technique could be applied to generate lots of functions too, to ensure the cache can be 'busted' at will. So you may have bigswitch001, bigswitch002, etc. You might call these using a dispatch switch which you also generate.

If you can make each function (approximately) some number of i-cache lines in size, and also generate more functions than will fit in cache, then the problem of generating instruction cache-misses becomes easier to control.

You can see exactly how big a function, an entire switch statement, or each leg of a switch statement is by dumping the assembler (using gcc -S), or objdump the .o file. So you could 'tune' the size of a function by adjusting the number of case: statements. You could also choose how many cache lines are hit, by judicious choice of the parameter to bigswitchNNN().
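For example (in the same untested spirit), a generator for many such functions, plus a table to call them through, might look like the following. NFUNCS and NCASES are placeholders to tune until each function spans however many i-cache lines you want, and the function-pointer table stands in for the generated dispatch switch mentioned above, since it is easier to call in arbitrary patterns:

#include <stdio.h>

int main(void) {
    enum { NFUNCS = 512, NCASES = 200 };   /* tune with gcc -S / objdump */
    for (int f = 0; f < NFUNCS; ++f) {
        printf("int bigswitch%03d(int n) {\n    switch (n) {\n", f);
        for (int i = 1; i < NCASES; ++i)
            printf("        case %d: n += %d;\n", i, i / 2);
        printf("    }\n    return n;\n}\n");
    }
    printf("int (*bigswitch_table[%d])(int) = {\n", NFUNCS);
    for (int f = 0; f < NFUNCS; ++f)
        printf("    bigswitch%03d,\n", f);
    printf("};\n");
    return 0;
}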

gbulmer
  • 1
    Better yet: `case %d: bigarray[%d] += %d;\n"` and thrash _both_ caches. – Mooing Duck Mar 21 '12 at 00:12
  • @Mooing Duck - yup, that would thrash both. The OP says "my task is to generate certain number of Data cache misses and Instruction cache misses", so it seemed better to keep them separate, so that it is easier to control. – gbulmer Mar 21 '12 at 00:15
  • 2
    On x86 one can fill an array with many `0x90`'s (`NOPs`) and a terminating `0xC3` (`RET`) and use a function pointer to execute that. It may be necessary to mark the underlying memory as executable prior to execution (`VirtualProtect()` on Windows, `mprotect()` on Linux). – Alexey Frunze Mar 21 '12 at 02:39
11

In addition to all the other ways mentioned here, another very reliable way to force an instruction cache miss is to have self-modifying code.

If you write to a page of code in memory (assuming you configured the OS to permit this), then of course the corresponding line of instruction cache immediately becomes invalid, and the processor is forced to refetch it.
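A minimal sketch of that on Linux with GCC, assuming x86-64 (0x90 is NOP and 0xC3 is RET there); note that a hardened kernel may refuse a mapping that is both writable and executable:

/* Map a writable+executable page, fill it with NOPs ending in RET,
 * and rewrite it between calls so its icache lines must be refetched. */
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 4096;
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    for (int iter = 0; iter < 100000; ++iter) {
        memset(buf, 0x90, len - 1);      /* NOP sled */
        buf[len - 1] = 0xC3;             /* RET */
        /* x86 keeps the icache coherent with stores; other targets need
         * an explicit flush, which GCC exposes as a builtin: */
        __builtin___clear_cache((char *)buf, (char *)(buf + len));
        /* jump into the freshly written code (object-to-function-pointer
         * cast is not strictly ISO C, but works on POSIX platforms) */
        ((void (*)(void))buf)();
    }
    munmap(buf, len);
    return 0;
}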

It is not branch prediction that causes an icache miss, by the way, but simply branching. You miss instruction cache whenever the processor tries to run an instruction that has not recently been run. Modern x86 is smart enough to prefetch instructions in sequence, so you are very unlikely to miss icache by just ordinary walking forward from one instruction to the next. But any branch (conditional or otherwise) jumps to a new address out of sequence. If the new instruction address hasn't been run recently, and isn't near the code you were already running, it is likely to be out of cache, and the processor must stop and wait for the instructions to come in from main RAM. This is exactly like data cache.

Some very modern processors (recent i7) are able to look at upcoming branches in code and start the icache prefetching the possible targets, but many cannot (video game consoles). Fetching data from main RAM to icache is totally different from the "instruction fetching" stage of the pipeline, which is what branch prediction is about.

"Instruction fetch" is part of the CPU's execution pipeline, and refers to bringing an opcode from icache into the CPU's execution unit, where it can start decoding and doing work. That is different from "instruction cache" fetching, which must happen many cycles earlier and involves the cache circuitry making a request to the main memory unit to send some bytes across the bus. The first is an interaction between two stages of the CPU pipeline. The second is an interaction between the pipeline and the memory cache and main RAM, which is a much more complicated piece of circuitry. The names are confusingly similar, but they're totally separate operations.

So one other way to cause instruction cache misses would be to write (or generate) lots of really big functions, so that your code segment is huge. Then call wildly from one function to another, so that from the CPU's point of view you are doing crazy GOTOs all over memory.

Crashworks
  • 2
    @MooingDuck Not at all. You get a function pointer, cast it to a char *, and write some instructions through it. Totally "unspecified behavior", of course, but it works anyway if you know what you're doing. I used to do this to build DMA chains and vertex shaders on the fly. – Crashworks Mar 21 '12 at 00:20
  • You'd have to know the machine code of your processor, and not make any mistakes. It can be done, but it's quite hard. – Mooing Duck Mar 21 '12 at 00:22
  • Would simply writing the same data to the code memory (i.e., copy the code to a memory buffer, then copy that buffer back) be enough to force the icache to be invalidated? I imagine the cache control probably isn't sophisticated enough to detect that the writes are actually changing anything (but I could well be guessing wrong here). – Michael Burr Mar 21 '12 at 03:26
  • @MichaelBurr That's a good idea! You're probably right. I'm only guessing, though; I don't know much about the internal circuitry of Intel's icache. – Crashworks Mar 21 '12 at 03:32
  • @Michael Burr, I'm very confused by your suggestion. Could you let me know how we can copy some data to code memory? Are there any built-in functions in C to access code memory? – bobby Mar 28 '12 at 13:58
  • @sreeharivallu: I'm not sure of the details of how you'd make code memory writable on Linux. My comment was a suggestion that assumed the code in question was writable and was merely a way to make it so you wouldn't need to have any knowledge of the assembly opcodes in question since you'd just be copying what the compiler generated. Asking how to make a range of code writable on Linux for this test purpose would probably be a good SO question itself. – Michael Burr Mar 28 '12 at 15:20
  • @Crashworks. How do you do it . I get a segfault when trying out code like this ( Sorry, but I know the indentation is gone for a toss ) : int foo() { int i; for ( i = 0 ; i < 100; ++i ) i=i+i; } int main() { int (*p) (); char buff[1000], j; p = buff; for ( j = 0 ; j < 10000 ; ++j ) { memcpy( p, foo, 100 ); memcpy( foo, p, 100 ); foo(); } } – vrk001 May 14 '12 at 18:13
  • @Crashworks: On any secure platform, modifying existing code mapped into memory is difficult or impossible, and so is making a new block of memory executable. It's likely to at least require `mprotect` or equivalent... – R.. GitHub STOP HELPING ICE May 14 '12 at 18:26
4

Your project requires an awareness of your target system’s cache hardware, including but not limited to its cache size (the overall size of the cache), cache line size (smallest cacheable entity), associativity, and write & replacement policies. Any really good algorithm designed to test a cache’s performance must take all of this into account, as there is no single general algorithm that would effectively test all cache configurations. You may, however, be able to design an effective parameterized test-routine generator, which produces a suitable test routine given enough of the particulars about a given target’s cache architecture. Despite this, I think my suggestion below is a pretty good general-case test, but first I wanted to mention:

You mention that you have a working data cache test that uses a “large integer array a[100].... [which accesses] the elements in such a way that the distance between the two elements is greater than the cache-line size(32 bytes in my case).” I am curious how you’ve determined that your test algorithm works and how you’ve determined how many data cache misses are a result of your algorithm, as opposed to misses caused by other stimuli. Indeed, with a test array of 100*sizeof(int), your test data area is only 400 bytes long on most general-purpose platforms today (perhaps 800 bytes if you’re on a 64-bit platform, or 200 bytes if you’re using a 16-bit platform). For the vast majority of cache architectures, that entire test array will fit into the cache many times over, meaning that randomized accesses to the array will bring the entire array into the cache in somewhere around (400/cache_line_size)*2 accesses, and every access after that will be a cache hit regardless of how you order your accesses, unless some hardware or OS tick timer interrupt pops in and flushes out some or all of your cached data.

With regard to the instruction cache: Others have suggested using a large switch()-case statement or function calls to functions in disparate locations, neither of which would be predictably effective without carefully (and I mean CAREFULLY) designing the size of the code in the respective case branches or the locations & sizes of the disparately-located functions. The reason for this is that bytes throughout memory “fold into” (technically, “alias one another” in) the cache in a totally predictable pattern. If you carefully control the number of instructions in each branch of a switch()-case statement, you might be able to get somewhere with your test, but if you just throw a large indiscriminate amount of instructions in each, you have no idea how they will fold into the cache, or which cases of the switch()-case statement alias one another, so you can't use them to evict each other from the cache in a controlled way.
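To make that folding concrete: in a 32K, 4-way set-associative cache with 32-byte lines there are 32K/32 = 1024 lines grouped into 1024/4 = 256 sets, so any two addresses that lie a multiple of 256 × 32 = 8K apart map to the same set and can evict one another.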

I’m guessing you’re not overly familiar with assembly code, but you’ve gotta believe me here, this project is screaming for it. Trust me, I’m not one to use assembly code where it’s not called for, and I strongly prefer programming in OO C++, using STL & polymorphic ADT hierarchies whenever possible. But in your case, there’s really no other foolproof way of doing it, and assembly will give you the absolute control over code block sizes that you really need in order to be able to effectively generate specified cache hit ratios. You wouldn’t have to become an assembly expert, and you probably wouldn't even need to learn the instructions & structure required to implement a C-language prologue & epilogue (Google for “C-callable assembly function”). You write some extern “C” function prototypes for your assembly functions, and away you go. If you do care to learn some assembly, the more of the test logic you put in the assembly functions, the less of a “Heisenberg effect” you impose on your test, since you can carefully control where the test control instructions go (and thus their effect on the instruction cache). But for the bulk of your test code, you can just use a bunch of “nop” instructions (the instruction cache doesn’t really care what instructions it contains), and probably just put your processor's "return" instruction at the bottom of each block of code.

Now let’s say your instruction cache is 32K (pretty darn small by today’s standards, but perhaps still common in many embedded systems). If your cache is 4-way associative, you can create eight separate content-identical 8K assembly functions (which you hopefully noticed is 64K worth of code, twice the size of the cache), the bulk of which is just a bunch of NOP instructions. You make them all fall one after the other in memory (generally by simply defining each one after the other in the source file). Then you call them from a test control function using carefully computed sequences to generate any cache hit ratio you desire (with rather coarse granularity, since the functions are each a full 8K long).

If you call the 1st, 2nd, 3rd, and 4th functions one after another, you know you’ve filled the entire cache with those test functions’ code. Calling any of those again at this point will not result in an instruction cache miss (with the exception of lines evicted by the test control function’s own instructions), but calling any of the others (the 5th, 6th, 7th, or 8th; let’s just choose the 5th) will evict one of the others (though which one is evicted depends on your cache’s replacement policy). At this point, the only one you can call and know you WON’T evict another is the one you just called (the 5th one), and the only ones you can call and know you WILL evict another are the ones you haven’t yet called (the 6th, 7th, or 8th).

To make this easier, just maintain a static array sized the same as the number of test functions you have. To trigger an eviction, call the function at the end of the array & move its pointer to the top of the array, shifting the others down. To NOT trigger an eviction, call the one you most recently called (the one at the top of the array; be sure NOT to shift the others down in this case!).

Do some variations on this (perhaps make 16 separate 4K assembly functions) if you need finer granularity. Of course all of this depends on the test control logic size being insignificant in comparison to the size of each associative “way” of the cache; for more positive control, you could put the test control logic in the test functions themselves, but for perfect control you’d have to design the control logic entirely without internal branching (only branching at the end of each assembly function). I think I’ll stop here, since that’s probably over-complicating things.

Off-the-cuff & not tested, the entirety of one of the assembly functions for x86 might look like this:

myAsmFunc1:
   nop
   nop
   nop  # ...exactly enough NOPs to fill one "way" of the cache
   nop  # minus the size of the "ret" instruction (1 byte on x86)
   .
   .
   .
   nop
   ret  # return to the caller

For PowerPC it might look like this (also untested):

myAsmFunc1:
   nop
   nop
   nop   # ...exactly enough NOPs to fill one "way" of the cache
   .     # minus 4 bytes for the "blr" instruction.  Note that
   .     # on PPC, all instructions (including NOP) are 4 bytes.
   .
   nop
   blr   # return to the caller

In both cases, the C++ and C prototypes for calling these functions would be:

extern "C" void myAsmFunc1();    // Prototype for calling from C++ code
void myAsmFunc1(void);           /* Prototype for calling from C code */

Depending on your compiler, you might need an underscore in front of the function name in the assembly code itself (but not in your C++/C function prototype).
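To make the control scheme described above concrete, here is an untested sketch of the test-control logic in C, assuming the eight 8K assembly functions with the names used above (myAsmFunc1 through myAsmFunc8):

extern void myAsmFunc1(void), myAsmFunc2(void), myAsmFunc3(void),
            myAsmFunc4(void), myAsmFunc5(void), myAsmFunc6(void),
            myAsmFunc7(void), myAsmFunc8(void);

#define NFUNCS 8

/* Most-recently-called function lives at index 0. */
static void (*order[NFUNCS])(void) = {
    myAsmFunc1, myAsmFunc2, myAsmFunc3, myAsmFunc4,
    myAsmFunc5, myAsmFunc6, myAsmFunc7, myAsmFunc8,
};

/* Call the least-recently-called function and rotate it to the front;
 * once the cache is full this forces an eviction (an icache miss). */
static void call_missing(void) {
    void (*f)(void) = order[NFUNCS - 1];
    for (int i = NFUNCS - 1; i > 0; --i)
        order[i] = order[i - 1];
    order[0] = f;
    f();
}

/* Call the most-recently-called function again; this should hit. */
static void call_hitting(void) {
    order[0]();
}

Mixing calls to call_hitting() and call_missing() in the proportion you want then approximates the desired hit ratio.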

phonetagger
  • I don't think the statements in switch statements should be big or arbitrary. If VERY simple statements are used in each `case:`, for example `n += 16bitint;` then the size of every `case:` can be made regular and predictable, and the size of functions controlled. – gbulmer Mar 21 '12 at 03:15
  • @gbulmer: OK, I agree... at least in principle. I guess the test functions wouldn't have to be huge; you'd just have to make at least n+1 such functions or case clauses, n being the number of associative "ways" in the cache. But to be effective, they would have to be spaced in memory PERFECTLY so that they alias one another in the cache. That would be difficult to do in "high"-level source code (C or C++) since you don't really have much control over the placement of any instructions in a compiled high-level-sourcecode function. Or just do the sure & easy way as I described above. – phonetagger Mar 21 '12 at 12:27
  • please don't misunderstand. I am not suggesting this stuff is trivial, but I do not think assembler is necessary, and to some extent desirable for these sorts of problems. I also think it is really great for folks to learn how to use their skills in unconventional ways; IMHO, folks can quickly improve their depth of understanding, which can be thrilling as well as get the job done. Let me ponder. I will try to respond with a technique which leverages knowledge of high-level tools. I think I can do it, but I am very hungry :-) – gbulmer Mar 21 '12 at 14:14
  • @gbulmer - I'm not sure why it's taking you so long to get food. :) I thought about this & I'm ready to take back my "I agree... at least in principle. I guess the test functions wouldn't have to be huge..." statement. The purpose of writing in higher level languages is to abstract the lower level details to more quickly & easily solve problems. If writing in higher level languages makes a solution more complicated or less correct, there is no glory or anything worth learning by doing so. The asm functions described above cannot be made simpler or more portable in a higher-level language. – phonetagger Mar 24 '12 at 23:23
  • @gbulmer - Since asm programming is so rare nowadays, IM(notSoHumble)O, a programmer stands a better chance of improving their depth of understanding about computers by doing a small amount of assembly... a little can go a long way, esp in understanding things like why float operations sometimes seem rounded incorrectly or cannot precisely represent decimal numbers, or why comparing against constant zero (in compiled/JIT languages) is less costly than comparing against any other constant or a variable, or why power-of-2-sized buffers can be handled more efficiently than any other size buffer. – phonetagger Mar 24 '12 at 23:24
  • Now back to the ICACHE testing stuff: There's no good reason to use a switch()-case construct or bunch of dummy functions, as you'd have to go through multiple iterations of (compile, examine resulting code size of each case, adjust the code, repeat) until you got blocks of code that were the right size to be able to evict each other when called in proper sequences. There's just no good reason to do it that way, as it would be more difficult than simply coding up a bunch of properly-sized asm functions right from the start, since you have absolute control over the size of the asm functions. – phonetagger Mar 24 '12 at 23:24
  • In my Mar 21 12:27 comment, I conceded that small test funcs spaced perfectly in memory could be used to evict one another. That would allow you to cause icache misses, but it wouldn't help in the goal (if there is such a goal) of approximating a specified cache hit ratio. To do so, you have to control the ratio of executed instructions that cause evictions, and your test control function's instrs count as executed instrs. The higher the ratio of "test function" to "test control function" instrs, the higher the accuracy of your controlled cache hit ratio. Thus larger asm funcs are better. – phonetagger Mar 24 '12 at 23:50
0

I'm doing similar experiments with an ARM M7 CPU as I investigate the capabilities of the Playdate hardware and try to confirm the instruction cache size and behaviour.

I did something similar to @phonetagger's answer, using inline assembly to create functions of known size. I thought it best to generate lots of small functions, because large functions without branches will allow the branch prediction logic to work flawlessly and preload the instruction cache very effectively.

My current test scenario is based on a table of 256 function pointers, each pointing to a function that is 64 bytes long, or two cache lines (in the case of the ARM M7). In total, the 256 functions occupy 256 x 64 = 16K of memory, which is four times the 4K instruction cache size - based on the data sheet that I think matches the part in the Playdate, which also indicates that the instruction cache is 2-way associative.

My testing strategy is to repeatedly run functions that add up to a known amount of memory, and vary the amount of memory covered to assess timing when everything fits into the cache and when it doesn't. So for example, to test 2K of instruction memory I need to run 2048 / 64 = 32 of the functions, and so my code would be:

int n = 32;
for (int calls = 0; calls < 100000; calls++)
{
    functable[calls%n]();
}

I do 100,000 calls to ensure it takes long enough to be able to get consistent timings. Obviously the loop logic is also being run, but that should only consume a couple of cache lines so it shouldn't throw the results off too much.
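Spelled out, the whole measurement is roughly the following sketch - current_time_us() is a hypothetical stand-in for whatever timer the platform provides, and functable is the generated 256-entry table:

extern void (*functable[256])(void);    /* the generated 64-byte functions */
extern unsigned current_time_us(void);  /* hypothetical timer, not a real API */

void sweep(void) {
    for (int n = 1; n <= 256; ++n) {    /* 64 bytes up to 16K of code */
        unsigned start = current_time_us();
        for (int calls = 0; calls < 100000; calls++)
            functable[calls % n]();
        unsigned elapsed = current_time_us() - start;
        (void)elapsed;                  /* record (n * 64, elapsed) */
    }
}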

I repeat the above test for n running from 1 to 256, thus testing 64 bytes up to 16K of instructions, and time how long it takes. Here are the results:

[plot: time vs code size]

I'm puzzled by a few things:

  1. Why is there an early spike in time taken, up to and a little beyond the 1K mark?
  2. Why does performance start to drop off at the 8K mark, instead of the 4K mark like I expected from the 4K cache size?
  3. Why isn't there a greater drop in performance? Performance is more than half as good when apparently missing the cache. From messing around with data caches I expected the cache load time to be a more significant hit than that.

All my functions are laid out linearly in memory, so I wondered if the CPU was prefetching subsequent functions, so I tried calling the functions in random order. I used insertion sort to randomize the first n entries in the function table before starting the timing loop. The results were very similar, though surprisingly the early spike in time taken, while still present, was lower than in the linear-order case.

[plot: time vs code size, random call order]
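For reference, a shuffle of the first n table entries could look like the following - this is a Fisher-Yates shuffle, not the insertion-sort-based routine I actually used, but the effect is the same:

#include <stdlib.h>

/* Shuffle the first n entries of the function table before timing. */
static void shuffle(void (*table[])(void), int n) {
    for (int i = n - 1; i > 0; --i) {
        int j = rand() % (i + 1);
        void (*tmp)(void) = table[i];
        table[i] = table[j];
        table[j] = tmp;
    }
}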

In summary, I think my procedure is fairly sound, but I'm puzzled by the results and would appreciate additional insight.

yoyo
0

For instruction cache misses, you need to execute code segments that are far apart. Splitting your logic among multiple function calls would be one way to do that.
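A hedged sketch of that idea with GCC on x86 - the NOP padding (via the GAS .rept directive) is an assumption to tune against your actual cache size:

#include <stdio.h>

/* Two functions padded to roughly 32K each, so together they exceed a
 * typical 32K L1 icache; bouncing between them forces refetches. */
__attribute__((noinline)) static void far_a(void) {
    asm volatile(".rept 32768\n\tnop\n\t.endr");
}

__attribute__((noinline)) static void far_b(void) {
    asm volatile(".rept 32768\n\tnop\n\t.endr");
}

int main(void) {
    for (long i = 0; i < 10000; ++i) {
        far_a();
        far_b();
    }
    puts("done");
    return 0;
}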

AShelly
-1

A chain of if/else branches on unpredictable conditions (e.g. input or randomly generated data), with enough instructions in both the if body and the else body that each is larger than a cache line.
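A sketch of that, with rand() standing in for the unpredictable input; each arm needs enough statements that its code exceeds a cache line, which you would confirm with objdump:

#include <stdlib.h>

volatile int sink;   /* volatile keeps the compiler from folding the branches away */

void branchy(void) {
    if (rand() & 1) {
        sink += 1; sink += 2; sink += 3; sink += 4;
        sink += 5; sink += 6; sink += 7; sink += 8;
    } else {
        sink -= 1; sink -= 2; sink -= 3; sink -= 4;
        sink -= 5; sink -= 6; sink -= 7; sink -= 8;
    }
}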

selalerer