Your project requires an awareness of your target system’s cache hardware, including (but not limited to) its cache size (the overall size of the cache), cache line size (the smallest cacheable unit), associativity, and write & replacement policies. Any really good algorithm designed to test a cache’s performance must take all of this into account; there is no single general algorithm that will effectively test all cache configurations. You may, however, be able to design an effective parameterized test routine generator, which would produce a suitable test routine given enough of the particulars about a given target’s cache architecture. Despite this, I think my suggestion below is a pretty good general-case test, but first I wanted to mention:
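(As a sketch of what “enough of the particulars” might mean, such a generator would need to consume at least something like the following. These names are purely illustrative, not from any real API:)

/* Hypothetical parameter set for a cache-test-routine generator.
   All names are illustrative, not from any real library. */
struct cache_params {
    unsigned size_bytes;      /* total cache size, e.g. 32768 */
    unsigned line_bytes;      /* cache line size, e.g. 32     */
    unsigned associativity;   /* number of ways, e.g. 4       */
    enum { WRITE_THROUGH, WRITE_BACK } write_policy;
    enum { REPL_LRU, REPL_FIFO, REPL_RANDOM } replacement_policy;
};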
You mention that you have a working data cache test that uses a “large integer array a[100].... [which accesses] the elements in such a way that the distance between the two elements is greater than the cache-line size(32 bytes in my case).” I am curious how you’ve verified that your test algorithm works, and how you’ve determined how many data cache misses are caused by your algorithm as opposed to by other stimuli. With a test array of 100*sizeof(int), your test data area is only 400 bytes long on most general-purpose platforms today (perhaps 800 bytes if you’re on a 64-bit platform with 8-byte ints, or 200 bytes if you’re on a 16-bit platform). For the vast majority of cache architectures, that entire test array fits into the cache many times over, meaning that randomized accesses to the array will bring the whole thing into the cache within somewhere around (400/cache_line_size)*2 accesses, and every access after that will be a cache hit regardless of how you order your accesses, unless some hardware or OS tick timer interrupt pops in and evicts some or all of your cached data.
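By way of contrast, a data area that genuinely overflows the cache has to be larger than the cache itself. Here’s a minimal untested sketch, assuming made-up example values for the cache geometry (substitute your target’s real numbers):

#include <stddef.h>

/* Example values only; plug in your target's actual cache geometry. */
#define CACHE_SIZE_BYTES (32 * 1024)
#define LINE_SIZE_BYTES  32

/* Twice the cache size, so the whole area can never be resident at once. */
static char test_area[2 * CACHE_SIZE_BYTES];

volatile char sink;  /* keeps the compiler from optimizing the loop away */

/* Touch one byte per cache line, marching through the whole area. Each
   full pass re-misses on lines evicted by the previous pass (the exact
   behavior depends on the replacement policy). */
void touch_all_lines(void)
{
    size_t i;
    for (i = 0; i < sizeof(test_area); i += LINE_SIZE_BYTES)
        sink = test_area[i];
}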
With regard to the instruction cache: others have suggested using a large switch()-case statement or calls to functions in disparate locations. Neither approach will be predictably effective unless you carefully (and I mean CAREFULLY) design the size of the code in each case branch, or the locations & sizes of the disparately-located functions. The reason for this is that bytes throughout memory “fold into” (technically, “alias one another” in) the cache in a totally predictable pattern. If you carefully control the number of instructions in each branch of a switch()-case statement, you might get somewhere with your test; but if you just throw a large, indiscriminate amount of instructions into each, you have no idea how they will fold into the cache or which cases of the switch()-case statement alias one another, so you can’t use them to deliberately evict each other from the cache.
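That “fold into” pattern is plain arithmetic. As a sketch (example geometry only, not anyone’s real hardware):

/* Example geometry only: 32K cache, 4-way, 32-byte lines. */
#define CACHE_SIZE_BYTES (32 * 1024)
#define LINE_SIZE_BYTES  32
#define NUM_WAYS         4
#define NUM_SETS (CACHE_SIZE_BYTES / (LINE_SIZE_BYTES * NUM_WAYS)) /* 256 */

/* The set an address lands in. Addresses that differ by a multiple of
   CACHE_SIZE_BYTES / NUM_WAYS (8K here) map to the same set. */
unsigned cache_set(unsigned long addr)
{
    return (unsigned)((addr / LINE_SIZE_BYTES) % NUM_SETS);
}

Two chunks of code exactly 8K apart compete for the same four ways and can evict each other; chunks at other spacings may never conflict at all. That is precisely the control an indiscriminately sized switch()-case gives up.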
I’m guessing you’re not overly familiar with assembly code, but you’ve gotta believe me here: this project is screaming for it. Trust me, I’m not one to use assembly code where it’s not called for, and I strongly prefer programming in OO C++, using STL & polymorphic ADT hierarchies whenever possible. But in your case there’s really no other foolproof way of doing it, because assembly gives you the absolute control over code block sizes that you need in order to generate specified cache hit ratios. You wouldn’t have to become an assembly expert, and you probably wouldn’t even need to learn the instructions & structure required to implement a C-language prologue & epilogue (Google for “C-callable assembly function”). You write some extern “C” function prototypes for your assembly functions, and away you go. If you do care to learn some assembly, the more of the test logic you put in the assembly functions, the less of a “Heisenberg effect” you impose on your test, since you can carefully control where the test control instructions go (and thus their effect on the instruction cache). But the bulk of your test code can be just a bunch of “nop” instructions (the instruction cache doesn’t really care what instructions it contains), with your processor’s “return” instruction at the bottom of each block of code.
Now let’s say your instruction cache is 32K (pretty darn small by today’s standards, but perhaps still common in many embedded systems). If your cache is 4-way set-associative, you can create eight separate, content-identical 8K assembly functions (which, you’ll notice, is 64K worth of code, twice the size of the cache), the bulk of which is just a bunch of NOP instructions. Make them all fall one after another in memory (generally by simply defining each one after the other in the source file). Then call them from a test control function using carefully computed sequences to generate any cache hit ratio you desire (with rather coarse granularity, since the functions are each a full 8K long).

If you call the 1st, 2nd, 3rd, and 4th functions one after another, you know you’ve filled the entire cache with those test functions’ code. Calling any of those again at this point will not result in an instruction cache miss (with the exception of lines evicted by the test control function’s own instructions), but calling any of the others (5th, 6th, 7th, or 8th; let’s just choose the 5th) will evict one of the first four (though which one is evicted depends on your cache’s replacement policy). At this point, the only function you can call knowing you WON’T evict another is the one you just called (the 5th one), and the only ones you can call knowing you WILL evict another are the ones you haven’t yet called (the 6th, 7th, or 8th).

To make this easier, maintain a static array sized to the number of test functions you have, ordered from most recently called to least recently called. To trigger an eviction, call the function at the end of the array & move its pointer to the top of the array, shifting the others down. To NOT trigger an eviction, call the one you most recently called (the one at the top of the array; be sure NOT to shift the others down in this case!). I’ll sketch this bookkeeping in C below. Do some variations on this (perhaps 16 separate 4K assembly functions) if you need finer granularity.

Of course, all of this depends on the test control logic being insignificant in size compared to each associative “way” of the cache. For more positive control you could put the test control logic in the test functions themselves, and for perfect control you’d have to design the control logic entirely without internal branching (branching only at the end of each assembly function), but I think I’ll stop here, since that’s probably over-complicating things.
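Here’s that sketch (untested; the function names match the eight 8K functions from the example above, generate_ratio() is just one possible way to sequence the calls, and the guaranteed-miss behavior assumes an LRU-like replacement policy):

typedef void (*testfunc_t)(void);

extern void myAsmFunc1(void);
extern void myAsmFunc2(void);
extern void myAsmFunc3(void);
extern void myAsmFunc4(void);
extern void myAsmFunc5(void);
extern void myAsmFunc6(void);
extern void myAsmFunc7(void);
extern void myAsmFunc8(void);

#define NUM_FUNCS 8

/* Ordered most-recently-called first, least-recently-called last. */
static testfunc_t funcs[NUM_FUNCS] = {
    myAsmFunc1, myAsmFunc2, myAsmFunc3, myAsmFunc4,
    myAsmFunc5, myAsmFunc6, myAsmFunc7, myAsmFunc8
};

/* Guaranteed miss (under LRU-like replacement): call the least-recently-
   called function and rotate it to the top, shifting the others down. */
static void force_miss(void)
{
    testfunc_t victim = funcs[NUM_FUNCS - 1];
    int i;
    for (i = NUM_FUNCS - 1; i > 0; --i)
        funcs[i] = funcs[i - 1];
    funcs[0] = victim;
    victim();
}

/* Guaranteed hit: call the most-recently-called function again,
   WITHOUT rotating the array. */
static void force_hit(void)
{
    funcs[0]();
}

/* One possible driver: roughly hits_per_miss hits for every miss. */
void generate_ratio(unsigned long calls, unsigned hits_per_miss)
{
    unsigned long n;
    for (n = 0; n < NUM_FUNCS; ++n)  /* prime: call every function once */
        force_miss();
    for (n = 0; n < calls; ++n) {
        if (n % (hits_per_miss + 1) == 0)
            force_miss();
        else
            force_hit();
    }
}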
Off-the-cuff & not tested, the entirety of one of the assembly functions for x86 (GNU assembler syntax) might look like this:
.text
.globl myAsmFunc1 # some toolchains want "_myAsmFunc1"; see the note at the end
myAsmFunc1:
nop
nop
nop # ...exactly enough NOPs to fill one "way" of the cache
nop # minus the 1 byte occupied by the "ret" instruction
.
.
.
nop
ret # return to the caller
For PowerPC it might look like this (also untested):
.text
.globl myAsmFunc1 # again, possibly "_myAsmFunc1" depending on your toolchain
myAsmFunc1:
nop
nop
nop # ...exactly enough NOPs to fill one "way" of the cache
. # minus 4 bytes for the "blr" instruction. Note that
. # on PPC, all instructions (including NOP) are 4 bytes.
.
nop
blr # return to the caller
In both cases, the C++ and C prototypes for calling these functions would be:
extern "C" void myAsmFunc1(); // Prototype for calling from C++ code
void myAsmFunc1(void); /* Prototype for calling from C code */
Depending on your compiler, you might need an underscore in front of the function name in the assembly code itself (but not in your C++/C function prototype).