
I basically need some help to explain/confirm some experimental results.

Basic Theory

A common idea expressed in papers on DVFS is that execution time has an on-chip and an off-chip component. The on-chip component of execution time scales with the clock period (i.e. inversely with CPU frequency), whereas the off-chip component is unaffected by the CPU frequency.

Therefore, for a CPU-bound application, there is a linear relationship between CPU frequency and instruction-retirement rate. On the other hand, for a memory-bound application, where the caches are missed often and DRAM has to be accessed frequently, the relationship should be affine (one is not just a multiple of the other; you also have to add a constant).
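
To make that concrete, the usual decomposition (this is just the standard model from the DVFS literature restated, with f_max the highest frequency setting and T_on, T_off the on-chip and off-chip portions of execution time measured at f_max) is:

    T(f) = T_on * (f_max / f) + T_off

so when T_off is negligible the execution time, and hence the instruction-retirement rate, scales directly with f.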

Experiment

I was doing experiments looking at how CPU frequency affects instruction-retirement rate and execution time under different levels of memory-boundedness.

I wrote a test application in C that traverses a linked list. Each node of the list is exactly the size of a cache line (64 bytes). The list lives in a single large allocation whose size is a multiple of the cache-line size.

The linked list is circular: the last element links back to the first. The list traverses the cache-line-sized blocks of the allocation in a random order; every block is visited, and no block is visited more than once.
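
To make the layout concrete, the construction looks roughly like this (a minimal sketch, not the exact test code; I use a Fisher-Yates shuffle here, which produces the same kind of random permutation, and the names are only for illustration):

    #include <stdlib.h>

    #define CACHE_LINE 64

    /* One node occupies exactly one cache line. */
    struct node {
        struct node *next;
        char pad[CACHE_LINE - sizeof(struct node *)];
    };

    /* Build a circular list that visits every one of the n cache-line-sized
     * blocks exactly once, in random order.  Requires n >= 2. */
    static struct node *build_list(size_t n)
    {
        struct node *nodes;
        size_t *seq, i;

        /* One big cache-line-aligned allocation, a multiple of 64 bytes. */
        if (posix_memalign((void **)&nodes, CACHE_LINE, n * sizeof(*nodes)))
            return NULL;

        /* Random permutation of the indices 1 .. n-1 (Fisher-Yates). */
        seq = malloc((n - 1) * sizeof(*seq));
        for (i = 0; i < n - 1; i++)
            seq[i] = i + 1;
        for (i = n - 2; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = seq[i]; seq[i] = seq[j]; seq[j] = tmp;
        }

        /* Link node 0 -> seq[0] -> seq[1] -> ... -> seq[n-2] -> node 0. */
        nodes[0].next = &nodes[seq[0]];
        for (i = 0; i + 1 < n - 1; i++)
            nodes[seq[i]].next = &nodes[seq[i + 1]];
        nodes[seq[n - 2]].next = &nodes[0];

        free(seq);
        return nodes;
    }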

Because of the random traversal, I assumed the hardware should not be able to prefetch anything. Traversing the list produces a sequence of memory accesses with no stride pattern, no temporal locality, and no spatial locality. Also, because this is a linked list, one memory access cannot begin until the previous one completes, so the accesses should not be parallelizable.
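
The measured loop is then just a pointer chase over that list, so each load depends on the result of the previous one (again only a sketch, using the struct node from above):

    /* Chase the list for `count` hops.  Each load depends on the previous
     * one, so the accesses cannot overlap and there is no stride to prefetch. */
    static struct node *traverse(struct node *head, unsigned long count)
    {
        struct node *p = head;
        while (count--)
            p = p->next;
        return p;   /* returned so the compiler cannot drop the loop */
    }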

When the amount of allocated memory is small enough, there should be no cache misses beyond the initial warm-up. In this case, the workload is effectively CPU-bound and the instruction-retirement rate scales very cleanly with CPU frequency.

When the amount of allocated memory is large enough (bigger than the LLC), the caches should miss regularly. The workload is then memory-bound and the instruction-retirement rate should not scale as well with CPU frequency.

The basic experimental setup is similar to the one described here: "Actual CPU Frequency vs CPU Frequency Reported by the Linux "cpufreq" Subsystem".

The above application is run repeatedly for some duration. At the start and end of the duration, the hardware performance counter is sampled to determine the number of instructions retired over the duration, and the length of the duration is measured as well. The average instruction-retirement rate is then computed as the ratio of these two values.
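
In case it helps, the sampling is conceptually along these lines (a sketch using perf_event_open for PERF_COUNT_HW_INSTRUCTIONS and clock_gettime for the duration; the actual harness is more involved):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Count retired instructions on the calling thread (user space only). */
    static int open_instr_counter(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        /* pid = 0 (this thread), cpu = -1 (any CPU), no group, no flags. */
        return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int fd = open_instr_counter();
        long long before, after;
        struct timespec t0, t1;
        double ns;

        read(fd, &before, sizeof(before));
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* ... run the linked-list traversal here for the chosen duration ... */

        clock_gettime(CLOCK_MONOTONIC, &t1);
        read(fd, &after, sizeof(after));

        ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("retirement rate: %f instructions/ns\n", (after - before) / ns);
        return 0;
    }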

This experiment is repeated across all the possible CPU frequency settings using the "userspace" CPU-frequency governor in Linux. Also, the experiment is repeated for the CPU-bound case and the memory-bound case as described above.
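
For completeness, switching the frequency under the "userspace" governor amounts to writing the desired value (in kHz) into the per-CPU cpufreq sysfs file (sketch; needs root, and assumes scaling_governor has already been set to "userspace"):

    #include <stdio.h>

    /* Request a specific frequency (in kHz) on one CPU via cpufreq sysfs. */
    static int set_cpu_khz(int cpu, unsigned long khz)
    {
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%lu\n", khz);
        fclose(f);
        return 0;
    }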

Results

The following two plots show results for the CPU-bound case and the memory-bound case respectively. The x-axis is the CPU clock frequency in GHz; the y-axis is the instruction-retirement rate in instructions per nanosecond (1/ns).

A marker is placed for each repetition of the experiment described above. The line shows what the result would be if the instruction-retirement rate increased at the same rate as the CPU frequency while passing through the lowest-frequency marker.


[Plot] Instruction-retirement rate vs. CPU frequency for the CPU-bound case.


[Plot] Instruction-retirement rate vs. CPU frequency for the memory-bound case.


The results make sense for the CPU-bound case, but less so for the memory-bound case. All the markers for the memory-bound case fall below the line, which is expected: the instruction-retirement rate should not increase at the same rate as the CPU frequency for a memory-bound application. The markers also appear to fall on straight lines, which is expected as well.

However, there appear to be step changes in the instruction-retirement rate as the CPU frequency changes.

Question

What is causing the step changes in the instruction-retirement rate? The only explanation I could think of is that the memory controller is somehow changing the speed and power-consumption of memory with changes in the rate of memory requests. (As instruction-retirement rate increases, the rate of memory requests should increase as well.) Is this a correct explanation?

Safayet Ahmed
  • Do you have an SSCCE for this? It's interesting nonetheless. – Mysticial Dec 18 '13 at 20:38
  • @Mysticial Can you explain what you mean? I am not sure what sort of example you are referring to. – Safayet Ahmed Dec 18 '13 at 20:45
  • Maybe a sample code that we can run to try to reproduce this ourselves. Also, what processor is this and how are you changing the frequency? Multiplier? Base clock? Are you changing your memory divider strap? – Mysticial Dec 18 '13 at 20:49
  • How did you set the CPU frequency? Constant base clock and varied multiplier? – usr Dec 18 '13 at 21:40
  • How the CPU frequency is set: During runtime, not from the BIOS. Through the Linux cpufreq subsystem. I set the CPU frequency governor to "userspace" and varied the frequency across the available settings. There is a great series of lectures on IBM developerWorks on the topic: http://www.ibm.com/developerworks/linux/library/l-cpufreq-1/ @Mysticial. There is too much code to provide and I'm not sure how much of it I can share. Sorry. – Safayet Ahmed Dec 18 '13 at 21:46
  • Just to make sure - how do you allocate the nodes for the linked list, did you make sure they're not allocated contiguously? Also, are you getting any interrupts, assists, or any side job that may be cpu-bound and benefit from the freq change? – Leeor Dec 18 '13 at 22:37
  • @Leeor Say I need a linked list of length N. (1) I allocate contiguous memory to store the N cache-lines. Call this the cache-line array, ARRAY. (2) Then, I get (N-1) uniformly distributed random numbers and sort them. The new indices of the sorted numbers are actually the random sequence I'm interested in. This sequence of indices consists of the integers 1 to (N-1) in a randomized order. Call this sequence SEQUENCE. Assume that I'm doing this in MATLAB and that the links in my linked list are array indices. I do the following: ARRAY([0, SEQUENCE]) = [SEQUENCE, 0]. – Safayet Ahmed Dec 19 '13 at 02:50
  • @Leeor (Continuing my comment above): As a result, ARRAY(0) = SEQUENCE(0), ARRAY(SEQUENCE(i)) = SEQUENCE(i+1), ..., ARRAY(SEQUENCE(N-1)) = 0. i.e. ARRAY element SEQUENCE(i) points to ARRAY element SEQUENCE(i+1), and ARRAY element SEQUENCE(N-1), the last in the list, points to ARRAY element 0. – Safayet Ahmed Dec 19 '13 at 02:55
  • If the experimental setup is valid, and you are not varying the memory clock, then I would expect a step-like (with possible anti-steps) behavior as the CPU clock crosses (/just misses) integral multiples of the memory clock. – Chris Stratton Dec 19 '13 at 02:55
  • @SafayetAhmed, that's a good way, although to reduce TLB thrashing and remove some noise I'd try to shuffle the sequence only within the 4k pages – Leeor Dec 19 '13 at 06:49
  • @ChrisStratton Can you add an answer elaborating on this? I didn't fully understand why. – Safayet Ahmed Dec 19 '13 at 15:05

1 Answer


You seem to have exactly the results you expected: a roughly linear trend for the CPU-bound program, and a shallow(er) affine one for the memory-bound case (which is less affected by the CPU). You will need a lot more data to determine whether these are consistent steps or whether they are, as I suspect, mostly random jitter depending on how 'good' the list is.

The CPU clock will affect bus clocks, which will affect timings and so on; synchronisation between differently clocked buses is always challenging for hardware designers. The spacing of your steps is, interestingly, 400 MHz, but I wouldn't draw too much from this. Generally, this kind of behaviour is far too complex and hardware-specific to be properly analysed without 'inside' knowledge of the memory controller used, etc.

(please draw nicer lines of best fit)

user3125280