3

I am using gettimeofday() to time a simple matrix multiply example, but I'm getting results that are close to twice too long initially. On a RHEL6 Server machine, I'm getting "bad" timing results for up to nearly 1 second (~65 individual timings in this example). All our other machines are RHEL5 Workstation boxes and this code works much better on them; I only get a couple of "bad" results initially (for first ~20 milliseconds).

From posts on this site, I think this probably has something to do with the OS process scheduler. If I uncomment the first "for" statement below (thereby inserting an initial busy loop by repeatedly initializing the matrices a, b and c), I get zero "bad" results under both RHEL5 Workstation and RHEL6 Server. Alternatively, if I uncomment the sleep statement, I get ALL "bad" timing results for both RHEL5 & RHEL6.

For some reason, my process is starting up with only about half the access to the CPU initially, then it gets "full" access to the CPU as long as the process stays busy. If it "sleeps" and then resumes timing, it again is temporarily only getting about half the full access to the CPU.

Nothing else is happening on the machine (X is not running). I have tried "chrt" to control the priority of the process, but that changed nothing. I've verified this occurs with both GCC 4.4.6 and ICC 12.1.0. I've tried "nice" as well.

Here's the code:

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#define N 225
#define DELAY_LOOPS 8000
main() {
  struct timeval _t0, _t1, _t2;
  double a[N][N], b[N][N], c[N][N];
  double millisec, cum_ms;
  int i, j, k, l, m=0;
  gettimeofday( &_t0, NULL );
  // for( l=0; l<DELAY_LOOPS; l++ )
    for( i=0; i<N; i++ )
      for( j=0; j<N; j++ ) {
        a[i][j]=0;
        b[i][j]=i;
        c[i][j]=j;
      }
  for( l=0; l<75; l++ ) {
    gettimeofday( &_t1, NULL );
    for( i=0; i<N; i++ )
      for( j=0; j<N; j++ )
        for( k=0; k<N; k++ )
          a[i][j]+=b[i][k]*c[k][j];
    gettimeofday( &_t2, NULL );
    millisec  = 1000*(_t2.tv_sec-_t1.tv_sec);
    millisec += 1e-3*(_t2.tv_usec-_t1.tv_usec);
    cum_ms  = 1000*(_t2.tv_sec-_t0.tv_sec);
    cum_ms += 1e-3*(_t2.tv_usec-_t0.tv_usec);
    printf( "%d: duration %fms, cumulative %fms\n",
            m++, millisec, cum_ms );
    // sleep( 2 );
  }
  printf( "a[%d][%d]=%f\n", N/2, N/2, a[N/2][N/2] );
}

and here are the results:

% icc -O2 -o test main.c; ./test
0: duration 13.049000ms, cumulative 13.677000ms
1: duration 13.026000ms, cumulative 26.753000ms
2: duration 12.911000ms, cumulative 39.668000ms
3: duration 12.913000ms, cumulative 52.584000ms
4: duration 12.914000ms, cumulative 65.501000ms
5: duration 12.911000ms, cumulative 78.415000ms
6: duration 12.912000ms, cumulative 91.331000ms
/* snip */
64: duration 12.912000ms, cumulative 840.633000ms
65: duration 10.455000ms, cumulative 851.092000ms
66: duration 5.910000ms, cumulative 857.004000ms
67: duration 5.908000ms, cumulative 862.914000ms
68: duration 5.907000ms, cumulative 868.823000ms
69: duration 5.908000ms, cumulative 874.732000ms
70: duration 5.912000ms, cumulative 880.646000ms
71: duration 5.907000ms, cumulative 886.554000ms
72: duration 5.907000ms, cumulative 892.462000ms
73: duration 5.908000ms, cumulative 898.372000ms
74: duration 5.908000ms, cumulative 904.281000ms
a[112][112]=211680000.000000

I experience the problem regardless of optimization level (-O0, -O1, -O2, etc.).

Does anyone know anything about how scheduling is done under RHEL6 Server? Is it so very different from RHEL5 Workstation? I presume the difference I'm seeing is more a function of the fact that one box is a server edition of RHEL and the other is a workstation edition (rather than the difference between release 5 vs. 6). Is there some simple way to reduce this effect under RHEL6 Server and make it act more like the RHEL5 Workstation boxes?

Any ideas? Thanks.

chrisG
  • 65
  • 4

2 Answers2

4

Don't use gettimeofday(2) for performance measurement. It's way too slow, and simply not designed for the job.

Use clock_gettime(2) instead. It allows you to select from one of a number of system-defined timers. CLOCK_REALTIME is the easiest choice, but CLOCK_PROCESS_CPUTIME_ID might be better if you have it.

alex tingle
  • 6,920
  • 3
  • 25
  • 29
2

Could the processor be entering a low-power state? Something like powertop can tell you that. (in the Frequency stats tab)

Greg
  • 12,119
  • 5
  • 32
  • 34
  • Very interesting suggestion. I hadn't heard of powertop. There's nothing under "P-States (frequencies)" at the top. I have to look at this some more to understand what I'm looking at. Thank you for the suggestion. I'd still like to know how I can make my RHEL6 Server box run more like all the RHEL5 Workstation boxes in this regard. Thanks. – chrisG Dec 30 '11 at 21:25
  • I think this is the best lead so far. Powertop 2.0 [link](http://www.h-online.com/open/features/Powertop-2-0-saving-power-under-Linux-1257057.html) seems very impressive. Unfortunately, it requires a relatively recent kernel (>2.6.36), so it's not an option at the moment for me under either RHEL5 or RHEL6. I will, however, try to use on my laptop (Ubuntu 11.10) to see what I can find out. – chrisG Jan 03 '12 at 20:23
  • This appears to be what is happening. I was able to install Powertop 1.97. Unfortunately, it gave me some very misleading information under the "Tunables" tab. It told me that the "ondemand" cpufreq governor was running. In fact, cpufreq is not running at all. *BUT* cpuidle is in fact active (specifically intel_idle is running in place of acpi_idle but I've switching between the two). Great info on cpuidle here: [link](http://lwn.net/Articles/384146/). Bottom line: all my twelve cores are spending *all* their time in the lowest power state (high latency) when I sleep btw timing calls. – chrisG Jan 04 '12 at 16:25
  • 1
    *SOLUTION FOUND*: Turn off ACPI altogether! This is a big-hammer solution, but it's the only thing I've found that works. Computer's purpose is to run models at full-bore to compare OpenCL, CUDA and OpenMP solutions. Saving power is not a priority. I give up trying to tune ACPI. From what I've found using google, it appears that there's a bug in ACPI. With ACPI on under RHEL6, timing is very difficult because a busy loop is required first to "get the CPU up to speed". If the application uses OpenMP, the busy loop must also. No impact under RHEL5 Server (minimal under RHEL5 Workstation). – chrisG Jan 04 '12 at 17:51
  • Just saw your comments. Glad you figured it out. – Greg Feb 20 '12 at 05:43