3

I am using the C program below (compiled with gcc) to measure the computing time of a given code section. The issue is that when usleep() function is used outside the code section being profiled, different measurements are obtained in elapsed_time.

#include <unistd.h>
#include <stdio.h>
#include <time.h>

int main()
{
    int f = 0;
    while (f < 10)
    {
        double sum = 1.0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);

        // CODE SECTION BEING PROFILED
        for (unsigned int i = 0; i < 5e6; i++)
            sum += 0.001;
        // END OF CODE SECTION BEING PROFILED

        clock_gettime(CLOCK_MONOTONIC, &t1);


        double elapsed_time = t1.tv_sec - t0.tv_sec;
        elapsed_time += (t1.tv_nsec - t0.tv_nsec) * 1e-9;

        printf("%.2f ms\n", elapsed_time*1000 );

        f++;

        // Sleep causing fake clock_gettime measurements?
        usleep(50000);
    }

    return 0;
}

Below I paste the result of the programs when uleep(50000) is commented:

13.83 ms
14.30 ms
14.41 ms
13.77 ms
13.85 ms
13.88 ms
14.55 ms
13.28 ms
13.70 ms
13.56 ms

Now, the results when uleep(50000) is present:

15.37 ms
14.16 ms
**36.43 ms
39.39 ms
36.38 ms
36.27 ms
34.14 ms
38.52 ms
38.18 ms
37.53 ms**

I tried to measure times with different clocks in clock_gettime(), and also using the C++ std::chrono::high_resolution_clock. However, the results are similar in all cases. Moreover, I tried different gcc flags e.g. O2. In this case, although much smaller times are obtained, still some measured values when usleep() is present are ~3x higher than when usleep() is not present.

I found that the amount of microseconds used in usleep() affects the resulting measured times, e.g. with 10000us I obtain:

13.75 ms
13.54 ms
13.34 ms
14.15 ms
13.12 ms
12.79 ms
**28.72 ms
25.84 ms
26.18 ms
24.53 ms**

After all theses tests I cannot find what is causing these "fake" time measurements in this program.


Edit:

After some further testing I realized that the same problem is being caused by printf() (see the code below).

#include <unistd.h>
#include <stdio.h>
#include <time.h>

int main()
{
    double total_time = 0.0;
    int f = 0;
    while (f < 1000)
    {
        double sum = 1.0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);

        // Code section being profiled
        for (unsigned int i = 0; i < 5e6; i++)
            sum += 0.001;
        // End of Code section being profiled

        clock_gettime(CLOCK_MONOTONIC, &t1);


        double elapsed_time = t1.tv_sec - t0.tv_sec;
        elapsed_time += (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        total_time += elapsed_time;


        f++;

        // Sleep/printf causing fake clock_gettime measurements?
        printf("%.2f ms\n", elapsed_time*1000 );
        // usleep(10000);
    }
    printf("TOTAL TIME: %.2f ms\n", total_time*1000 );
    return 0;
}

Results

When I run this code compiled with g++ -g f, I obtain a total time of:

  • ~13071.42 ms when line printf("%.2f ms\n", elapsed_time*1000 ); is present (nota that this line is again outside the section being profiled).
  • ~12712.22 ms, when that line is commented.

If the program is compiled using optimization (g++ -O2), I obtain a total time of:

  • ~3145.27 ms when that printf() is present.
  • ~2741.45 ms when that printf() is not present.

With printf() the measured times increases to a lesser extent compared to usleep(). So, the usleep() doesn't seem to be the only function causing this problem... Anyway the measured time should not be affected for code placed outside the code section being profiled... Am I missing something?


Edit 2:

The different clocks I tried are: CLOCK_MONOTONIC_RAW, CLOCK_PROCESS_CPUTIME_ID and CLOCK_REALTIME. All of them offer the same behaviour.

Moreover, I verified the return value of clock_gettime() to discard failures.

genpfault
  • 51,148
  • 11
  • 85
  • 139
R2pro
  • 113
  • 1
  • 8
  • Note that in the code shown the timed section has no visible side-effects (as far as I can see) and the compiler is, therefore, free to simply elide it leaving you with two consecutive calls to `clock_gettime`. That being the case I'm a bit suspicious about the timings you give. Was the code compiled with optimizations enabled? – G.M. Feb 22 '20 at 11:21
  • No, I just used `g++ -g` for the results I included in the post. Moreover, I just posted this simply example to show the problem, but I am facing this issue in a more complex program where the profiled section is quite larger. Thanks for yout answer – R2pro Feb 22 '20 at 11:24
  • 2
    The first run always seems to get correctly timed on my machine. You are right, it seems like `usleep()` seems to interact with successive clock measurements somehow. Very interesting. – Marco Bonelli Feb 22 '20 at 12:36
  • Thanks for running it. I will keep doing tests trying to find out where this behaviour comes from. – R2pro Feb 22 '20 at 15:50
  • 1
    I'm starting to think this may have something to do with scheduling. The more time your process takes, the more likely it is to be preempted by the kernel (and that's likely happening in the middle of your busyloop since that's where 90% of the time is spent). I cannot confirm this without creating some custom kernel module to hook the scheduler and do a check though. – Marco Bonelli Feb 24 '20 at 19:16
  • Could that be happening even when CLOCK_MONOTONIC_RAW is used? This clock is supposed to provide raw hardware-based time measurements. However I tested it with same results. – R2pro Feb 24 '20 at 20:55
  • @R2pro if what I think is true, then *any* clock would give the same results, that's because the measurements are correct, and the measured section *really* takes more time. I still cannot say it for sure though. – Marco Bonelli Feb 24 '20 at 21:37
  • Interestingly enough, I get different results on my laptop (kernel v5.1): in both cases measurements seem fine. What's your kernel version? – Marco Bonelli Feb 24 '20 at 21:53
  • My kernel version is 4.15.0-88-generic. Following your tests, I will try this afternoon in another PC with kernel version 5 – R2pro Feb 25 '20 at 07:48
  • I have performed the tests in another machine with kernel v5.3.0-40 and I still get significant differences in the measured times (almost ~2x when usleep is there) – R2pro Feb 25 '20 at 21:05
  • 1
    @R2pro I think I've got the answer. Do `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor` and tell me what's the output. Then do `echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor` before running the test, confirm that it was correctly set by doing `cat` again, then run the test and tell me the results. – Marco Bonelli Feb 26 '20 at 16:10
  • The output of `cat` was `powersave` (x8). I changed to `performance`(x8) as suggested but the results remains the same. – R2pro Feb 27 '20 at 07:51
  • @R2pro what happens if you temporarily change it to `userspace` and then do `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_setspeed` ? – Marco Bonelli Feb 27 '20 at 16:35
  • (of course still check that it was correctly changed) – Marco Bonelli Feb 28 '20 at 12:28
  • I cannot change it to `userspace`. Just `powersave` and `performance` – R2pro Feb 28 '20 at 22:35
  • @Marco Bonelli can you change yo `userspace` in kernel 5.1? – R2pro Mar 01 '20 at 15:09
  • @R2pro yes I can. It depends on the scaling driver that you have. I will post a detailed answer to later today don't worry. – Marco Bonelli Mar 01 '20 at 15:12
  • Ok. I only tried in the PC with version 4.15 but I will later try to change to userspace in the one with 5.3 – R2pro Mar 01 '20 at 15:14

1 Answers1

2

Looks like you're being a victim of CPU frequency scaling.

CPU frequency scaling

Your CPU scaling driver regulates the CPU frequency according to various factors, but most importantly according to the current CPU load. There are various drivers available for the Linux kernel, and different scaling governors. Each driver can support a certain set of scaling governors and behave differently according to the currently active governor.

For example, the default driver on my laptop is intel_pstate, with available governors powersave and performance. If I change driver to acpi-cpufreq I also get ondemand, userspace and conservative.

The main difference between governors is implied by their name: the powersave governor will try to keep the CPU at a lower frequency to save power, while the performance governor will try to keep the CPU at a high frequency to make things go faster.

In some cases, the performance governor just fixes the CPU frequency to the maximum available (for me, this is what happens with the intel_pstate driver). In other cases, it will just try to keep it "high", but it still makes adjustments (for me, this is what happens with the acpi-cpufreq driver).

In any case, a combination of scaling driver and governor that dynamically regulates the CPU frequency, will take into account the CPU load in order to make frequency adjustments on the fly when possible. This can either happen when the kernel code is entered through a syscall, or on a scheduling tick when the kernel itself runs the scheduler to re-schedule the currently running processes.

What is happening?

You most probably have a combination of scaling driver and governor that dynamically sets the CPU frequency.

When your code runs without calling usleep(), the CPU is under heavy load and the frequency is mostly stable. However, if your code calls usleep(), the CPU load is reduced significantly, and the scaling driver lowers the CPU frequency. By the time the process is waken up by the kernel after the sleep, the frequency is much lower, and it takes some time before the scaling driver realizes that it needs to be boosted up. On top of this, if you keep sleeping in regular intervals, chances are that the driver will never have enough time to re-scale the frequency back up and the code will just run slower.

This also applies to printf(), which needs to make a write() syscall in order to print text, and to almost any other syscall. Switching between user space and kernel space slows down the process, thus making the frequency get lowered by the scaling driver. This does not happen for some syscalls like clock_gettime(), which is optimized to run in user space and does not require a context-switch.

Here's an example, on my machine, with a dynamic governor (e.g. ondemand):

demo

You can clearly see the CPU frequency getting stuck at the maximum value during the first run (without usleep()), and then fluctuating up and down without enough time to get steady during the second run (with usleep()). You can in fact notice that the avg time for Test #2 is almost 3x that of Test #1. This also happens for acpi-cpufreq and governor performance on my machine.

Solution

If you are able to set up a different scaling driver/governor combination that keeps a fixed CPU frequency, you will see no difference (or anyway a very small difference) in timing between the two versions of your code.

Here's another example, on my machine, with a static CPU frequency (e.g. using the userspace governor and setting a fixed speed manually):

demo 2

As you can see, both tests run in about the same time.

If you are not able to set a different scaling governor, try changing the scaling driver of your kernel. If running Debian or Ubuntu, you may have loadable modules for different scaling drivers.

You can see available scaling drivers and governors for your kernel by looking at your current kernel configuration:

cat /boot/config-$(uname -r) | grep FREQ

For example I can see:

...
CONFIG_X86_PCC_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ=m
...

Where m means "available as a module" (loadable with modprobe) and y means "built-in".

You could then, for example, try to do:

# Load acpi-cpufreq since we have CONFIG_X86_ACPI_CPUFREQ=m
sudo modprobe acpi_cpufreq 

# Switch driver
echo acpi-cpufreq | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver

# Check available governors for the driver
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_governors

Test code

Here's the benchmarking code used for the above GIFs:

#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <sched.h>
#include <sys/types.h>

/**
 * Marco Bonelli - 2020-03-02
 * https://stackoverflow.com/a/60481392/3889449
 *
 * Test the effect of different scaling governors on CPU frequency and
 * performance under:
 *
 *   1) Continuous heavy load.
 *   2) Intermittent and short heavy load.
 *
 * gcc -O3 scaling_governor_test.c -o test
 * ./test [N_RUNS] [N_CYCLES_PER_RUN] [TEST2_DELAY_US]
 */

#define DEFAULT_RUNS   1000
#define DEFAULT_CYCLES 1000 * 1000
#define DEFAULT_DELAY  100 * 1000

// Don't optimize this as GCC would basically trash the whole function.
#pragma GCC push_options
#pragma GCC optimize("O0")
void __attribute__ ((noinline)) func(unsigned n) {
    double sum = 1.0;

    for (unsigned i = 0; i < n; i++)
        sum += 0.001;
}
#pragma GCC pop_options

void warmup(unsigned runs, unsigned cycles) {
    for (unsigned n = 1; n <= runs; n++)
        func(cycles);
}

double bench(unsigned n) {
    struct timespec t0, t1;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
    func(n);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);

    return (t1.tv_sec - t0.tv_sec)*1000.0L + (t1.tv_nsec - t0.tv_nsec)/1000.0L/1000.0L;
}

void setup_affinity(void) {
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);

    if (geteuid() == 0) {
        if (sched_setaffinity(0, sizeof(set), &set) == 0)
            puts("Affinity set to CPU #0.");
        else
            perror("sched_setaffinity");
    } else {
        puts("Running as normal user, run as root to set CPU affinity.");
    }
}

int main(int argc, char **argv) {
    unsigned runs, cycles, delay;
    double cur, tot1, tot2, min, max, avg;

    if (argc < 2 || sscanf(argv[1], "%i", &runs) != 1 || runs < 1)
        runs = DEFAULT_RUNS;

    if (argc < 3 || sscanf(argv[2], "%i", &cycles) != 1 || cycles < 1)
        cycles = DEFAULT_CYCLES;

    if (argc < 4 || sscanf(argv[3], "%i", &delay) != 1 || delay < 1)
        delay = DEFAULT_DELAY;

    setup_affinity();

    printf("Benchmarking %u runs of %u cycles each.\n", runs, cycles);
    printf("Test #1 will proceed normally.\nTest #2 will usleep(%u) before each run.\n", delay);
    fputs("Warming up... ", stdout);
    fflush(stdout);

    warmup(10, cycles);

    puts("done.\n---");

    tot1 = 0;
    min = INFINITY;
    max = -INFINITY;

    for (unsigned n = 1; n <= runs; n++) {
        cur = bench(cycles);

        tot1 += cur;
        avg = tot1 / n;
        if (cur < min) min = cur;
        if (cur > max) max = cur;

        printf("\rTest #1: tot %-9.3f  avg %-7.3f  min %-7.3f  max %-7.3f [ms]", tot1, avg, min, max);
        fflush(stdout);
    }

    putchar('\n');

    tot2 = 0;
    min = INFINITY;
    max = -INFINITY;

    for (unsigned n = 1; n <= runs; n++) {
        usleep(delay);
        cur = bench(cycles);

        tot2 += cur;
        avg = tot2 / n;
        if (cur < min) min = cur;
        if (cur > max) max = cur;

        printf("\rTest #2: tot %-9.3f  avg %-7.3f  min %-7.3f  max %-7.3f [ms]", tot2, avg, min, max);
        fflush(stdout);
    }

    puts("\n---");

    if (tot1 < tot2)
        printf("Test #2 ran ~%.3fx slower than Test #1.\n", tot2/tot1);
    else if (tot1 > tot2)
        printf("Test #1 ran ~%.3fx slower than Test #2.\n", tot1/tot2);
    else
        puts("Reality is a simulation.");

    if (avg < 0.5)
        puts("Such low average times are not a good indicator. You should re-run the rest with different parameters.");

    return 0;
}
Marco Bonelli
  • 63,369
  • 21
  • 118
  • 128
  • Thank you for your answer and your effort. As you suggested I changed to `acpi-cpufreq` driver and used `userspace` with a fixed freq. However, I still gets different time measurements (see result below). I believe that your answer is the solution to this problem but now the issue is that `cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq` gives varying freqs even though it is supposed to be fix. `Test #1: tot 2578.912 avg 2.579 min 2.410 max 3.175 [ms] Test #2: tot 3247.200 avg 3.247 min 2.579 max 3.525 [ms] --- Test #2 ran ~1.259x slower than Test #1.` – R2pro Mar 02 '20 at 11:04
  • The frequency is continuously varying around 2.8Ghz: 2474640, 2800147, 2352618, 2687954, 2801731, ... Do you know why that happens? – R2pro Mar 02 '20 at 11:25
  • @R2pro if your frequency keeps varying even with a set value and the `userspace` governor then that's an issue with your scaling driver. Try enabling the boost (see the linked page at the top of my answer), even though I doubt that'll change much. At this point it's a matter of painful trial and error to understand why the scaling driver doesn't do what you want it to do. Can't really say much more without personally looking at your computer. – Marco Bonelli Mar 02 '20 at 13:07
  • I have perform the tests in other computer with kernel 5.3, `intel_pstate` enabled and `performance` mode, and that solves the problem! However, the freq is oscillating around the maximum unlike yours. Thanks – R2pro Mar 02 '20 at 16:59