My original C++ demo code is shown below:
int counter = 0;
while (counter < 5) {
    auto start = std::chrono::high_resolution_clock::now();
    // instance and result are pre-defined local variables
    instance.Search(40.055948, 116.411325, &result);
    auto end = std::chrono::high_resolution_clock::now();
    int64_t cost_us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << "cost_us=" << cost_us << std::endl;
    // usleep(100); // case 1: sleep 100 us
    // sleep(1);    // case 2: sleep 1 second
    //              // case 3: no sleep at all
    counter++;
}
Some people may suspect that the call to instance.Search() introduces something unknown, so please refer to the self-contained code below:
#include <time.h>
#include <unistd.h>
#include <iostream>
#include <set>
#include <chrono>

void test(const std::set<int>& numbers) {
    for (int counter = 0; counter < 5; ++counter) {
        auto start = std::chrono::high_resolution_clock::now();
        auto it = numbers.lower_bound(5555555);
        auto end = std::chrono::high_resolution_clock::now();
        int64_t cost_us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::cout << "cost_us=" << cost_us << std::endl;
        // usleep(100); // case 1
        // sleep(1);    // case 2
    }
}

int main() {
    std::set<int> test_set;
    for (int i = 0; i < 100000000; i++) {
        test_set.insert(i);
    }
    test(test_set);
    return 0;
}
Precondition: perf stat is used to count cache misses, instructions, and so on.
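(The exact invocation below is my reconstruction from the counter names in the output, so treat it as an assumption rather than the literal command: roughly perf stat -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ./latency_perf_test_without_sleep, optionally prefixed with taskset -c <core> when pinning.)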
Case 1: with usleep(100) at the end of the loop, the output after the process finishes is:
cost_us=5
cost_us=5
cost_us=5
cost_us=8
cost_us=6
Performance counter stats for './latency_perf_test_sleep_100_us':
1,785,438 cache-references
419,583 cache-misses # 23.500 % of all cache refs
203,832,235 cycles
118,093,490 instructions # 0.58 insn per cycle
23,198,708 branches
35,092 faults
302 migrations
1.031460583 seconds time elapsed
Case 2: with sleep(1), the result is:
cost_us=7
cost_us=65
cost_us=21
cost_us=21
cost_us=32
Performance counter stats for './latency_perf_test_sleep_1_sec':
15,093,302 cache-references
1,303,941 cache-misses # 8.639 % of all cache refs
14,759,103,041 cycles
24,548,401,788 instructions # 1.66 insn per cycle
5,062,488,529 branches
35,372 faults
3,444 migrations
6.033182248 seconds time elapsed
Case 3: with no sleep()/usleep() at all, the result is:
cost_us=5
cost_us=2
cost_us=1
cost_us=1
cost_us=1
Performance counter stats for './latency_perf_test_without_sleep':
1,715,128 cache-references
420,368 cache-misses # 24.509 % of all cache refs
209,238,013 cycles
130,647,626 instructions # 0.62 insn per cycle
25,827,456 branches
35,092 faults
362 migrations
1.032256618 seconds time elapsed
As shown above, the time cost of the same function/method call differs greatly between the cases. At first I suspected that sleep() causes cache misses (evicting the data used by my call). However, after I used taskset to bind the process to a specific CPU core, the difference did not disappear as I expected.
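(For reference, binding can also be done from inside the process. Below is a minimal sketch of the equivalent call, assuming Linux; the helper name pin_to_core is mine, and taskset -c uses the same sched_setaffinity(2) mechanism underneath.)

#include <sched.h>  // sched_setaffinity, CPU_ZERO, CPU_SET (may need -D_GNU_SOURCE; g++ defines it by default)

// Pin the calling thread to the given CPU core before the timing loop runs.
bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // pid 0 means "the calling thread"
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}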
I also wonder why adding sleep()/usleep() leads to such a sharp increase in the instructions counted by perf stat.
I didn't read the source code of sleep() or usleep(), but I guess the kernel may do some trick when a process calls sleep() or usleep() (both call nanosleep() inside).
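(To illustrate what I mean by "both call nanosleep() inside", here is a sketch based on the usleep(3) man page, not the actual glibc source; usleep_sketch is a made-up name.)

#include <time.h>    // nanosleep, struct timespec
#include <unistd.h>  // useconds_t

// Roughly what usleep() boils down to: convert the microsecond count
// to a timespec and forward it to nanosleep(2), which enters the kernel
// and blocks the process until the timer expires.
int usleep_sketch(useconds_t usec) {
    struct timespec ts;
    ts.tv_sec  = usec / 1000000;
    ts.tv_nsec = (usec % 1000000) * 1000L;
    return nanosleep(&ts, nullptr);
}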
Could anyone explain the reason behind this weird phenomenon? Thanks in advance.