I am developing a low-latency service in C++ on Linux. I ran two groups of performance tests:

  1. Sending 1 request per second: the average latency is 3.5 microseconds.
  2. Sending 10 requests per second: the average latency is 2.7 microseconds.

I cannot understand why. My guess is that a function runs faster when it is called more frequently, so I wrote a demo to test it.

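#include <stdio.h>
#include <stdlib.h>     // atoi
#include <time.h>       // clock_gettime, CLOCK_MONOTONIC
#include <unistd.h>     // sysconf
#include <pthread.h>    // pthread_setaffinity_np, pthread_self
#include <sched.h>      // cpu_set_t, CPU_ZERO, CPU_SET
#include <chrono>
#include <thread>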

using namespace std;

long long get_curr_nsec()
{
    struct timespec now;
    ::clock_gettime(CLOCK_MONOTONIC, &now);
    return now.tv_sec * 1000000000 + now.tv_nsec;
}

long long func(int n)
{
    long long t1 = get_curr_nsec();
    int sum = 0;
    for(int i = 0; i < n ;i++)
    {
        //compiler barrier: keep sum *= (sum+1) from being optimized away
        __asm__ __volatile__("": : :"memory");
        sum *= (sum+1);
    }

    return get_curr_nsec() - t1;
}

bool bind_cpu(int cpu_id, pthread_t tid)
{
    int cpu = (int)sysconf(_SC_NPROCESSORS_ONLN);
    cpu_set_t cpu_info;
    
    // valid CPU ids are 0 .. cpu-1
    if (cpu_id < 0 || cpu_id >= cpu)
    {
        printf("bind cpu failed: cpu_id[%d] is outside 0..%d\n", cpu_id, cpu - 1);
        return false;
    }
    
    CPU_ZERO(&cpu_info);
    CPU_SET(cpu_id, &cpu_info);
    
    int ret = pthread_setaffinity_np(tid, sizeof(cpu_set_t), &cpu_info);
    if (ret)
    {
        printf("bind cpu failed, ret=%d\n", ret);
        return false;
    }
    
    return true;
}
int main(int argc, char **argv)
{
    //make sure the program does not switch CPUs
    bind_cpu(3, ::pthread_self());

    //first argv:call times
    //second argv:interval between call function
    int times = ::atoi(argv[1]);
    int interval = ::atoi(argv[2]);

    long long sum = 0;
    for(int i = 0; i < times; i++)
    {
        if(interval > 0)
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(interval));
        }
        sum +=  func(100);
    }

    printf("avg elapse:%lld ns\n", sum/ times);
    return 0;
}

The compile command is g++ --std=c++11 ./main.cpp -O2 -lpthread, and I ran the tests below:

  1. Call function 100 times without sleep, ./a.out 100 0, output:avg elapse:35 ns
  2. Call function 100 times with sleep 1 ms, ./a.out 100 1, output:avg elapse:36 ns
  3. Call function 100 times with sleep 10 ms, ./a.out 100 10, output:avg elapse:40 ns
  4. Call function 100 times with sleep 100 ms, ./a.out 100 100, output:avg elapse:45 ns
  5. Call function 100 times with sleep 1000 ms, ./a.out 100 1000, output:avg elapse:50 ns

My OS is CentOS Linux release 7.6.1810 (Core). My CPU is an Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz.

I am confused and do not know what causes this. Is it the CPU, the OS, or the system call (sleep)?

Afterwards I used perf stat to count branches:

  1. perf stat ./a.out 100 1: 241779 branches, 7091 branch-misses;
  2. perf stat ./a.out 100 100: 241791 branches, 7636 branch-misses.

It seems that sleeping 100 ms produces more branch-misses. But I am still not certain this is the reason, and I don't know why sleeping 100 ms would produce more branch-misses.
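
An average alone cannot show whether every call gets slower after a long sleep or whether only a few calls are much slower. One way to check would be to keep each measured value and look at the distribution. This is a minimal sketch, assuming the demo is changed to push every `func(100)` result into a `std::vector<long long>`; `LatencySummary` and `summarize` are hypothetical names that are not part of the program above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: order statistics of per-call latencies in ns.
struct LatencySummary {
    long long min_ns;
    long long median_ns;  // upper median when the sample count is even
    long long p99_ns;     // value at index floor(0.99 * n) of the sorted samples
    long long max_ns;
};

// Takes the samples by value so the caller's vector keeps its original order.
LatencySummary summarize(std::vector<long long> samples)
{
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());

    const std::size_t n = samples.size();
    LatencySummary s;
    s.min_ns = samples.front();
    s.median_ns = samples[n / 2];
    s.p99_ns = samples[(n * 99) / 100];  // always < n for n >= 1
    s.max_ns = samples.back();
    return s;
}
```

If the median stays flat across intervals while the maximum grows, the average is being pulled up by a few slow calls; if the median itself moves, every call is affected.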

Frank Liu
  • I don't know what the deciding factor is in this case, but I am not surprised by the qualitative outcome. On the one hand, if you quickly repeat an action you benefit from caching of all kinds, while you don't if you do something rarely. On the other hand, I don't see any reason that doing something rarely should make it faster, except for cases where the CPU is put under so heavy load that throttling and similar effects become relevant. – user17732522 Feb 15 '22 at 03:02
  • Thanks. I considered the CPU cache. But I do not know what the difference is between sleeping 1 ms and sleeping 10 ms? – Frank Liu Feb 15 '22 at 03:12
  • Other considerations potentially in play are branch prediction, speculative execution, and pipeline stalling (when a branch prediction is wrong and the speculatively executed instruction stream needs to be backed up). Doing something rarely makes it more difficult to accurately predict which path should be taken, and therefore increases the chances of stalling the instruction pipeline. The specifics depend on what strategies a particular CPU uses to speculatively predict which path to execute, how many instructions it executes before having to stall, etc. – Peter Feb 15 '22 at 05:01
  • Read about the CPU frequency governor. – n. m. could be an AI Feb 15 '22 at 05:19
  • @ n. 1.8e9-where's-my-share m. Thanks. I execute `cpupower frequency-info`, it shows: available cpufreq governors: performance powersave current policy: frequency should be within 1.20 GHz and 5.00 GHz. The governor "performance" may decide which speed to use within this range. – Frank Liu Feb 15 '22 at 05:46
  • Thanks @Peter. I used perf to count branches. Running ./a.out 100 1 gives 241779 branches, 7091 branch-misses; running ./a.out 100 100 gives 241791 branches, 7636 branch-misses. It seems sleeping 100 ms produces more branch-misses. But I am still not certain this is the reason. – Frank Liu Feb 15 '22 at 06:12
  • _But I do not know what the difference is between sleeping 1 ms and sleeping 10 ms?_ On one hand, there might be no difference if `sleep()` needs a minimum duration to take effect at all. On Windows, this was roughly 16 ms under certain conditions in the past (and still might be). On the other hand, suspending a thread for 1 ms frees the CPU core for other tasks for at least 1 ms; suspending it for 10 ms frees it for at least 10 times longer, i.e. 10 times more time for other processes and threads to occupy resources that your thread might ask for. – Scheff's Cat Feb 15 '22 at 07:09
  • Sleeping longer means other programs have more chance to run (and occupy machine resources) while your program is sleeping. I think that's why you got such a result. BUT, I think your demo proves NOTHING about your real issue. We don't know how you sent the request 1 or 10 times. – Yves Feb 16 '22 at 02:41
  • Look at /proc/cpuinfo before and immediately after you run your program. Is there any difference in reported clock frequency? – n. m. could be an AI Feb 16 '22 at 08:54
  • @n. 1.8e9-where's-my-share m. The clock frequency never changed. – Frank Liu Feb 17 '22 at 01:25
  • Notice that `sysconf(_SC_NPROCESSORS_ONLN)` will count isolated cores as well so you might end up piling more threads than processes. – Something Something Sep 23 '22 at 21:40
  • Your code does not compile. https://godbolt.org/z/WKnPshTq1 – Something Something Sep 23 '22 at 21:53
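
Several comments above point at what happens to the core while the thread sleeps (other tasks using its caches, predictor state going cold, frequency changes). One way to probe this, offered only as a sketch, is to spin for the same interval instead of sleeping, so the core stays occupied by the benchmark thread between calls. `busy_wait` is a hypothetical helper that is not part of the posted program; it assumes `std::chrono::steady_clock` is cheap enough to poll in a loop.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical replacement for sleep_for in the demo: poll the steady clock
// until the interval has elapsed. Returns the number of loop iterations.
long long busy_wait(std::chrono::nanoseconds interval)
{
    assert(interval.count() >= 0);
    const auto deadline = std::chrono::steady_clock::now() + interval;

    long long spins = 0;
    while (std::chrono::steady_clock::now() < deadline) {
        ++spins;
    }
    return spins;
}
```

Calling `busy_wait(std::chrono::milliseconds(interval))` in place of `sleep_for` and comparing the averages would show whether the slowdown follows the idle time itself. It would not, on its own, say which of the effects named in the comments is responsible.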

0 Answers