
Background: I have two servers; both run OS kernel version 4.18.7, which has CONFIG_BPF_SYSCALL=y.

I created a shell script 'x.sh':

i=0 
while (( i < 1000000 )) 
do (( i ++ )) 
done

and ran the command: perf stat ./x.sh
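For completeness, a fully specified reproduction looks like this; the shebang line is my addition (the original script has none, which matters later in the discussion):

#!/bin/bash
# x.sh: pure-bash arithmetic loop; (( )) is a shell builtin,
# so the loop forks no external processes
i=0
while (( i < 1000000 ))
do (( i ++ ))
done

and run it under perf:

chmod +x x.sh
perf stat ./x.sh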

The shell version on both servers is "4.2.6(1)-release".

S1: CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, microcode: 0xb00002e. perf stat result:

   5391.653531      task-clock (msec)         #    1.000 CPUs utilized          
             4      context-switches          #    0.001 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           107      page-faults               #    0.020 K/sec                  
12,910,036,202      cycles                    #    2.394 GHz                    
27,055,073,385      instructions              #    2.10  insn per cycle         
 6,527,267,657      branches                  # 1210.624 M/sec                  
    34,787,686      branch-misses             #    0.53% of all branches        

   5.392121575 seconds time elapsed

S2: CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, microcode: 0xb00002e. perf stat result:

  10688.669439      task-clock (msec)         #    1.000 CPUs utilized          
             6      context-switches          #    0.001 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           105      page-faults               #    0.010 K/sec                  
24,583,857,467      cycles                    #    2.300 GHz                    
27,117,299,405      instructions              #    1.10  insn per cycle         
 6,571,204,123      branches                  #  614.782 M/sec                  
    32,996,513      branch-misses             #    0.50% of all branches        

  10.688907278 seconds time elapsed

Question: the CPUs are similar and the OS kernel is the same, so why is the cycle count reported by perf stat so different?

Edit: I modified the script and the command. In x.sh, I made the loop count smaller to reduce the run time:

i=0
while (( i < 10000 )) 
do
  (( i ++))
done

For the command, I added more detail events and repeated runs: perf stat -d -d -d -r 100 ./x.sh

Result for S1:

     54.007015      task-clock (msec)         #    0.993 CPUs utilized            ( +-  0.09% )
             0      context-switches          #    0.002 K/sec                    ( +- 29.68% )
             0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
           106      page-faults               #    0.002 M/sec                    ( +-  0.12% )
   128,380,832      cycles                    #    2.377 GHz                      ( +-  0.09% )  (30.52%)
   252,497,672      instructions              #    1.97  insn per cycle           ( +-  0.01% )  (39.75%)
    60,741,861      branches                  # 1124.703 M/sec                    ( +-  0.01% )  (40.63%)
       451,011      branch-misses             #    0.74% of all branches          ( +-  0.29% )  (40.72%)
    66,621,188      L1-dcache-loads           # 1233.565 M/sec                    ( +-  0.01% )  (40.76%)
        52,248      L1-dcache-load-misses     #    0.08% of all L1-dcache hits    ( +-  4.55% )  (39.86%)
         1,568      LLC-loads                 #    0.029 M/sec                    ( +-  9.58% )  (29.75%)
           168      LLC-load-misses           #   21.47% of all LL-cache hits     ( +-  3.87% )  (29.66%)
<not supported>      L1-icache-loads                                             
       672,212      L1-icache-load-misses                                         ( +-  0.85% )  (29.62%)
    67,630,589      dTLB-loads                # 1252.256 M/sec                    ( +-  0.01% )  (29.62%)
         1,051      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +- 33.11% )  (29.62%)
        13,929      iTLB-loads                #    0.258 M/sec                    ( +- 17.85% )  (29.62%)
        44,327      iTLB-load-misses          #  318.24% of all iTLB cache hits   ( +-  8.12% )  (29.62%)
<not supported>      L1-dcache-prefetches
<not supported>      L1-dcache-prefetch-misses                                   

   0.054370018 seconds time elapsed                                          ( +-  0.08% )

Result for S2:

    106.405511      task-clock (msec)         #    0.996 CPUs utilized            ( +-  0.07% )
             0      context-switches          #    0.002 K/sec                    ( +- 18.92% )
             0      cpu-migrations            #    0.000 K/sec                  
           106      page-faults               #    0.994 K/sec                    ( +-  0.09% )
   242,242,714      cycles                    #    2.277 GHz                      ( +-  0.07% )  (30.55%)
   260,394,910      instructions              #    1.07  insn per cycle           ( +-  0.01% )  (39.00%)
    62,877,430      branches                  #  590.923 M/sec                    ( +-  0.01% )  (39.65%)
       407,887      branch-misses             #    0.65% of all branches          ( +-  0.25% )  (39.81%)
    68,137,265      L1-dcache-loads           #  640.355 M/sec                    ( +-  0.01% )  (39.84%)
        70,330      L1-dcache-load-misses     #    0.10% of all L1-dcache hits    ( +-  2.91% )  (39.38%)
         3,526      LLC-loads                 #    0.033 M/sec                    ( +-  7.33% )  (30.28%)
           153      LLC-load-misses           #    8.69% of all LL-cache hits     ( +-  6.29% )  (30.12%)
<not supported>      L1-icache-loads                                             
       878,021      L1-icache-load-misses                                         ( +-  0.43% )  (30.09%)
    68,442,021      dTLB-loads                #  643.219 M/sec                    ( +-  0.01% )  (30.07%)
         9,518      dTLB-load-misses          #    0.01% of all dTLB cache hits   ( +-  2.58% )  (30.07%)
       233,190      iTLB-loads                #    2.192 M/sec                    ( +-  3.73% )  (30.07%)
        17,837      iTLB-load-misses          #    7.65% of all iTLB cache hits   ( +- 13.21% )  (30.07%)
<not supported>      L1-dcache-prefetches
<not supported>      L1-dcache-prefetch-misses                                   

   0.106858870 seconds time elapsed                                          ( +-  0.07% )

Edit: I also checked that the md5sum of /usr/bin/sh is the same on both servers, and added the script header #!/usr/bin/sh; the result is the same as before.
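A minimal sketch of that check, run on both servers:

md5sum /usr/bin/sh /bin/bash   # hashes should match across S1 and S2
bash --version | head -1       # both report 4.2.6(1)-release
rpm -qf /bin/bash              # which package owns the binary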

Edit: I found some valuable differences using the command perf diff perf.data.s2 perf.data.s1.
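The record step isn't shown above; a plausible way to produce the two data files (one per server, same workload) is:

# on S1
perf record -o perf.data.s1 ./x.sh
# on S2
perf record -o perf.data.s2 ./x.sh
# copy both files to one machine, then:
perf diff perf.data.s2 perf.data.s1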

First, it shows some warnings:

/usr/lib64/ld-2.17.so with build id 93d2e4a501823d041413eeb652b89044d1f680ee not found, continuing without symbols
/usr/lib64/libc-2.17.so with build id b04a54c443d36058702ab4060c63f4ab3273eae9 not found, continuing without symbols

and I found that the rpm versions differ between the two servers.
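To compare the library packages and the build IDs from the warnings, something like this works on both servers (readelf comes with binutils):

rpm -q glibc                                         # package version
readelf -n /usr/lib64/libc-2.17.so | grep 'Build ID'
readelf -n /usr/lib64/ld-2.17.so | grep 'Build ID'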

The perf diff output shows:

# Event 'cycles'
#
# Baseline    Delta  Shared Object      Symbol
# ........  .......  .................  ..............................................
#
21.20%   +3.83%  bash               [.] 0x000000000002c0f0
10.22%           libc-2.17.so       [.] _int_free
 9.11%           libc-2.17.so       [.] _int_malloc
 7.97%           libc-2.17.so       [.] malloc
 4.09%           libc-2.17.so       [.] __gconv_transform_utf8_internal
 3.71%           libc-2.17.so       [.] __mbrtowc
 3.48%   -1.63%  bash               [.] execute_command_internal
 3.48%   +1.18%  [unknown]          [k] 0xfffffe0000032000
 3.25%   -1.87%  bash               [.] xmalloc
 3.12%           libc-2.17.so       [.] __strcpy_sse2_unaligned
 2.44%   +2.22%  [kernel.kallsyms]  [k] syscall_return_via_sysret
 2.09%   -0.24%  bash               [.] evalexp
 2.09%           libc-2.17.so       [.] __ctype_get_mb_cur_max
 1.92%           libc-2.17.so       [.] free
 1.41%   -0.95%  bash               [.] dequote_string
 1.19%   +0.23%  bash               [.] stupidly_hack_special_variables
 1.16%           libc-2.17.so       [.] __strlen_sse2_pminub
 1.16%           libc-2.17.so       [.] __memcpy_ssse3_back
 1.16%           libc-2.17.so       [.] __strcmp_sse42
 0.93%   -0.01%  bash               [.] mbschr
 0.93%   -0.47%  bash               [.] hash_search
 0.70%           libc-2.17.so       [.] __sigprocmask
 0.70%   -0.23%  bash               [.] dispose_words
 0.70%   -0.23%  bash               [.] execute_command
 0.70%   -0.23%  bash               [.] set_pipestatus_array
 0.70%           bash               [.] run_pending_traps
 0.47%           bash               [.] malloc@plt
 0.47%           bash               [.] var_lookup
 0.47%           bash               [.] fmtumax
 0.47%           bash               [.] do_redirections
 0.46%           bash               [.] dispose_word
 0.46%   -0.00%  bash               [.] alloc_word_desc
 0.46%   -0.00%  [kernel.kallsyms]  [k] _copy_to_user
 0.46%           libc-2.17.so       [.] __ctype_b_loc
 0.46%           bash               [.] new_fd_bitmap
 0.46%           bash               [.] add_unwind_protect
 0.46%   -0.00%  bash               [.] discard_unwind_frame
 0.46%           bash               [.] memcpy@plt
 0.46%           bash               [.] __ctype_get_mb_cur_max@plt
 0.46%           bash               [.] signal_in_progress
 0.40%           libc-2.17.so       [.] _IO_vfscanf
 0.40%           ld-2.17.so         [.] do_lookup_x
 0.27%           bash               [.] mbrtowc@plt
 0.24%   +1.60%  [kernel.kallsyms]  [k] __x64_sys_rt_sigprocmask
 0.23%           bash               [.] list_append
 0.23%           bash               [.] bind_variable
 0.23%   +0.69%  [kernel.kallsyms]  [k] entry_SYSCALL_64_stage2
 0.23%   +0.69%  [kernel.kallsyms]  [k] do_syscall_64
 0.23%           libc-2.17.so       [.] _dl_mcount_wrapper_check
 0.23%   +0.69%  bash               [.] make_word_list
 0.23%   +0.69%  [kernel.kallsyms]  [k] copy_user_generic_unrolled
 0.23%           [kernel.kallsyms]  [k] unmap_page_range
 0.23%           libc-2.17.so       [.] __sigjmp_save
 0.23%   +0.23%  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
 0.20%           [kernel.kallsyms]  [k] swapgs_restore_regs_and_return_to_usermode
 0.03%           [kernel.kallsyms]  [k] page_fault
 0.00%           [kernel.kallsyms]  [k] xfs_bmapi_read
 0.00%           [kernel.kallsyms]  [k] xfs_release
 0.00%   +0.00%  [kernel.kallsyms]  [k] native_write_msr
        +45.33%  libc-2.17.so       [.] 0x0000000000027cc6
         +0.52%  [kernel.kallsyms]  [k] __mod_node_page_state
         +0.46%  bash               [.] free@plt
         +0.46%  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
         +0.46%  bash               [.] begin_unwind_frame
         +0.46%  bash               [.] make_bare_word
         +0.46%  bash               [.] find_variable_internal
         +0.37%  ld-2.17.so         [.] 0x0000000000009b13

Maybe the glibc difference is the answer!
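Resolving the unknown libc symbol (the +45.33% at 0x0000000000027cc6) would need debug symbols; on an RPM-based system that is typically:

debuginfo-install glibc       # from yum-utils; installs glibc debug info
perf report -i perf.data.s2   # the hot libc address should now resolve to a symbol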

Edit: Finally, I checked the BIOS configuration and saw that the S2 server was using power-saving mode. This is the real answer!

But one BIOS option, MONITOR-MWAIT, confused me: even in "Max Performance Mode" with MONITOR-MWAIT enabled, the performance of S2 was still bad. With the command cpupower idle-info -o, I could see the CPU still using C-states, which "Max Performance Mode" is supposed to disable already. Only when MONITOR-MWAIT is disabled in addition to "Max Performance Mode" does the performance improve.

The description of "MONITOR-MWAIT" says some operating systems check this option to reuse the C-states, but I can't find how the Linux kernel uses it to change the C-state...
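Besides cpupower, the per-state C-state information can be read straight from sysfs; a quick sketch:

# name, exit latency, and enabled/disabled flag of each idle state on CPU 0
for d in /sys/devices/system/cpu/cpu0/cpuidle/state*
do
  echo "$(cat $d/name): latency=$(cat $d/latency)us disabled=$(cat $d/disable)"
done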

  • Are you sure that this is the output for the above script? It consists of only 1M very simple iterations, but the output lists billions of branches and 5 seconds of total time. – gudok Jan 17 '19 at 07:58
  • Yes, I'm sure! I think that because the script is a shell script, it is interpreted into many more instructions; it also calls into the kernel, and the kernel has more branches. – blue Jan 17 '19 at 08:08
  • Do you have the same bash version on both? Branches and branch-misses are very similar on both; it's only `cycles` that's very different (leading to a different number of branches *per second* and per cycle, but not a different number of branches or a different miss rate). Note that this probably doesn't make any system calls, it's just `bash` (or some other shell? not `dash` though) running in user-space. `(( ))` is a built-in operator, not something it forks for. – Peter Cordes Jan 17 '19 at 08:48
  • Please provide a [mcve] that includes compilation options and compiler versions as well as the specific `perf` command line. – Zulan Jan 17 '19 at 09:06
  • @PeterCordes I think you are right! And the shell version on both is 4.2.6(1)-release. So the key is why the E5-2630 can run 2 insn per cycle. I checked the details on [cpu-world](http://www.cpu-world.com/Compare/404/Intel_Xeon_E5-2620_v4_vs_Intel_Xeon_E5-2630_v4.html) and can't understand why the difference is so big! – blue Jan 17 '19 at 09:16
  • @Zulan 1. Create a file "x.sh": `i=0\n while (( i < 1000000 ))\n do (( i ++ ))\n done` 2. perf stat ./x.sh 3. **PS:** because newlines can't be shown here, I mark them with \n – blue Jan 17 '19 at 09:20
  • Are the bash and libc binaries actually identical, though? You still haven't made the [mcve] Zulan asked for: you're showing the input to /bin/bash, but not how the binary itself was compiled. The same version compiled with different compilers could explain it, if there's one very hot loop that compiled differently. Use `perf record` / `perf report` to find the hotspots, if there are any (and show that in the question). It might just be something spread over a large region of code, though. – Peter Cordes Jan 17 '19 at 09:30
  • Do your systems have the same amount of L3 cache? Are you getting cache misses? Is the DRAM setup the same on both machines? Are they configured with the same snoop mode in the BIOS? Is there load on other cores that would compete for memory bandwidth? Is there load on the other logical core (hyperthreading) for one machine, giving the logical core running bash only half of some execution resources, with competitive sharing of others? – Peter Cordes Jan 17 '19 at 09:32
  • @PeterCordes The servers have no load other than this test; I'll carefully check the other configuration later. Thanks for the reminder! – blue Jan 17 '19 at 09:42
  • A shell script is not a good base for reproducible benchmarking... At least you would have to provide the shell and its version... Please [edit] your post to include that information, do not put it (only) in comments. – Zulan Jan 17 '19 at 09:46
  • Is `/bin/sh` the same binary on both systems? Check with `md5sum /bin/sh` or something. Your question seems to assume that the shell script itself is executed directly by the CPU. It's not. When `perf` starts your script, it's actually running `/bin/sh ./x.sh`. Since you left out the `#!/bin/sh` line, presumably it figures this out on its own, or uses the `system()` C function instead of the kernel doing it inside an `execve()` system call. – Peter Cordes Jan 18 '19 at 03:03
  • Now there's a significant difference in the total instruction count. `252M` vs. `260M`. Presumably there's some difference in software (libraries?) or environment variables that leads to `sh` running different instructions. IDK if maybe bash is copying the environment for every built-in command. But the relative difference is larger with this much shorter test, so maybe it's doing something different at start-up. And that somehow leads to making something else slower later? There are minor differences in lots of very small counts. But 8 thousand extra dTLB misses doesn't explain it. – Peter Cordes Jan 18 '19 at 07:54
  • Is the output of `set` or `env` different on your two systems? Where are the hotspots if you use `perf record`? – Peter Cordes Jan 18 '19 at 07:57
  • Hmm, the `iTLB-loads` number is a lot different. Maybe the slower system is getting a lot of uop-cache misses for some reason. Definitely something to profile for, if you want to dig into bash internals more and figure out where it can slow down (even more than usual) when running this nearly meaningless shell loop. – Peter Cordes Jan 18 '19 at 08:00
  • Making more system calls could certainly explain it. With Spectre/Meltdown mitigation enabled, there's significant overhead to a system call, making everything slow for a while after it returns because of TLB evictions. Maybe bash is allocating / freeing memory, and on the slow machine it decides to return that memory to the OS every time instead of keeping it in a free list. If you install debug symbols for libc, can you profile that huge time spent in libc? – Peter Cordes Jan 18 '19 at 12:02
  • @blue, can you show similar effect with more reproducible test than bash? For example `echo 2^1234567%2 | perf stat /usr/bin/bc`? Is the OS same on both machines? You may also try adding :u to every event in stat `perf stat -e cycles:u,instructions:u,...` to profile userspace only. Also: try to reboot servers just before test. Are they NUMA or single-socket systems? Also: what is hyperthreading settings on both servers? – osgx Jan 19 '19 at 14:40

1 Answer


I found the answer.

First, let's look at the MONITOR/MWAIT BIOS option under kernel 4.18.7. That kernel uses the intel_idle driver, which only checks whether the CPU supports the MWAIT instruction; it does not care whether C-states are enabled in the BIOS. As long as MONITOR/MWAIT is enabled, the intel_idle driver takes over and forces the use of C-states, which effectively behaves like power-saving mode.
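You can confirm which cpuidle driver is loaded, and intel_idle can also be kept from taking over with a standard kernel boot parameter instead of the BIOS switch:

cat /sys/devices/system/cpu/cpuidle/current_driver   # prints intel_idle here
# alternative to disabling MONITOR/MWAIT in the BIOS: add this to the
# kernel command line to disable the intel_idle driver entirely
#   intel_idle.max_cstate=0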

Second, why is insn per cycle different? Because the tuned service is running with the active profile "latency-performance", whose force_latency is 1 us. With C-states in use, only the C-state levels whose latency is less than force_latency are eligible:

# cpupower idle-info
CPUidle driver: intel_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 5
Available idle states: POLL C1 C1E C3 C6
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 13034605
Duration: 820867557
C1:
Flags/Description: MWAIT 0x00
Latency: 2
Usage: 349471619
Duration: 344311623672
C1E:
Flags/Description: MWAIT 0x01
Latency: 10
Usage: 237
Duration: 55999
C3:
Flags/Description: MWAIT 0x10
Latency: 40
Usage: 350
Duration: 168988
C6:
Flags/Description: MWAIT 0x20
Latency: 133
Usage: 3696
Duration: 17809893

You can see that only the POLL level has a latency below 1 us, and the POLL level forces the CPU to spin, executing NOP instructions instead of halting. In this situation, with Hyper-Threading enabled, instruction execution speed falls by roughly half: the two logical cores share the physical core's execution resources, and while one of them spins in the polling loop, the other has to wait for it.
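A quick way to see which logical CPUs share a physical core (and therefore compete when one of them polls):

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list   # e.g. 0,16
lscpu --extended   # the CPU/CORE/SOCKET columns show the full mapping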

And if you disable the MONITOR/MWAIT option, the intel_idle driver is disabled too, so the force_latency of the tuned service no longer applies; the idle logical core halts, which lets the other one use the execution units exclusively.
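For reference, the tuned side can be verified like this (the tuned.conf path assumes RHEL's tuned package layout):

tuned-adm active                                                   # -> latency-performance
grep force_latency /usr/lib/tuned/latency-performance/tuned.conf   # force_latency=1
# a profile without force_latency allows deeper C-states again:
tuned-adm profile throughput-performance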

Finally, thanks everybody, particularly @Peter Cordes and @osgx, for making me check the BIOS; and the command echo 2^1234567%2 | bc is very beautiful!

  • Hyperthreading shares all four ALUs, but NOPs don't use an ALU execution port. The problem is sharing *front-end* bandwidth (including the narrowest part of the pipeline, 4-wide issue/rename). As well as partitioning some other execution resources, so yeah you definitely want at least one logical core per phys core to be in a sleep state. – Peter Cordes Aug 15 '21 at 14:22