Does anyone know why perf always shows _raw_spin_unlock_irqrestore or some other spin-unlock function? The spin_unlock implementation is usually simpler than spin_lock. If there is heavy contention on the spinlock, shouldn't perf show the result in spin_lock instead?
3 Answers
When interrupts are disabled, the timer interrupt perf uses for sampling doesn't trigger. When interrupts are finally re-enabled (in any irqrestore function), perf incorrectly attributes the entire interrupts-disabled span to that restore function.
If you take a deeper look using perf annotate, you'll see that most (if not all) of the samples land one opcode after the popfq, which restores the EFLAGS register and thereby re-enables interrupts:
_raw_spin_unlock_irqrestore /proc/kcore
Percent│ Disassembly of section load0:
│
│ ffffffff819bd790 <load0>:
│ nop
│ push %rbp
│ mov %rsp,%rbp
│ movb $0x0,(%rdi)
│ nop
│ mov %rsi,%rdi
│ push %rdi
│ popfq
100.00 │ nop
│ pop %rbp
│ ← retq
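The annotation above can be reproduced along these lines (a sketch: it assumes perf is installed and has permission to profile the kernel, which typically requires root):

```shell
# Sample the whole system for a few seconds, then annotate the unlock
# path. Guarded so the snippet degrades gracefully where perf is
# unavailable or lacks permissions.
if command -v perf >/dev/null 2>&1; then
    perf record -a -g -- sleep 5 || echo "perf record failed (permissions?)"
    perf annotate --stdio _raw_spin_unlock_irqrestore || true
else
    echo "perf not installed"
fi
```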
Linux tries to use an NMI for the performance monitoring interrupt. This change has been in mainline since v2.6.31, but I guess it's meant for bare-metal kernels, not for kernels running as VMs (i.e., perf record on my bare-metal Linux machine does not exhibit this issue, while perf record on a KVM running on my machine does).
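One quick way to check which flavor of interrupt the PMU is using on your machine (a sketch; the exact counter names vary by architecture):

```shell
# On x86 bare metal, the per-CPU NMI counters climb while "perf record"
# runs, because the sampling interrupt is delivered as an NMI. Inside a
# VM without a paravirtual PMU, they typically stay flat.
grep -E '^ *NMI:' /proc/interrupts \
    || echo "no NMI counter visible (non-x86 kernel, or running in a VM?)"
```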
See this answer for more details.

What workload are you running? Are you sure there is contention in the first place?
irq_restore shows up because re-enabling interrupts is costly, but locks which play with interrupts don't show up very often. You are most likely to see it when the machine is largely idle.
For kicks, I ran a trivial workload which contends on a spinlock, and unsurprisingly it was the locking routine which showed up the most:
81.36% [kernel] [k] native_queued_spin_lock_slowpath
4.67% libc-2.17.so [.] __memset_sse2
1.63% [kernel] [k] find_next_zero_bit
0.92% [kernel] [k] _raw_spin_lock
0.81% [kernel] [k] __audit_syscall_exit
0.76% [kernel] [k] __alloc_fd
0.69% [kernel] [k] __slab_free
0.62% [kernel] [k] security_compute_sid.part.13
0.45% [kernel] [k] kmem_cache_free
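The answer doesn't show the workload it used, but a crude approximation might look like this (hypothetical: a few shell processes opening and closing file descriptors in a tight loop, which exercises the fd-table paths behind __alloc_fd in the profile above):

```shell
# Spawn a few processes that each open and close /dev/null many times;
# run "perf top" in another terminal while this runs to see which
# kernel lock routines the time is being spent in.
for p in 1 2 3 4; do
    ( n=0
      while [ "$n" -lt 20000 ]; do
          : < /dev/null     # redirection opens + closes a file descriptor
          n=$((n + 1))
      done ) &
done
wait
echo "workload finished"
```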
I just ran a random-read test case on my local machine and used perf to check the contention status of the newest blk-mq. In the result, besides the native_queued_spin_lock_slowpath path, which should be invoked by the I/O scheduler, there is also a _raw_spin_unlock_irqrestore path, especially in the result generated by the FlameGraph. – JC.wang Aug 23 '17 at 05:23
@employee of the month A fragment of my perf top result:
3.32% [kernel] [k] native_queued_spin_lock_slowpath
3.18% [kernel] [k] update_load_avg
3.13% [kernel] [k] __switch_to
3.12% [kernel] [k] native_write_msr
3.02% [kernel] [k] __sched_text_start
2.81% [kernel] [k] _raw_spin_lock
2.20% [kernel] [k] _raw_spin_lock_irqsave
1.97% [kernel] [k] switch_mm_irqs_off
1.70% [kernel] [k] __blkdev_direct_IO_simple
1.69% [kernel] [k] __get_user_pages_fast
And this is my FlameGraph result (image omitted).

Flame graphs as pictures are not very useful. Assuming you were running the same workload and the result is repeatable, you can see the CPU time was *not* being spent in irqrestore. I have a suspicion what's going on: the kernel has dedicated interrupt stacks. When the routine performs irqrestore, all queued interrupts are then handled, but the reporting mode that gathers backtraces does not inspect the interrupt stack, so you only get a dump with irqrestore on top. – Aug 23 '17 at 07:57
Hmm, it seems the FlameGraph is not so accurate. From the perf top result, _raw_spin_unlock_irqrestore() didn't cost that much CPU. Thanks for your kind suggestion. – JC.wang Aug 23 '17 at 09:03