
I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. On an otherwise relatively idle system, I ran the following `perf` command for around 5 seconds against each workload. The counters are `offcore_response.all_data_rd.l3_miss.local_dram` and `offcore_response.all_code_rd.l3_miss.local_dram`:

    sudo perf stat -e offcore_response.all_data_rd.l3_miss.local_dram,offcore_response.all_code_rd.l3_miss.local_dram -p <PID>

The workloads are: 1) playing a video in VLC and 2) running the KDevelop indexer on a large code base. The outputs are shown below:

VLC:

    Performance counter stats for process id '14617':

         1,621,980      offcore_response.all_data_rd.l3_miss.local_dram                                   
         1,611,825      offcore_response.all_code_rd.l3_miss.local_dram                                   

       4.993841802 seconds time elapsed

KDevelop:

    Performance counter stats for process id '23294':

        31,006,390      offcore_response.all_data_rd.l3_miss.local_dram                                   
        10,236,222      offcore_response.all_code_rd.l3_miss.local_dram                                   

       5.095681532 seconds time elapsed

Based on these statistics, KDevelop's memory access count (data reads plus code reads) is more than 12 times that of VLC.
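The ratio quoted above follows directly from the two `perf stat` outputs. A minimal sketch of the arithmetic in Python; the 64-byte cache-line size used to turn counts into an approximate read rate is an assumption about what each counted response transfers, not something `perf` reports:

```python
# Counts and elapsed times copied from the two perf stat outputs above.
vlc = {"data_rd": 1_621_980, "code_rd": 1_611_825, "seconds": 4.993841802}
kdevelop = {"data_rd": 31_006_390, "code_rd": 10_236_222, "seconds": 5.095681532}

CACHE_LINE_BYTES = 64  # assumption: one counted response == one 64-byte line


def total_reads(run):
    """Sum of data-read and code-read L3 misses served from local DRAM."""
    return run["data_rd"] + run["code_rd"]


def approx_read_mb_per_s(run):
    """Rough DRAM read rate implied by the counts, in MB/s (10**6 bytes)."""
    return total_reads(run) * CACHE_LINE_BYTES / run["seconds"] / 1e6


ratio = total_reads(kdevelop) / total_reads(vlc)
print(f"KDevelop / VLC offcore reads: {ratio:.2f}x")
print(f"VLC      ~{approx_read_mb_per_s(vlc):.0f} MB/s")
print(f"KDevelop ~{approx_read_mb_per_s(kdevelop):.0f} MB/s")
```

Under that assumption, the per-process counts correspond to roughly 40 MB/s of reads for VLC and roughly 520 MB/s for KDevelop.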

However, the IMC counter statistics (retrieved using PCM) are at odds with the performance counters above. On the idle system, the total system bandwidth is around 2.65GB (READ: 2.30GB, WRITE: 0.35GB). The total system bandwidth for each workload (run separately) is as follows:

VLC:

around 8.40GB (READ: 4.65GB, WRITE: 3.75GB)

KDevelop:

around 3.75GB (READ: 3.15GB, WRITE: 0.60GB)

After subtracting the idle system bandwidth, the VLC and KDevelop bandwidths are around 5.75GB and 1.10GB, respectively. By this measure, VLC's memory traffic is more than 5 times that of KDevelop, which directly contradicts the perf counters.
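The subtraction can also be done per direction. This is a minimal sketch that assumes the idle traffic is simply additive to each workload's traffic, which the PCM numbers themselves do not guarantee; the helper name `above_idle` is made up for illustration:

```python
# Bandwidth figures as reported by PCM above (units as reported, "GB").
idle = {"read": 2.30, "write": 0.35}
vlc = {"read": 4.65, "write": 3.75}
kdevelop = {"read": 3.15, "write": 0.60}


def above_idle(run, baseline):
    """Per-direction bandwidth with the idle baseline subtracted.

    Assumption: idle traffic adds linearly to the workload's own traffic.
    """
    return {k: round(run[k] - baseline[k], 2) for k in run}


vlc_net = above_idle(vlc, idle)
kdev_net = above_idle(kdevelop, idle)

vlc_total = round(sum(vlc_net.values()), 2)
kdev_total = round(sum(kdev_net.values()), 2)

print("VLC net:", vlc_net, "total:", vlc_total)
print("KDevelop net:", kdev_net, "total:", kdev_total)
print(f"VLC / KDevelop, read + write: {vlc_total / kdev_total:.2f}x")
print(f"VLC / KDevelop, reads only:   {vlc_net['read'] / kdev_net['read']:.2f}x")
```

Split this way, writes make up 3.40GB of VLC's 5.75GB above idle, and the two offcore events used here count only reads. Comparing reads alone narrows the gap to under 3x, although VLC still reads more than KDevelop at the memory controller.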

How can these two results be reconciled?

Peter Cordes
TheAhmad
  • `all_data_rd` is measuring reads, not writes. But that's still inconsistent with the memory-controller read bandwidth. The iGPU uses some memory bandwidth itself, and that traffic doesn't come from a core so won't be part of any core's `offcore_response`s. But I'd expect that to mostly be extra writes, with reads for scan-out being the same as for idle, depending on what video-output driver you're using and how much copying it does (e.g. scaling and post-processing with either VDPAU stuff or pixel shaders for higher quality like `mpv` uses by default.) – Peter Cordes Aug 13 '23 at 01:37
  • @PeterCordes If the **extra** accesses originate from the **GPU**, why is the `I/O` IMC counter always **stuck** at a constant value? – TheAhmad Aug 13 '23 at 03:31
  • Because the GPU and memory controllers are all built into the CPU package. The iGPU talks to the memory controller over the internal ring bus, not over an external PCIe link, so it's not I/O. (I'm assuming your laptop is using its iGPU because I've heard `intel_gpu_top` doesn't work on a system without an iGPU. If you also have a discrete GPU, disabling the iGPU in the BIOS might give different results. Some hybrid setups I think have the discrete GPU copy back to the iGPU's video RAM for display, which would cost bandwidth, so just rendering on the discrete GPU might not avoid IMC B/W.) – Peter Cordes Aug 13 '23 at 03:32
  • @PeterCordes A **turned-off** screen **reduces** the offcore counter values (**both** user and kernel). How can this be explained? Why do the **CPU-initiated** memory accesses **change** in the **turned-off** scenario? – TheAhmad Aug 13 '23 at 06:02
  • Both **code** and **data** counter values are **reduced**. It seems that the **execution** of some pieces of code and their **associated** data accesses **depend** on the screen **state** (i.e., being turned **on/off**). – TheAhmad Aug 13 '23 at 06:15
  • The "offcore" counters might actually be counting all L3 traffic, including cacheable accesses from the iGPU? I don't know, that's just a guess; some events aren't really core-specific like I think the `unc_*` events, so profiling one program can be disturbed by activity on other cores (or presumably the iGPU). That could perhaps be the case for the `offcore_` events, too? And/or there could be errata that affect things. Or perhaps VLC (or actually the VDPAU library or kernel driver) notices the screen is off and does less work. – Peter Cordes Aug 13 '23 at 06:55
  • @PeterCordes `perf stat -e instructions` shows **little** difference for the turned **on/off** cases (less than **10 percent**). I also checked the **set of accessed** code pages (using a *Pintool*). They seem to be the **same** in both cases. But memory access stats change, **dramatically** (nearly **4x** for **code**, and **2x** for **data**). I will check with **other** apps. – TheAhmad Aug 13 '23 at 07:04
  • The results for `KDevelop` are **OK**. The offcore counter values (**both** user and kernel) are almost **the same** in **both** scenarios (screen **on/off**). – TheAhmad Aug 13 '23 at 07:22
  • I also **counted** the number of the executed instructions using another `Pintool`. Turning **off** the screen has **no** effect on the instructions count for `VLC` (**confirms** the similar result obtained from `Perf`). **Similar** instructions cause **smaller** offcore access counts when the screen is **off**!! – TheAhmad Aug 14 '23 at 04:46
  • During a **5 second video streaming** in VLC, **sampling** `mem_load_uops_retired.l3_miss:uppp` (period 1000) shows **184** and **1** instance(s) of `ThreadDisplayRenderPicture()` in the backtraces of the **turned-on** and **turned-off** screen scenarios, respectively. This function is called by a **dedicated thread** called `ThreadDisplayRenderPicture`. In other words, it seems that `ThreadDisplayRenderPicture` **stops** rendering pictures when the screen is **off**. I **didn't** do a **detailed** analysis on the sampled backtraces. Perhaps the two scenarios have **more** differences. – TheAhmad Aug 15 '23 at 03:01
  • To be more **precise**, **184 out of 285** samples **versus** **1 out of 83** samples. – TheAhmad Aug 15 '23 at 05:56
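The sample counts in the last two comments can be turned into shares. A minimal sketch; the dictionary layout and the helper name `render_share` are made up for illustration, and only the four counts come from the comments:

```python
# Sample counts quoted in the comments above
# (mem_load_uops_retired.l3_miss:uppp, sampling period 1000).
runs = {
    "screen on": {"in_render": 184, "total": 285},
    "screen off": {"in_render": 1, "total": 83},
}


def render_share(run):
    """Fraction of samples whose backtrace contains ThreadDisplayRenderPicture()."""
    return run["in_render"] / run["total"]


def other_samples(run):
    """Samples whose backtrace does not contain the render function."""
    return run["total"] - run["in_render"]


for name, run in runs.items():
    print(f"{name}: {render_share(run):.1%} in ThreadDisplayRenderPicture(), "
          f"{other_samples(run)} samples elsewhere")
```

The samples outside the render function number 101 with the screen on and 82 with it off, so on these figures most of the drop in sampled L3-miss loads sits in the render thread. With so few samples this is only indicative.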

0 Answers