I am analyzing the performance of an application running on an IBM POWER8 server, following the CPI breakdown model for POWER8.

I understand that I need to reduce the percentage of stalls caused, for example, by cache misses (PM_CMPLU_STALL_DCACHE_MISS) or branch mispredictions (PM_CMPLU_STALL_BRU). The POWER7 performance analysis tutorial states that a well-written application has a high percentage of cycles in which instructions complete (PM_1PLUS_PPC_CMPL).

Do I understand correctly that for POWER8 I need to maximize the percentage of the PM_GRP_CMPL metric? What other PMU-based metrics should I try to maximize?

Alexander Pozdneev

Answer

Pointing out the obvious: you need to optimize your source code to minimize PM_RUN_CYC, the number of cycles it takes for your software task to complete.

The reference you gave breaks down PM_RUN_CYC as PM_CMPLU_STALL + PM_GCT_NOSLOT_CYC + PM_GRP_CMPL.
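Written out as an equation, with each component's share expressed as a percentage of total run cycles (the percentage form is an illustration added here, not taken from the reference):

```latex
\mathrm{PM\_RUN\_CYC} = \mathrm{PM\_CMPLU\_STALL} + \mathrm{PM\_GCT\_NOSLOT\_CYC} + \mathrm{PM\_GRP\_CMPL}

\text{share}(X) = 100 \times \frac{X}{\mathrm{PM\_RUN\_CYC}}, \qquad
X \in \{\mathrm{PM\_CMPLU\_STALL},\ \mathrm{PM\_GCT\_NOSLOT\_CYC},\ \mathrm{PM\_GRP\_CMPL}\}
```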

You would want to reduce whichever of the three components contributes the most. Minimize stalls, for example, by reorganizing your code to reduce cache misses. The "no slot" cycles are related to branch mispredictions and instruction cache misses.
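A minimal sketch of "find the largest contributor", assuming you have already collected raw counts for PM_RUN_CYC and its three components (however you gather them; collection is not shown). The function names and the sample numbers are hypothetical, for illustration only, not measured data.

```python
from typing import Dict, Tuple

# Top-level components of PM_RUN_CYC in the POWER8 CPI breakdown cited in the question.
COMPONENTS = ("PM_CMPLU_STALL", "PM_GCT_NOSLOT_CYC", "PM_GRP_CMPL")


def cpi_breakdown(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each component's share of PM_RUN_CYC, in percent (hypothetical helper)."""
    run_cyc = counts["PM_RUN_CYC"]
    if run_cyc <= 0:
        raise ValueError("PM_RUN_CYC must be positive")
    return {name: 100.0 * counts[name] / run_cyc for name in COMPONENTS}


def largest_contributor(counts: Dict[str, int]) -> Tuple[str, float]:
    """Return the component with the largest share of PM_RUN_CYC and that share."""
    shares = cpi_breakdown(counts)
    name = max(shares, key=shares.get)
    return name, shares[name]


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only (not measured data).
    sample = {
        "PM_RUN_CYC": 1_000_000,
        "PM_CMPLU_STALL": 550_000,
        "PM_GCT_NOSLOT_CYC": 150_000,
        "PM_GRP_CMPL": 300_000,
    }
    for name, pct in cpi_breakdown(sample).items():
        print(f"{name:18s} {pct:5.1f}%")
    print("largest:", largest_contributor(sample))
```

With these made-up numbers, stalls dominate, so the stall sub-events (such as PM_CMPLU_STALL_DCACHE_MISS) would be the next place to look. Real counter values may not sum exactly to PM_RUN_CYC if events are multiplexed, so treat the shares as approximate.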

The event description for PM_GRP_CMPL says: "Microcoded instructions that span multiple groups will generate this event once per group". It is not clear what this tells you. In any case, you want to minimize these counts, not maximize them.

Unheilig
B Abali