1

Does calling the C function system() affect the hardware counters when you are trying to measure how the command you ran performed? For example, say I'm using the Performance API (PAPI) and the program is a precompiled matrix multiplication application:

PAPI_start_counters();
system("./matmul");
PAPI_read_counters();
//Print out values 
PAPI_stop_counters();

I am obviously leaving out some details, but what I am trying to find out is whether it is possible, using these counters, to measure the performance of a program I'm running. In my tests I get wild numbers like the ones below. They are obviously wrong; I just want to find out why.

Total Cycles =========== 140733358872510 
Instructions Completed =========== 4203968 
Floating Point Instructions =========== 0 
Floating Point Operations =========== 4196867 
Loads =========== 140733358872804 
Stores =========== 4204037 
Branches Taken =========== 15774436 
AstroCB
AshVLSI

3 Answers

4

system() is a very slow function in general. On Linux, it forks and executes a full shell process (/bin/sh), which parses your command and then spawns the target program. Starting these two programs means loading their code into memory, initializing all their libraries, executing startup code, and so on. Only then does the program's own code actually start executing.
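As a rough illustration of the steps described above, the following sketch mimics what system() does on a typical Linux system. The name run_via_shell is hypothetical, and the real system() also adjusts signal handling while the command runs, which is left out here.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for system(): run `cmd` through /bin/sh and
 * return the wait status, or -1 if the child could not be created
 * or waited for. */
int run_via_shell(const char *cmd)
{
    pid_t pid = fork();               /* 1. duplicate the calling process */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* 2. child: replace itself with a shell, which parses `cmd`
         *    and then starts the target program */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);                   /* only reached if exec failed */
    }

    /* 3. parent: block until the shell has exited */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

With run_via_shell("./matmul"), a shell has to be created and then has to launch matmul before any matrix code runs, and all of that work happens outside the calling process.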

Because of the unpredictability of disk access and Linux process scheduling, timing system() calls has a very high inherent variability. Therefore, you won't get accurate results even if you use a high-performance counter.

The better solution would be to compile the target program as a library instead. Load it before initializing your counters, then just execute the main function from the library. That way, all the code executes in your process, and you have negligible startup time. Your performance numbers will be much more precise this way.

nneonneo
2

Do you have access to the code of matmul? If so, it's much more precise to instrument and measure only the code you're interested in. That means you wrap only those instructions (or C statements) in counters that you want to measure.
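A minimal sketch of that idea, assuming the source is available and the interesting part is a plain triple-loop multiply. The names matmul_kernel, cpu_seconds and measure_kernel are hypothetical, and a POSIX CPU-time clock stands in for the counter calls so that the example does not depend on PAPI being installed; with PAPI, the start and read/stop calls would go where the two clock reads are.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel standing in for the part of matmul being measured:
 * C = A * B for n x n row-major matrices. */
void matmul_kernel(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* CPU time consumed by this process so far, in seconds. */
static double cpu_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Measure only the kernel: allocation, input setup and printing all
 * stay outside the measured window. */
double measure_kernel(size_t n, const double *a, const double *b, double *c)
{
    double start = cpu_seconds();     /* counters would be started here */
    matmul_kernel(n, a, b, c);
    return cpu_seconds() - start;     /* ...and read or stopped here */
}
```

Because the measured window contains nothing but the kernel call, process creation, library loading and output formatting cannot leak into the result.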

For more information see:

  • Related discussion here
  • Intel® Performance Counter Monitor here
  • Performance measurements with x86 RDTSC instruction here

As stated above, measuring using PAPI to wrap system() invocations carries way too much process overhead to give you any idea of how fast your math code is actually running.

Jens
  • I just used matmul as an example. I'm running various compiled programs through system() and I am getting similar numbers. I went ahead and wrote the PAPI code inside a quicksort program and got similar numbers too. So now I think it may be how I am using PAPI, in addition to the high overhead of system() that was mentioned above. – AshVLSI Apr 26 '14 at 21:02
  • What are you looking for? Execution time of a piece of code, or instruction counts? – Jens Apr 26 '14 at 21:20
  • Loads, stores, branches taken, instruction counts, floating point operations, things of that nature. I'm falling back to the perf command; it works nicely. – AshVLSI Apr 26 '14 at 22:14
  • That's a tricky one, as it depends on your CPU and on whether the code and the data it uses have settled in the caches. Memory accesses are harder to measure because the cost depends on L1 and L2 hits and misses, page crossings and page faults. Branches depend on predictability and on how the CPU works. Then there is the challenge of finding a timer that is fine-grained enough but also takes OS overhead out. And don't forget the affinity of your execution, which directly impacts your cache pollution. It's a complex matter :) [This](http://aufather.wordpress.com/2010/09/08/high-performance-time-measuremen-in-linux/) might be a start. – Jens Apr 26 '14 at 22:46
0

The numbers you are getting are odd, but not necessarily wrong. The huge disparity between the instructions completed and the cycles probably indicates that the executable "matmul" spends a lot of time waiting for external events (e.g. disk I/O) to complete. I do not know the specifics of the FP Instructions and FP Operations counters, but if PAPI reports those values separately, it has a reason.

What is interesting is that the loads and cycles are obviously connected, as are the instructions, FP ops and stores.
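To make that connection concrete, the sketch below prints the values from the question in hexadecimal together with the gaps between them. The helper names are hypothetical. Cycles and loads differ by only 294 and share the same upper 32 bits (0x7FFF), while instructions and stores differ by 69. One possible reading, an assumption rather than anything confirmed here, is that values this close together are leftover memory contents such as uninitialized variables rather than real event counts.

```c
#include <stdio.h>

/* Upper 32 bits of a 64-bit value. */
unsigned long long high_word(unsigned long long v)
{
    return v >> 32;
}

/* Print the values reported in the question in hex, plus the gaps
 * between the pairs that appear related. */
void print_reported_values(void)
{
    const unsigned long long cycles       = 140733358872510ULL;
    const unsigned long long loads        = 140733358872804ULL;
    const unsigned long long instructions = 4203968ULL;
    const unsigned long long fp_ops       = 4196867ULL;
    const unsigned long long stores       = 4204037ULL;

    printf("cycles       = %#llx\n", cycles);
    printf("loads        = %#llx\n", loads);
    printf("instructions = %#llx\n", instructions);
    printf("fp ops       = %#llx\n", fp_ops);
    printf("stores       = %#llx\n", stores);

    printf("loads - cycles        = %llu\n", loads - cycles);
    printf("stores - instructions = %llu\n", stores - instructions);
    printf("instructions - fp ops = %llu\n", instructions - fp_ops);
}
```

Checking the return value of every PAPI call and zero-initializing the result array before reading the counters would help tell real counts apart from such leftovers.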

I would have to know about the internals of "matmul" in order to give you a better description.