
I have a program with an inner loop that needs to be very, very fast because of the number of iterations it performs. To profile this code I have been using valgrind/callgrind, and I find it to be a wonderful tool. Unfortunately, my optimization efforts have taken me into newer instruction sets like FMA (Intel) / FMA4 (AMD), and whenever I use these, callgrind blows up because it does not support those instructions.
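To give a rough idea of the hot path, here is a made-up sketch in the same spirit (the real code is different and this function is only illustrative); it is the FMA4 intrinsic call, enabled by -mfma4, that callgrind refuses to decode:

#include <stddef.h>
#include <x86intrin.h>  /* pulls in the FMA4 intrinsics when built with -mfma4 */

/* Illustrative dot product: the accumulate is done with one fused
   multiply-add per 4 doubles. n is assumed to be a multiple of 4. */
double dot_fma4(const double *a, const double *b, size_t n) {
    __m256d acc = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        acc = _mm256_macc_pd(va, vb, acc);  /* acc = va * vb + acc */
    }
    double tmp[4];
    _mm256_storeu_pd(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}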

I understand that one solution is to simply not use those intrinsics and make the compiler emit code that does not contain those instructions, but honestly I see no point in that: I want to profile the code as it is, not as valgrind can handle it.

This brings me to my question. Are there any open source or free profilers out there that can do as good a job as valgrind/callgrind? I know about gprof, but as I understand it, it essentially just stops the program at intervals and sees where it is and counts the number of times it sees each thing, which is like tearing out an eye compared to what callgrind gives me.

James Matta
  • That's odd; according to the 3.9 release notes, "Support for Intel AVX2 instructions. This is available only on 64 bit code." The bug tracker may know more: https://bugs.kde.org/show_bug.cgi?id=273475 – usr1234567 Apr 02 '14 at 05:08
  • Hmm, I wonder if it is because it is AMD instead of Intel, so I am not using the normal fused multiply-add instructions but the AMD version, FMA4? – James Matta Apr 02 '14 at 05:11
  • Sounds reasonable, but I don't know. Ask the Valgrind guys or open a bug report. Have you tried the latest development branch? It could already be fixed, as the 3.9 release was in October. – usr1234567 Apr 02 '14 at 05:18
  • @user2799037: In fact, I just figured out it is definitely the FMA4 instructions: switching to an older version of the code but leaving the compiler flag -mfma4 makes the error happen again (with aggressive optimization on). Turning off the flag makes the code work in callgrind again. I will test the dev version and see what happens; if all else fails I will open a bug report, but it would still be nice to have another program to fall back on while that is being fixed. – James Matta Apr 02 '14 at 05:21
  • You can use Quantify. It's not free/open source, but even the evaluation version is decent as long as you are only profiling for yourself. The free version won't let you save the data, but it shows very good reports. – peeyush Apr 02 '14 at 05:22
  • @user2799037: It's present in the current dev version too. Writing up a bug report now. – James Matta Apr 02 '14 at 05:50

1 Answer


I would probably stick with valgrind/callgrind:

Trying out the compile flags -mavx and -mfma4 causes issues for me too, on different processors: FMA4 is primarily an AMD feature (although support for it has been filtering into Intel chips), whereas AVX is primarily an Intel feature (with support filtering into AMD chips). In benchmarks, however, AVX on AMD, where supported, actually performs slower than SSE1/2/3/4 (FMA4 fills in for parts of the proposed SSE5).

Using both optimisations is perhaps not the best approach and may well lead to the behaviour you are experiencing, as they effectively stand in opposition to each other, being primarily designed for different brands of processor. Try removing -mfma4 if you are compiling for an Intel CPU that supports AVX, and use -mfma4 if you are compiling for an AMD processor that supports FMA4.
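As a rough sketch of keeping the two apart (the wrapper name is invented; GCC defines __FMA__ when you pass -mfma and __FMA4__ when you pass -mfma4), you could select the variant at compile time and fall back to plain SSE2 otherwise:

#include <immintrin.h>   /* AVX/FMA3 intrinsics */
#ifdef __FMA4__
#include <x86intrin.h>   /* FMA4 intrinsics */
#endif

/* One multiply-add wrapper, resolved at compile time from the flags used. */
static inline __m128d madd_pd(__m128d a, __m128d b, __m128d c) {
#if defined(__FMA__)
    return _mm_fmadd_pd(a, b, c);            /* FMA3: -mfma (Intel Haswell and later)   */
#elif defined(__FMA4__)
    return _mm_macc_pd(a, b, c);             /* FMA4: -mfma4 (AMD Bulldozer and later)  */
#else
    return _mm_add_pd(_mm_mul_pd(a, b), c);  /* SSE2 fallback: separate multiply + add  */
#endif
}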

That having been said, the compiler will not combine a separate multiply and add into an FMA on its own, because doing so changes the result: the two roundings of the multiply and the add become the single rounding of the FMA. You would therefore need a relaxed floating-point model (something like -ffast-math) to let it break strict IEEE floating-point compliance by converting a multiply and add into an FMA. I am not sure how it works when you call the intrinsics directly, but the compiler might not rewrite them based on flags, as they are very specific instructions.
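To see the rounding difference concretely, here is a small standalone sketch using the standard fma() from math.h rather than an intrinsic (the constants are only chosen to expose the effect; link with -lm):

#include <math.h>
#include <stdio.h>

int main(void) {
    /* a*b is not exactly representable as a double, so rounding order matters. */
    double a = 1.0 + 0x1p-27;
    double b = 1.0 - 0x1p-27;
    double separate = a * b - 1.0;       /* product rounded first, then subtracted: gives 0 */
    double fused    = fma(a, b, -1.0);   /* one rounding at the end: gives about -5.55e-17  */
    printf("separate = %g, fused = %g\n", separate, fused);
    return 0;
}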

The FMA4 flag (-mfma4) reliably produces the same result on my Intel CPUs, with valgrind throwing a hissy fit similar to the one you have posted, whereas it behaves fine on my AMD machines (I take it your processor is an Intel?):

vex amd64->IR: unhandled instruction bytes: 0xC4 0x43 0x19 0x6B 0xE5 0xE0 0xF2 0x44
vex amd64->IR:   REX=0 REX.W=0 REX.R=1 REX.X=0 REX.B=1
vex amd64->IR:   VEX=1 VEX.L=0 VEX.nVVVV=0xC ESC=0F3A
vex amd64->IR:   PFX.66=1 PFX.F2=0 PFX.F3=0

This is from the test code below.

FMA3 Intrinsics: (AVX2 - Intel Haswell)

_mm_fmadd_pd(), _mm256_fmadd_pd()
_mm_fmadd_ps(), _mm256_fmadd_ps()

and many many more besides....

FMA4 Intrinsics: (XOP - AMD Bulldozer)

_mm_macc_pd(), _mm256_macc_pd()
_mm_macc_ps(), _mm256_macc_ps()

and many many more besides....

Notes

FMA offers support for features that were scheduled to be part of SSE5 such as:

  • XOP: integer vector multiply-accumulate instructions, integer vector horizontal addition, integer vector compare, shift and rotate instructions, byte permutation and conditional move instructions, floating-point fraction extraction.
  • FMA4: floating-point vector multiply-accumulate.
  • F16C: half-precision floating-point conversion.

Test Code

#include <stdio.h>

/* A separate multiply and add that the compiler may fuse into a single
   FMA instruction when built with -mfma4 (or -mfma) and optimisation on. */
float vfmaddsd_func(float f1, float f2, float f3) {
  return f1 * f2 + f3;
}

int main() {
  float f1 = 1.1f;
  float f2 = 2.2f;
  float f3 = 3.3f;
  float f4 = vfmaddsd_func(f1, f2, f3);
  printf("%f\n", f4);
  return 0;
}
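For what it is worth, I trigger it by building the snippet with something along the lines of gcc -O2 -ffast-math -mfma4 fma_test.c -o fma_test (the exact flags and file name are just what I happened to use) and then running it under valgrind --tool=callgrind ./fma_test.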
GMasucci
  • I am using FMA for vector multiply-accumulate. I have several versions of the function written, and I do not get what you mean by both might not be good. As I understand it, without FMA I cannot use the _mm256_fmadd_ps(a,b,c) commands nor the SSE version _mm_fmadd(a,b,c). Now, I have noticed that my SSE implementation tends to do a bit better on AMD architecture than the AVX one, but the performance of both should improve with it since it lets me combine a multiply and add operation, no? – James Matta Apr 02 '14 at 09:57
  • FMA4 is supported on AMD after, and including, the Bulldozer (2011) cores, and on Intel since the Haswell core CPUs. To optimise for each specific CPU type you would have to implement different code for each, or allow the compiler to try to optimise generic code based on the flags you set. Personally I would go for separate code wrapped in #defines so that different code bodies are used dependent on the flags you select for the compiler. – GMasucci Apr 02 '14 at 10:17
  • I actually have separate code split up using specializations of a template class though the compute grid I am targeting is pretty much solely AMD systems with FMA4 and AVX. Anyways, the whole point of my question was not about which instruction set or optimization scheme was better. It was about the fact that callgrind was failing with fma instructions and I was looking for a profiler that could give similar detail and information that would also handle that properly. – James Matta Apr 02 '14 at 10:24
  • It is the FMA4 flag that causes it, as you are on an Intel CPU I believe; I get the same error on my Intel CPU machine, and I get a variant when I use `mavx` on my AMD CPU. Will look into it further and get back to you asap :) – GMasucci Apr 02 '14 at 10:29
  • @james-matta Just posted an update, hopefully it is of use to you:), unless of course you are on an AMD CPU in which case I will need to read up on things tonight:( – GMasucci Apr 02 '14 at 10:40
  • Look, you aren't getting the point. I know the source of the problems valgrind/callgrind is having; I have known it since before you wrote your answer (look at the comments below the main question). I also certainly know how to use the instructions the flag enables; I use the xmmintrin.h and immintrin.h headers. My question from the start was: what other profilers are there that can do as good a job as callgrind (or come close)? I am currently sitting a night shift babysitting a temperamental DAQ and trying to get work done, so I know my temper is short, but you are not answering the question that I asked! – James Matta Apr 02 '14 at 10:57
  • I am now editing the question to try to make things clearer. – James Matta Apr 02 '14 at 10:59
  • Apologies, and you have my sympathies; will get on it :) – GMasucci Apr 02 '14 at 11:19
  • Hi again, the only comparable tool I can think of immediately is [Purify](http://en.wikipedia.org/wiki/IBM_Rational_Purify), and it is not cheap; I will keep hunting.... – GMasucci Apr 02 '14 at 11:33
  • BTW is it Linux or windows you need a valgrind equivalent for? (I presumed linux...) – GMasucci Apr 02 '14 at 11:36
  • Definitely linux. Also, if you can't think of anything there is no need to go searching for it unless you are super curious, it might very well be the case that no such alternative exists. I was just hoping someone would know so I didn't have to try every alternative myself. – James Matta Apr 02 '14 at 11:43
  • From what I can remember, Purify is very close to valgrind, but valgrind is about the most versatile tool around. I will ask around and see if we can't get you an alternative. :) – GMasucci Apr 02 '14 at 11:53