
I was trying out the clang compiler and wanted to compare its performance against traditional gcc. I found that its floating-point performance is much worse than gcc's (by almost 30%). I compared the assembly files produced for my code by clang and gcc and found that gcc was using F* mnemonics (e.g. fcmpezd) while clang was using V* mnemonics (e.g. vcmpe.f64). Does this affect the number of instruction cycles? I believe both instructions are aliases.
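
For reference, a minimal sketch of the two spellings as I understand them (assuming a compare of d0 against zero; the pre-UAL F* mnemonics and the UAL V* mnemonics should assemble to the same encodings, but please correct me if not):

    @ pre-UAL VFP spelling
    fcmpezd d0                  @ compare d0 with +0.0, signalling on a NaN
    fmstat                      @ copy the FPSCR flags into the APSR

    @ UAL spelling - should encode to the same instructions
    vcmpe.f64 d0, #0.0          @ compare d0 with +0.0, signalling on a NaN
    vmrs    APSR_nzcv, fpscr    @ copy the FPSCR flags into the APSR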

Also, in the assembly, whenever a function is defined, registers are pushed onto the stack. GCC was using an stmfd instruction, while clang was using a push instruction followed by an add instruction that adds some value (or a register's contents) to the stack pointer (sp). Do the two instruction sequences - (stmfd) and (push, add) - consume the same number of cycles?
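
Roughly the kind of prologues I mean (the register lists and offset below are made up for illustration; as far as I can tell, push {...} is just another spelling of stmfd sp!, {...}):

    @ gcc-style prologue: one store-multiple that decrements sp
    stmfd   sp!, {r4, r7, lr}       @ save callee-saved registers and the return address

    @ clang-style prologue: push (an alias for stmfd sp!, {...}) plus frame-pointer setup
    push    {r4, r7, lr}            @ the same store-multiple, UAL spelling
    add     r7, sp, #4              @ point r7 at the saved frame (frame-pointer convention)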

I am using vfpv3 as the FPU option when compiling with both clang and gcc. Also, please suggest a good tool that will tell me how many cycles an instruction consumes.

sarda
  • Good practice is to attach simple reproducible C code and the exact option set you are compiling with. Maybe all you need is to specify hardfp, or to tune other clang options. – Konstantin Vladimirov Dec 10 '13 at 13:53
  • the fstuff and vstuff.stuff mnemonics are aliases; compile to binary and disassemble if that works and check the opcodes. That doesn't mean the two compilers use the exact same code sequences though, but one syntax or the other, or a mixture of the two, will work (at least with the GNU assembler). – old_timer Dec 10 '13 at 14:01
  • you should not take the approach of asking how much an instruction is worth. Granted, removing instructions often helps performance, but there is more to it than examining each instruction: you examine the sequences, etc. You need to provide some code examples, some idea of how you timed them, etc. (accurate timing of code is as often the problem as the code under test). – old_timer Dec 10 '13 at 14:04
  • In general clang has trailed gcc in my performance tests as well. I don't use floating point, just normal code. Of course benchmarks are very subjective and can be tuned to make any compiler look good or bad. – old_timer Dec 10 '13 at 14:06
  • Sorry for not pasting the sample code. The code is huge, as small code will not give accurate results. You can find the test code here: https://github.com/robbertkrebbers/compcert/blob/master/test/c/almabench.c I cross-compiled this test case for ARM with the options "-march=armv7-a -mfpu=vfpv3-d16 -mfloat=softfp" for both clang and gcc, with optimization level -O3 for both. – sarda Dec 10 '13 at 14:45
  • You're on to a loser using doubles on any ARMv7 device to start with, and doubly losing by using softfp - not that this explains the performance difference. It would be helpful to know which ARMv7 part you've done this testing on. The results are likely to be very different between, say, a TI or Samsung SoC and an Apple A6 or A7. The latter differs substantially from the standard ARM macro-cell almost everyone else is using. – marko Dec 14 '13 at 01:01
  • @marko I did the testing on Apple A7 – sarda Dec 14 '13 at 11:31
  • This is doubly surprising, seeing as GCC almost certainly doesn't have specific optimisations for the architectural differences in the A6 and A7 processors - some of which relate to double-precision floating-point arithmetic. Also, Apple doesn't use the brain-damage that is the softfp ABI - quite what happens when you link a module that does to the C standard library math functions, which presumably use the hard ABI, I wouldn't like to say. In case you weren't aware, moving data from NEON to integer registers is really expensive - which softfp does a lot. – marko Dec 14 '13 at 14:50
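
A minimal sketch of the register traffic marko describes: under the softfp ABI a double argument and result travel in core registers, so a caller holding the value in a VFP register has to shuffle it around the call (hypothetical call to sin, assuming the usual double sin(double) signature):

    @ softfp call: the double travels in r0/r1 rather than staying in d0
    vmov    r0, r1, d0          @ copy the double out of the VFP register for the call
    bl      sin                 @ under softfp the argument is received in r0/r1
    vmov    d0, r0, r1          @ copy the double result back into a VFP register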

0 Answers