
This is a follow-up to an existing thread (http://stackoverflow.com/questions/12724887/caching-in-a-high-performance-financial-application) - I found that it's not the cache that hinders my application. To cut a long story short, I have an application which spends 70 percent of its runtime (15 seconds out of 22) in one function. Hence, I would like to cut the runtime of this function as much as possible, as the envisaged use of the function is on MUCH larger data (i.e. 22 seconds is not the planned runtime :))

The problem is that VTune's output puzzles me; the code seems to spend a great deal of time in absolutely unexpected places. I have run out of ideas, so I'm posting my project together with the profiler results here.

Taking a look at the offending evaluateExits() function, these things puzzle me:

1/ The function spends 2.2 s calling an inline function that returns 1 regardless of its parameters (line 425, this->contractManager->contractCount()). Note: the version where the function returns 1 regardless of its parameters is only one of the possible cases, so I can't simply hard-code contractCount = 1 and leave it at that. Can the indirection through the virtual table pointer eat up those 2.2 seconds (contractCount() is a virtual method)?
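For what it's worth, when a virtual call is loop-invariant, a common mitigation is to hoist it out of the hot loop so the vtable indirection happens once instead of once per iteration. A minimal sketch of the idea (the class and function below are illustrative stand-ins, not the actual project code):

```cpp
#include <cstddef>

// Illustrative stand-in for the real contract manager interface.
struct ContractManager {
    virtual ~ContractManager() {}
    virtual std::size_t contractCount() const { return 1; }
};

// Hoist the virtual call: one indirect call instead of one per iteration.
double evaluate(const ContractManager& mgr, const double* data, std::size_t n) {
    const std::size_t count = mgr.contractCount();  // loop-invariant, hoisted
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i] * static_cast<double>(count);
    return sum;
}
```

Whether this applies depends on whether contractCount() can change during the loop, of course.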

2/ The function spends 3.3 s on min(uint1, uint2) (line 432), even though I'm using a version of wmin that should be as CPU-friendly as possible.
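Not knowing the exact wmin() implementation, the usual branchless pattern for an unsigned min looks like this (the name branchless_min is mine, not from the project). Also note that sampling profilers like VTune can attribute time from a nearby stall, such as a cache miss on the operands, to an innocent-looking line, so the min itself may not be the real cost:

```cpp
#include <cstdint>

// Branchless min of two unsigned 32-bit ints: when a < b the mask is all
// ones and the expression selects a; otherwise the mask is zero and b wins.
inline uint32_t branchless_min(uint32_t a, uint32_t b) {
    uint32_t mask = static_cast<uint32_t>(-static_cast<int32_t>(a < b));
    return b ^ ((a ^ b) & mask);
}
```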

3/ The function spends 1.6 s on line 512, which is a very trivial operation, and the function being called is not a virtual one.

So the questions are: why do these three lines of code take so much time? What am I overlooking? And how could I optimize my code to make it run faster? Should I replace wmin() with an SSE version of min applied to whole arrays?
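On the last question: before hand-writing SSE intrinsics, it may be worth checking whether the compiler can vectorize a whole-array min on its own. A hedged sketch (function name and signature are mine, not from the project) of a loop simple enough that optimizing compilers can often turn it into packed SSE min operations:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Element-wise min over whole arrays; the loop body has no branches or
// cross-iteration dependencies, so auto-vectorizers handle it well.
void min_arrays(const uint32_t* a, const uint32_t* b,
                uint32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::min(a[i], b[i]);
}
```

Checking the generated assembly would confirm whether the compiler actually vectorized it for the given target flags.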

Any input is much appreciated. Daniel

EDIT: Taking a look into the assembly, I found that in case 1/ it really is the vfptr that makes the code "slow". I replaced the virtual function call with one of Don Clugston's fast delegates, but no performance change whatsoever occurred (I have no clue why). Following Nightingale's comment, the attachments should now contain all the necessary files. However, the binary cannot be run successfully, as it connects to shared memory holding hundreds of MB of data.

So, I attach the whole project together with VTune's results here and here.

Daniel Bencik

1 Answer


Daniel,

I wanted to take a look at your VTune results, but unfortunately you did not include the binary module for which the result was collected, so I couldn't look into the assembly, which should be of the greatest value here. Can you re-post your project archive with the binary file and debug information file included?

I also attempted to re-build your sources, but a number of header files could not be found:

  • Some Qt headers (I don't have Qt installed and am not an expert at setting it up)
  • parameterHolder.h file
  • externFloatConsts.h file

So, in order to help, it would be good to have these files or the binary that was used to collect the data.

Alexey Alexandrov
  • Hello Nightingale, many thanks! I have updated the original links. Now there is the whole project (without the sdf file, which is large) and also the whole collection of libraries the project uses. – Daniel Bencik Oct 13 '12 at 18:22
  • The binaries are there now, thanks. But when I try to point VTune at them, it says the checksum doesn't match. That means you likely recompiled the binary AFTER you collected the data. But to see the assembly I need the exact binary that was used to collect the data. (I tried pointing the tool to both the release and debug executables, though profiling with a debug executable is usually a bad idea, so I assume you used the release one.) – Alexey Alexandrov Oct 14 '12 at 12:33
  • Also, do I understand correctly that the *.sdf file is the input file for the benchmark, and so I won't be able to run it on my own anyway? If yes, then the executable that matches the result you collected is essential. – Alexey Alexandrov Oct 14 '12 at 12:35
  • Hello Nightingale, thanks for your effort. This is very weird, unfortunately :( I downloaded the whole rar file from rapidshare, put it onto a separate partition, and opened the VS project. From there, I opened the VTune report and everything was OK; I was able to browse through the assembly without problems. The *.sdf file is a VS-generated file for faster function lookup while programming. If I delete it and open the project anew, it's created anew. So I would like to ask you, did you open the VTune report the way I did? Many thanks!! – Daniel Bencik Oct 14 '12 at 13:32
  • I was able to look at the result finally - but I had to re-finalize the VTune results in the assumption that the module is really the same. For some reason the checksum still didn't match for me until I did that. In doing so there is always a risk of getting incoherent binary vs. performance data, but the sanity check seems to show that it's OK. – Alexey Alexandrov Oct 14 '12 at 16:02
  • As for the performance itself. I took a look at the hotspots and I don't see anything particularly bad there. I can see that the evaluateExits function retires about 19.5B instructions per thread (there seem to be three threads in the memory access result, but I believe those are just three separate VTune runs, not really separate threads inside the application). It retires those 19.5 billion instructions in 16.5 billion clockticks, which means the CPI is about 0.85. That's far from the ideal 0.25 but also not that bad. So I would consider a different algorithm or adding threading. – Alexey Alexandrov Oct 14 '12 at 16:08
  • Thank you for the effort! I happened to achieve a 10+ times speedup of the second most inefficient function ( uiTickCount() ) using SSE, so I'm counting on a 4.5s drop in runtime. When it comes to the performance report - could you somehow interpret for me why the CPI is not 0.25? I mean... does it mean that my code is waiting for something most of the time (for data?), or something else? How would threading help with decreasing CPI? Many thanks! – Daniel Bencik Oct 14 '12 at 18:34
  • To achieve a CPI of 0.25 you need the CPU to retire 4 instructions every cycle. There are several reasons why this may be hard to get. First, some instructions simply take more than 1 cycle to execute due to how they are implemented in HW. For example, multiplications and divisions are quite slow (especially divisions; always try to calculate the reciprocal once and then multiply if you can). Second, there may be data dependencies on memory loads such that if the load takes a miss, the subsequent instructions have to wait. Third, even if there are independent instructions to execute, they may... – Alexey Alexandrov Oct 15 '12 at 20:02
  • ...they may still lack execution resources. For example, you might have a couple of divides in a row which have no interdependencies and in theory can execute in parallel in the out-of-order CPU engine. But since there are usually not many divide units in the ALU, you won't get that much parallelism. So, in short, for loops you need to closely watch the critical data dependency path and make sure there are no long-latency instructions (including cache misses) on it. I hope this helps. The task ain't easy. SSE is a good way to bring things "out of order" - pun intended. – Alexey Alexandrov Oct 15 '12 at 20:05
  • I see. Thank you for the clarification. SSE creating a mess - I really needed to make one thing quicker, and I tested the new SSE code extensively - it just gives the same results as the old (tested) code. Do you think I need to be even more careful than just testing the SSE code against the old code? Once again, thanks for your patience! – Daniel Bencik Oct 16 '12 at 09:45
  • The usual corner cases with SSE are 1) a number of elements that is not a multiple of the vector width; 2) data alignment. To cover the first, make sure you test your code with data sizes that are not just powers of 2 but some different values, like a prime number. Data alignment is trickier - you usually just need more testing, and test in release mode with optimizations, where packing is usually tighter. – Alexey Alexandrov Oct 17 '13 at 13:48
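The reciprocal advice from the comments above, as a minimal sketch (names here are illustrative, not from the project): one slow divide up front replaces a divide per element, at the cost of a possible last-bit rounding difference versus dividing each element directly.

```cpp
#include <cstddef>

// Replace per-element division with one division plus per-element multiply.
// Note: x * (1.0 / d) may differ from x / d in the last bit of the result.
void scale_down(double* v, std::size_t n, double divisor) {
    const double inv = 1.0 / divisor;  // one expensive divide
    for (std::size_t i = 0; i < n; ++i)
        v[i] *= inv;                   // cheap multiplies in the hot loop
}
```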