
I have written optimized code for an algorithm that computes a vector of quantities. I have timed it before and after various attempts at getting the data computed in the function out of the function. I think that the specific nature of the computation or the nature of the vector of quantities is not relevant. An outline of the code, timings, and details follow.

All code was compiled with the following flags:

g++ -Wall -Wextra -Werror -std=c++11 -pedantic -O3

I have a class like this:

#ifndef C_H
#define C_H

#include <algorithm>
#include <array>
#include <iostream>
#include <iterator>

class C {
    public:
        void doWork( int param1, int param2 ) const {
            std::array<unsigned long,40> counts = {{0}};
            // LOTS of branches and inexpensive operations:
            // additions, subtractions, incrementations, and dereferences
            for( /* loop 1 */ ) {
                // LOTS MORE branches and inexpensive operations
                counts[ /* data dependent */ ] += /* data dependent */;
                for( /* loop 2 */ ) {
                    // YET MORE branches inexpensive operations
                    counts[ /* data dependent */ ] += /* data dependent */;
                }
            }
            counts[ /* data dependent */ ] = /* data dependent */;
            /* excluded for profiling
            std::copy( &counts[0], &counts[40], std::ostream_iterator<unsigned long>( std::cout, "," ) );
            std::cout << "\n";
            */
        }
    private:
        // there is private data here that is processed above
        // the results get added into the array/vector as they are computed
};

#endif

And a main like this:

#include <iostream>
#include "c.h"
int main( int argc, char * argv[] ) {
    C c( /* set the private data of c by passing data in */ );
    int param1;
    int param2;
    while( std::cin >> param1 >> param2 ) {
        c.doWork( param1, param2 );
    }
}

Here are some relevant details about the data:

  • 20 million pairs read from standard input (redirected from a file)
  • 20 million calls to c.doWork
  • 60 million TOTAL iterations through the outer loop in c.doWork
  • 180 million TOTAL iterations through the inner loop in c.doWork

All of this requires exactly 5 minutes and 48 seconds to run. Naturally I can print the array within the class function, and that is what I have been doing, but I am going to release the code publicly, and some use cases may include wanting to do something other than printing the vector. In that case, I need to change the function signature to actually get the data to the user. This is where the problem arises. Things that I have tried:

  • Creating a vector in main and passing it in by reference:

    std::vector<unsigned long> counts( 40 );
    while( std::cin >> param1 >> param2 ) {
        c.doWork( param1, param2, counts );
        std::fill( counts.begin(), counts.end(), 0 );
    }
    

    This requires 7 minutes 30 seconds. Removing the call to std::fill only reduces this by 15 seconds, so that doesn't account for the discrepancy.

  • Creating a vector within the doWork function and returning it, taking advantage of move semantics. Since this requires a dynamic allocation for each result, I didn't expect this to be fast. Strangely it's not a lot slower. 7 minutes 40 seconds.

  • Returning the std::array currently in doWork by value. Naturally this has to copy the data upon return, since moving a stack-allocated std::array still copies its elements. 7 minutes 30 seconds.

  • Passing a std::array in by reference.

    while( std::cin >> param1 >> param2 ) {
        std::array<unsigned long,40> counts = {{0}};
        c.doWork( param1, param2, counts );
    }
    

    I would expect this to be roughly equivalent to the original. The data is placed on the stack in the main function, and it is passed by reference to doWork, which fills it. 7 minutes 20 seconds. This one really stymies me.

I have not tried passing pointers in to doWork, because this should be equivalent to passing by reference.

One solution is naturally to have two versions of the function: one that prints locally and one that returns. The roadblock is that I would have to duplicate ALL code, because the entire issue here is that I cannot efficiently get the results out of a function.
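For what it's worth, duplicating the code can be avoided by keeping a single worker that fills a caller-provided array, plus a thin printing wrapper. This is only a minimal sketch with placeholder logic and illustrative names (doWorkAndPrint is not from my actual code):

```cpp
#include <algorithm>
#include <array>
#include <iostream>
#include <iterator>

class C {
    public:
        // The single real implementation: fills a caller-provided array.
        // The body is a placeholder standing in for the actual loops.
        void doWork( int param1, int param2,
                     std::array<unsigned long,40> &counts ) const {
            counts.fill( 0 );
            counts[0] = static_cast<unsigned long>( param1 + param2 ); // placeholder
        }

        // Thin wrapper for the printing use case; no logic is duplicated.
        void doWorkAndPrint( int param1, int param2 ) const {
            std::array<unsigned long,40> counts;
            doWork( param1, param2, counts );
            std::copy( counts.begin(), counts.end(),
                       std::ostream_iterator<unsigned long>( std::cout, "," ) );
            std::cout << "\n";
        }
};
```

Of course, this still forces every call through the by-reference path, so it does not by itself explain or avoid the slowdown; it only addresses the duplication concern.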

So I am mystified. I understand that any of these solutions require an extra dereference for every access to the array/vector inside doWork, but these extra dereferences are highly trivial compared to the huge number of other fast operations and more troublesome data-dependent branches.

I welcome any ideas to explain this. My only thought is that the code is being optimized by the compiler so that some otherwise necessary components of computation are being omitted in the original case, because the compiler realizes that it is not necessary. But this seems to be contraindicated on several counts:

  1. Making changes to the code inside the loops does change the timings.
  2. The original timings are 5 minutes 50 seconds, whereas just reading the pairs from the file takes 12 seconds, so a lot is being done.
  3. Maybe only operations involving counts are being optimized away, but that seems like a strangely selective optimization, given that if those were being optimized away, the compiler could realize that the supporting computations in doWork are also unnecessary.
  4. If operations involving counts ARE being optimized away, why are they not optimized away in the other cases? I am not actually using the results in main.

Is it the case that doWork is compiled and optimized independently of main, and thus if the function has any obligation to return the data in any form it cannot be certain of whether it will be used or not?

Is my method of profiling without printing (intended to avoid the cost of the printing and emphasize the relative differences between the various methods) flawed?
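If it helps, one way to benchmark without printing while still forcing the compiler to treat the results as used is a compiler barrier of the kind benchmarking libraries use. A sketch, GCC/Clang-specific, with illustrative names (makeCounts is a stand-in for my real computation, not actual code):

```cpp
#include <array>

// Forces the compiler to assume `value` may be read through its address,
// so the writes leading up to it cannot be eliminated as dead code.
// GCC/Clang extended-asm specific; Google Benchmark's DoNotOptimize
// works along similar lines.
template <typename T>
void escape( T const &value ) {
    asm volatile( "" : : "g"( &value ) : "memory" );
}

// Example: simulate a doWork-style accumulation, then sink the result.
std::array<unsigned long,40> makeCounts( int param1, int param2 ) {
    std::array<unsigned long,40> counts = {{0}};
    counts[0] = static_cast<unsigned long>( param1 * param2 ); // placeholder work
    escape( counts );   // costs ~nothing, but keeps the writes live
    return counts;
}
```

Calling escape(counts) at the end of doWork would make the "print excluded" timing comparable to the other variants without paying for the output.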

I am grateful for any light you can shed.

rg6
  • Does any data depend on any previous calculated data? If not you might want to split the calculation into a few threads, where each thread works on part of the data-set. – Some programmer dude Apr 28 '13 at 19:47
  • How big is the total vector size? As in, how many bytes? – Mysticial Apr 28 '13 at 19:57
  • How big is your `counts` array? Is it really just 40 elements (which should take a very short amount of time to copy)? I see a `&counts[1156]` in your commented code. – Xymostech Apr 28 '13 at 19:57
  • What's the difference between the disassembly for the cases? – lapk Apr 28 '13 at 20:08
  • Just a guess. Compiler might unroll some loops. In the case when you have `array` defined inside the function and if your loops, say, from `0` to `array.size()`, compiler might be able to do it. When you pass `array &` as a parameter, compiler has no idea what size it will have and cannot unroll the loops... It's a guess, you need to check code emitted by the compiler in both cases. – lapk Apr 28 '13 at 20:17
  • What if you pass a pointer to an array as an argument, but keep the local array to do the actual work? Then, at the end, copy the local array to the provided pointer if it isn't null. – Vaughn Cato Apr 28 '13 at 20:21
  • @Joachim No each iteration of the loop is entirely independent, and I will certainly make provisions to parallelize it once the aspects of serial computation are optimized, but I am not ready for that step yet. – rg6 Apr 28 '13 at 21:22
  • @Xymostech There are 40 elements (corrected above). Size is 40*sizeof(unsigned long) = 320 bytes on my machine. – rg6 Apr 28 '13 at 21:34
  • @PetrBudnik The compiler knows the size of the array either way (not in the vector case obviously) because std::array takes its size as a template argument. There are no loops on array except commented output in doWork (not included in profile timings). – rg6 Apr 28 '13 at 21:35
  • @VaughnCato Is this not (almost) functionally equivalent to passing a std:array by reference (which is one of the items I tried, shown in a bullet), the only difference being possibly an extra dereference for accesses. This is so cheap and so rarely appears in my code compared to the other work, that it certainly cannot account for the huge difference in times can it? – rg6 Apr 28 '13 at 21:37
  • @RyanN.Lichtenwalter No, I don't think you correct here. If `array` is a local variable in the function, the compiler can optimize code based on that - the size never changes in the function. If `array` is passed as a parameter, compiler cannot use size as optimization for the function code - it has to emit generic code that will work with all `array`s independently of their size... Once again, did you compare disassembly? To make things obvious, forbid any optimizations and compare times for both versions without optimizations. – lapk Apr 28 '13 at 21:41
  • Is there an operation cheaper than the expensive printing of the array that the compiler is guaranteed not to optimize away with optimization on? Then I do that in the original version of the code, be sure that the operation is trivial and not accounting for 1 minute 30 seconds of time, and find out if necessary optimizations are being optimized out otherwise. I could certainly look at assembly, but the actual code is extremely complex, which is why it's not all posted here, and I am not confident that I can align it well after optimized output. – rg6 Apr 28 '13 at 21:43
  • @PetrBudnik According to http://en.cppreference.com/w/cpp/container/array, std::array is a constant-size container with size defined at compile time. If std::array & is passed as a parameter, how can the compiler not know the size? – rg6 Apr 28 '13 at 21:45
  • Have you tried using iterators? (For `vector`s, `resize` before passing them, of course.) `std::distance` can tell you the size of the range end-begin (store that in a `const`). Of course, this doesn't adress loop unrolling because of known iterations (array size). You could also try writing a template to cover the two cases 1) size known at compile time 2) size known only at run time. – dyp Apr 28 '13 at 22:00
  • @RyanN.Lichtenwalter Yes, you are right, of course. I mixed it up with your attempt with `vector`... – lapk Apr 28 '13 at 22:12
  • It is possible that accessing an array on the stack is faster than accessing one that is passed in by reference. The compiler knows more about what may or may not happen to that memory when it is on the stack. It's possible that lets it keep more in registers. I would try it and see. – Vaughn Cato Apr 28 '13 at 22:22
  • @DyP Yes, I actually did try sizing a vector and passing in iterators. Thanks for the suggestion. It yielded the same results as the other attempts. I beat my head against the wall last night for about 5 hours trying different things. Anything but local declaration of the static array, whether it uses data ultimately declared on the stack or on the heap, takes about the same time: 7 minutes 20-40 seconds. – rg6 Apr 28 '13 at 23:05

3 Answers


What I would do is pause it a few times and see what it's doing most of the time. Looking at your code, I would suspect most of the time goes into either 1) the innermost loop, especially the index calculation, or 2) the allocation of the std::array.

If the size of counts is always 40, I would just do

  long counts[40];
  memset(counts, 0, sizeof(counts));

That allocates on the stack, which takes no time, and memset takes no time compared to whatever else you're doing.

If the size varies at runtime, then what I do is some static allocation, like this:

void myRoutine(){
  /* this does not claim to be pretty. it claims to be efficient */
  static int nAlloc = 0;
  static long* counts = NULL;
  /* this initially allocates the array, and makes it bigger if necessary */
  if (nAlloc < /* size I need */){
    if (counts) delete [] counts;
    nAlloc = /* size I need */;
    counts = new long[nAlloc];
  }
  memset(counts, 0, sizeof(long)*nAlloc);
  /* do the rest of the stuff */
}

This way, counts is always big enough, and the point is to 1) do new as few times as possible, and 2) keep the indexing into counts as simple as possible.

But first I would do the pauses, just to be sure. After fixing it, I would do that again to see what's the next thing I could fix.

Mike Dunlavey
  • Hmm. Either I am totally wrong about this or many people have a misconception. Isn't it true that std::array is a *statically* allocated container located on the stack. If not, why does it take size as a template parameter? It's allocation cost should be minimal shouldn't it? I have tried memset, and on any solution where the data leaves the function, it has the same high cost (around 7 minutes 20 seconds). The code should be spending most of its time in the many, many branches, dereferences and so on, but I guess as per some comments, I'll try to figure out the assembly. – rg6 Apr 28 '13 at 22:33
  • Gosh, @Ryan, why guess? Just run it under the debugger and pause it a few times. You can just *see* what the problem is. – Mike Dunlavey Apr 28 '13 at 22:35
  • Side remark: `if(counts)` shouldn't be necessary, as `delete 0;` has defined behaviour IIRC. `nullptr` would also be nice ;) Maybe even `memset` could be dropped in favor of `new long[nAlloc]();` but I'm not sure about the performance of the latter. – dyp Apr 28 '13 at 22:42
  • Ok. I ran it through gdb in original form and passing std::array by reference. In both cases, about 100 pauses and resumes shows a sparsely sampled but apparently roughly uniform distribution of all the statments, so it seems like they are being executed. As I keep saying, the actual code is an extremely complex series of homogeneous-cost dereferences, additions, subtractions, and some more expensive branches. I fear a large sample would be required to learn much, but a hundred each way does suggest all code is being executed. That confuses me even more. – rg6 Apr 28 '13 at 22:54
  • @Ryan: 10 or 20 is all you need, but take the trouble to thoroughly understand each sample. It won't be uniform. It will be concentrated in the innermost loop. Maybe noticeable in the next-to-inner loop. Where you find it, you can examine the assembly language to see if it is calculating the array indexing, or the right-hand-side of the statement. If it makes it easier, you can divide the right-hand-side of each statement into multiple statements. It won't affect performance. If you find yourself inside some strange routine, look up the stack to see where in your code it is. – Mike Dunlavey Apr 29 '13 at 00:18
  • @MikeDunlavey In fact they weren't concentrated in the innermost loop, because while the innermost loop is executed most frequently, it also contains the fewest operations. Nonetheless, this has, as I said, revealed some interesting things about what *is* being executed, and I've been spending the last few hours making sense of it. As soon as I reach a conclusion, I'll update. Thanks! – rg6 Apr 29 '13 at 02:01
  • @MikeDunlavey Point well taken, though! I suppose I just meant that I observed them in proportion to expectation. I think I've tracked down the issue. As soon as I verify, I'll close the question. – rg6 Apr 29 '13 at 02:11
  • @MikeDunlavey This pseudo-profiling approach ended up hinting at the issue. The sheer number of inexpensive statements involved meant it took a lot of time, but it began to look like some statements were not executed. To confirm this, I started running timings including output printing both inside and outside the class. These took 10 minutes 30-50 seconds depending on the approach. As it turned out, the differing approaches to getting the data out were as cheap as I (and we all) expected. The compiler was performing some really selective optimizations on code that it detected had no effect. – rg6 Apr 29 '13 at 08:59
  • @Ryan: You should find a small number of manual samples pointing directly at the problem. It is necessary to thoroughly understand each sample, to know precisely what it is doing at the sample time, and the full reason it's doing it. I do it first with the optimizer turned off, the reason being that if I'm doing something dumb it's easier to see that way. Then I turn on the optimizer after I'm sure I've cleaned out all the dumb stuff. – Mike Dunlavey Apr 29 '13 at 13:52

Compiler optimizations are one place to look, but there is one more place you need to look. Changes that you made to the code can disturb the cache layout. If the memory allocated to the array is in a different part of memory each time, the number of cache misses in your system can increase, which in turn degrades performance. You can take a look at the hardware performance counters on your CPU to make a better guess about it.

Always Asking
  • Interesting. This had occurred to me, but I assumed that the cost penalty couldn't account for a 25% penalty in such a complicated procedure. I'll try to look into this. – rg6 Apr 28 '13 at 22:37

There are times when unorthodox solutions are applicable, and this may be one. Have you considered making the array a global?

Still, the one crucial benefit that local variables have is that the optimizer can find all access to it, using information from the function only. That makes register assignment a whole lot easier.

A static variable inside the function is almost the same, but there the address of the array would escape, beating the optimizer once again.
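Combining this with Vaughn Cato's comment above: do all the accumulation in a local array whose address never escapes during the hot loops, and copy it to the caller's buffer once at the end. A minimal sketch with stand-in loops and illustrative names, not the original code:

```cpp
#include <array>
#include <cstddef>

// Hypothetical free-function version of doWork. The hot loops touch only
// `local`, whose address never escapes, so the optimizer can reason about
// it freely; the single 320-byte copy per call happens outside the loops.
void doWork( int param1, int param2, std::array<unsigned long,40> *out ) {
    std::array<unsigned long,40> local = {{0}};
    for ( int i = 0; i < param1; ++i ) {        // stand-in for loop 1
        local[ static_cast<std::size_t>( i ) % 40 ] += 1;
        for ( int j = 0; j < param2; ++j ) {    // stand-in for loop 2
            local[ static_cast<std::size_t>( j ) % 40 ] += 1;
        }
    }
    if ( out ) *out = local;    // copy out only if the caller wants the data
}
```

Passing a null pointer keeps the print-only use case cheap, while callers who want the data pay only for one small copy per call.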

MSalters