
Inspired by Herb Sutter's compelling lecture Not your father's C++, I decided to take another look at the latest version of C++ using Microsoft's Visual Studio 2010. I was particularly interested in Herb's assertion that C++ is "safe and fast", because I write a lot of performance-critical code.

As a benchmark, I decided to try to write the same simple FFT algorithm in a variety of languages.

I came up with the following C++11 code that uses the standard library's complex type and vector container:

#include <cmath>
#include <complex>
#include <vector>

using namespace std;

// Must provide type or MSVC++ barfs with "ambiguous call to overloaded function"
double pi = 4 * atan(1.0);

void fft(int sign, vector<complex<double>> &zs) {
    // Reorder the elements into bit-reversed index order
    unsigned int j=0;
    // Warning about signed vs unsigned comparison
    for(unsigned int i=0; i<zs.size()-1; ++i) {
        if (i < j) {
            auto t = zs.at(i);
            zs.at(i) = zs.at(j);
            zs.at(j) = t;
        }
        int m=zs.size()/2;
        j^=m;
        while ((j & m) == 0) { m/=2; j^=m; }
    }
    // Combine pairs of sub-transforms of doubling length (butterfly passes)
    for(unsigned int j=1; j<zs.size(); j*=2)
        for(unsigned int m=0; m<j; ++m) {
            auto t = pi * sign * m / j;
            auto w = complex<double>(cos(t), sin(t));
            for(unsigned int i = m; i<zs.size(); i+=2*j) {
                complex<double> zi = zs.at(i), t = w * zs.at(i + j);
                zs.at(i) = zi + t;
                zs.at(i + j) = zi - t;
            }
        }
}

Note that this function only works for n-element vectors where n is an integral power of two. Anyone looking for fast FFT code that works for any n should look at FFTW.
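
For reference, this is the kind of call site I have in mind (the sign convention for forward vs. inverse and the normalisation are my own choices, not anything mandated by the code above):

#include <complex>
#include <cstdio>
#include <vector>

void fft(int sign, std::vector<std::complex<double>> &zs); // the fft defined above

int main() {
    // 8-point impulse: its transform should be a vector of all ones.
    std::vector<std::complex<double>> zs(8);
    zs[0] = 1.0;

    fft(-1, zs); // forward transform (sign convention assumed here)
    for (const auto &z : zs)
        std::printf("(%g, %g)\n", z.real(), z.imag());

    fft(+1, zs); // inverse transform; this implementation is unnormalised,
    for (auto &z : zs)
        z /= double(zs.size()); // so divide by n to recover the original input
}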

As I understand it, the traditional xs[i] syntax from C for indexing a vector does no bounds checking and, consequently, is not memory safe: an out-of-range index can cause silent, non-deterministic corruption or a memory access violation. So I used xs.at(i) instead.
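
To illustrate the difference I mean (a throwaway example, not part of the benchmark): at(i) throws std::out_of_range on a bad index, whereas xs[i] is simply undefined behaviour:

#include <cstdio>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> xs{1, 2, 3};

    try {
        int v = xs.at(3); // one past the end: at() throws std::out_of_range
        (void)v;
    } catch (const std::out_of_range &e) {
        std::printf("caught: %s\n", e.what());
    }

    // xs[3] would compile and "work", but it reads past the buffer:
    // undefined behaviour, with no exception to catch.
}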

Now, I want this code to be "safe and fast", but I am not a C++11 expert, so I'd like to ask: what improvements to this code would make it more idiomatic or efficient?

J D
    *"I'd like to ask for improvements to this code that would make it more idiomatic or efficient?"* - maybe [codereview](http://codereview.stackexchange.com) would be a better place to ask for reviews. – Flexo Apr 12 '12 at 10:15
  • Most, if not all, standard libraries offer iterator / index debugging in non-optimized / debug mode, which does bounds checking with `operator[]` too. In release mode, they are disabled, so you get the full performance. FWIW, MSVC's library is one of those. And if you're not sure whether another lib does it too, you can just write a helper function that calls `at` in debug and `operator[]` in release mode. – Xeo Apr 12 '12 at 11:04
  • Which other languages did you use? It would be interesting to see a comparison. – Simon Elliott Apr 12 '12 at 12:56
  • A small tip: you can use `std::swap` instead of swapping the values manually. It will always be at least as efficient as manual swapping and for types that have *move constructors* it's going to be significantly faster. (The `std::complex<>` type probably doesn't have move constructors, though, but using `std::swap` is still the preferred way to do it.) – Paul Manta Apr 12 '12 at 13:17
  • Also, in the `<cmath>` header you have the `M_PI` macro which saves you having to calculate it yourself. – Paul Manta Apr 12 '12 at 13:18
  • @Xeo - I forgot to say that in my answer. I did actually use std::swap. As expected it made no difference to the timings but made the code cleaner and more idiomatic – philsquared Apr 12 '12 at 14:23
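
A minimal sketch of the debug/release helper Xeo suggests above (the name checked_at and keying off NDEBUG are my own choices, not from the comment):

#include <cstddef>
#include <vector>

// Bounds-checked in debug builds, unchecked (and fast) in optimised builds.
// Keys off NDEBUG, the same macro that controls assert().
template <typename T>
T &checked_at(std::vector<T> &v, std::size_t i) {
#ifdef NDEBUG
    return v[i];    // release: no bounds check
#else
    return v.at(i); // debug: throws std::out_of_range on a bad index
#endif
}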

1 Answer


I think you are being overly "safe" in your use of at(). In most of your cases the index used is trivially verifiable as being constrained by the size of the container in the for loop.

e.g.

  for(unsigned int i=0; i<zs.size()-1; ++i) { 
    ...
    auto t = zs.at(i); 

The only ones I'd leave as at() are the (i + j) accesses. It's not immediately obvious whether they are always constrained (although if I were really unsure I'd probably check manually - but I'm not familiar enough with FFTs to have an opinion in this case).
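
Putting that together with the std::swap tip from the comments, the bit-reversal block might look something like this (a sketch of the same logic, not the exact code I timed, with the loop index switched to std::size_t as discussed below):

#include <algorithm> // std::swap
#include <complex>
#include <cstddef>
#include <vector>

// The bit-reversal block with operator[] where the loop bounds the index,
// and std::swap in place of the manual exchange. As in the question,
// zs.size() is assumed to be a power of two.
void bit_reverse(std::vector<std::complex<double>> &zs) {
    std::size_t j = 0;
    for (std::size_t i = 0; i < zs.size() - 1; ++i) {
        if (i < j)
            std::swap(zs[i], zs[j]); // both i and j index existing elements
        std::size_t m = zs.size() / 2;
        j ^= m;
        while ((j & m) == 0) { m /= 2; j ^= m; }
    }
}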

There are also some fixed computations being repeated for each loop iteration:

int m=zs.size()/2;
pi * sign
2*j

And the zs.at(i + j) is computed twice.

It's possible that the optimiser may catch these - but if you are treating this as performance critical, and you have your timers testing it, I'd hoist them out of the loops (or, in the case of zs.at(i + j), just take a reference on first use) and see if that impacts the timer.

Talking of second-guessing the optimiser: I'm sure that the calls to .size() will be inlined as, at worst, a direct read of an internal member variable - but given how many times you call it I'd also experiment with introducing local variables for zs.size() and zs.size()-1 up front. They're more likely to be put into registers that way too.
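
To make that concrete, here is a sketch of the second block of loops with those values hoisted (same algorithm; the function name is mine, and I've kept at() only for the i + j access, taking a reference on first use - though note from the update below that not all of these hoists paid off):

#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// The second block of loops with the loop-invariant values hoisted into
// locals: n, pi * sign and 2*j are each computed once per enclosing loop,
// and zs.at(i + j) is evaluated once per butterfly via a reference.
void fft_passes(int sign, std::vector<std::complex<double>> &zs) {
    const double pi = 4 * std::atan(1.0);  // as defined in the question
    const std::size_t n = zs.size();       // hoisted: zs.size()
    const double theta0 = pi * sign;       // hoisted: pi * sign
    for (std::size_t j = 1; j < n; j *= 2) {
        const std::size_t step = 2 * j;    // hoisted: 2*j
        for (std::size_t m = 0; m < j; ++m) {
            const double t = theta0 * m / j;
            const std::complex<double> w(std::cos(t), std::sin(t));
            for (std::size_t i = m; i < n; i += step) {
                auto &lo = zs.at(i + j);   // bounds-checked once, reused below
                const auto zi = zs[i], wt = w * lo;
                zs[i] = zi + wt;
                lo = zi - wt;
            }
        }
    }
}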

I don't know how much of a difference (if any) all of this will have on your total runtime - some of it may already be caught by the optimiser, and the differences may be small compared to the computations involved - but worth a shot.

As for being idiomatic, my only real comment is that size() returns a std::size_t (usually a typedef for an unsigned integer type), and it's more idiomatic to use that type for the index. If you did want to use auto but avoid the warning you could try adding the ul suffix to the 0 - not sure I'd say that is idiomatic, though. I suppose you're already less than idiomatic in not using iterators here, but I can see why you can't do that (easily).
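
For illustration, the two index-type options look like this (the enclosing function is just scaffolding for the example):

#include <complex>
#include <cstddef>
#include <vector>

void index_types(std::vector<std::complex<double>> &zs) {
    // std::size_t matches what size() returns, so there is no
    // signed/unsigned comparison to warn about:
    for (std::size_t i = 0; i < zs.size() - 1; ++i) { /* ... */ }

    // The auto alternative: the ul suffix makes the literal unsigned.
    for (auto i = 0ul; i < zs.size() - 1; ++i) { /* ... */ }
}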

Update

I gave all my suggestions a try and they all had a measurable performance improvement - except the i+j and 2*j precalcs - they actually caused a slight slowdown! I presume they either prevented a compiler optimisation or prevented it from using registers for some things.

Overall I got a >10% perf. improvement with those suggestions. I suspect more could be had if the second block of loops was refactored a little to avoid the jumps - and having done so enabling SSE2 instruction set may give a significant boost (I did try it as is and saw a slight slowdown).

I think that refactoring, along with using something like MKL for the cos and sin calls, should give greater, and less brittle, improvements. And neither of those things would be language dependent (I know this was originally being compared to an F# implementation).

Update 2

I forgot to mention that pre-calculating zs.size() did make a difference.

Update 3

Also forgot to say (until reminded by @Xeo's comment on the OP) that the block following the i < j check can be boiled down to a std::swap. This is more idiomatic and at least as performant - in the worst case it should inline to the same code as written. Indeed, when I did it I saw no change in the performance. In other cases it can lead to a performance gain if move constructors are available.

philsquared