
I am processing point cloud data (150k points per cloud). For each (x,y) point I would like to compute the distance and azimuth to a reference point O:

for each point p in points
    dx = p.x - ox
    dy = p.y - oy
    d = hypot(dx, dy)
    az = atan2(dy, dx)

I have a manual SSE implementation. I was hoping to make the code clearer using eigen:

ArrayXf x(points.size()), y(points.size());
for(unsigned i=0; i<points.size(); ++i) {
    x[i] = points[i].x;
    y[i] = points[i].y;
}
const ArrayXf dx = x - ox, dy = y - oy;
const ArrayXf d = (dx.square() + dy.square()).sqrt();
// implement a polynomial approximation to atan (same as the SSE)

However, from my timing experiments this does not seem to vectorize at all: the time is the same as with the baseline implementation. And I know that SSE2 is enabled, since I'm compiling some SSE2 code in the same file.

However, according to the doc, Eigen does take advantage of SSE2 when supported (and AVX in 3.3). Is that only for vector and matrix operations?

EDIT: I studied the produced assembly code and it does contain some SSE instructions, but it's still slow.

EDIT: here is more timing information. I'm looping over 100 frames, about 150k points per frame.

  • naive implementation without atan2: 150ms
  • sse implementation (processing points 4 by 4 and discarding the last few that don't fill a full packet): 30ms
  • eigen implementation using an eigen map: 90ms (diff: 36ms, hypot: 16ms, index: 17ms)

here is my eigen code:

const Eigen::Map<const Eigen::ArrayXf, Eigen::Unaligned, Eigen::InnerStride<4> > px(&(points[0].x), points.size());
const Eigen::Map<const Eigen::ArrayXf, Eigen::Unaligned, Eigen::InnerStride<4> > py(&(points[0].y), points.size());

// difference with the origin (ox and oy are floats)
const Eigen::ArrayXf dx = px - ox, dy = py - oy;

// distance and index
const Eigen::ArrayXf d = sqrt(dx.square() + dy.square());

static const float r_res_mult = 1.0f / r_res; //2x faster than div
const Eigen::ArrayXi didx = (d * r_res_mult).cast<int>();
brice rebsamen
  • Do you need the `sqrt`? Could you just work with the squared distance instead? – Paul R Jul 07 '15 at 06:13
  • note that the question goes beyond this one problem: I'm studying options to take advantage of SIMD as painlessly as possible, which I will then apply to a lot of code. – brice rebsamen Jul 07 '15 at 18:30

2 Answers


Your main problem is that your data is not in a format friendly for SIMD. You're using an array of structs (xyxyxyxyxyxy...) and then to vectorize your code you do

for(unsigned i=0; i<points.size(); ++i) {
    x[i] = points[i].x;
    y[i] = points[i].y;
}

which converts it to a struct of arrays (xxxxxxxx....yyyyyyy...). This conversion is expensive.

A better solution is to store your points as a struct of arrays in the first place. An even better solution is to use a hybrid struct of arrays, aka an array of structs of arrays (AoSoA). For SSE, assuming you're using single-precision floats, you would then do xxxxyyyyxxxxyyyy....

Next I suggest you use a SIMD math library. Intel offers SVML, which is expensive and closed source. AMD offers LibM, which is free but closed source. Neither plays well on its competitor's hardware. The best SIMD library is Agner Fog's Vector Class Library (VCL). It's open source, free, and optimized to work on both Intel and AMD processors. Like Eigen, it's header-only, so you don't have to compile and link anything; you just include the header files. Here is how you would do it with SSE or AVX for float (the VCL will emulate AVX on a system without AVX).

//    g++ -O3 -Ivectorclass -msse4.2 foo.cpp
// or g++ -O3 -Ivectorclass -mavx foo.cpp
#include <vectorclass.h>
#include <vectormath_trig.h>

struct Point2DBlock {
    float x[8];
    float y[8];
};

int main(void) {
    const int nblocks = 10; //each block contains eight points
    Point2DBlock aosoa[nblocks]; //xxxxxxxxyyyyyyyy xxxxxxxxyyyyyyyy ...
    float ox = 0.0f, oy = 0.0f;
    Vec8f vox = ox, voy = oy;
    for(int i=0; i<nblocks; i++) {
        Vec8f dx = Vec8f().load(aosoa[i].x) - vox;
        Vec8f dy = Vec8f().load(aosoa[i].y) - voy;
        Vec8f d  = sqrt(dx*dx + dy*dy);
        Vec8f az = atan2(dy,dx);
    } 
}

If you really need `hypot`, you can build one from the VCL using the pseudo-code from Wikipedia:

static inline Vec8f hypot(Vec8f const &x, Vec8f const &y) {
    Vec8f ax = abs(x), ay = abs(y);
    Vec8f t  = min(ax, ay);
    ax = max(ax, ay);
    t  = t / ax;
    // guard against the 0/0 that occurs when both inputs are zero
    return select(ax == Vec8f(0.0f), Vec8f(0.0f), ax * sqrt(1.0f + t * t));
}

Edit:

Here is a method using an array of structs. This requires some shuffling, but that may be negligible compared to the other calculations. The VCL uses template metaprogramming to determine an efficient method for shuffling.

#include <vectorclass.h>
#include <vectormath_trig.h>

int main(void) {
    const int npoints=80;
    float points[2*npoints]; //xyxyxyxyxyxy...
    float ox = 0.0, oy = 0.0;
    Vec8f vox = ox, voy = oy;
    for(int i=0; i<2*npoints; i+=16) { // each iteration consumes 16 floats = 8 points
        Vec8f l1 = Vec8f().load(&points[i+0]);
        Vec8f l2 = Vec8f().load(&points[i+8]);
        Vec8f dx = blend8f<0, 2, 4, 6, 8, 10, 12, 14>(l1,l2) - vox;
        Vec8f dy = blend8f<1, 3, 5, 7, 9, 11, 13, 15>(l1,l2) - voy;
        Vec8f d  = sqrt(dx*dx + dy*dy);
        Vec8f az = atan2(dy,dx);
    } 
}
Z boson
  • unfortunately storing points as a struct of array is not an option. thanks for pointing to VCL, I've been using a similar one from U Texas, but this one seems better (and has AVX support). – brice rebsamen Jul 07 '15 at 18:18
  • @bricerebsamen, this is a common answer to not using a SoA (or AoSoA). Why is this not an option? If the answer is that it would be too difficult to change your code now then this is one reason you should be aware of this from the beginning so that you plan for it in your next project. – Z boson Jul 08 '15 at 08:53
  • In this case, if `x,y` pairs are really packed, you can load two registers of src data, and `vunpcklps` to get a register of `x`es, and `vunpckhps` to get a register of `y`s. Also, if overflow isn't a concern for your data, you can use the more straightforward formula for `hypot`, and avoid the FP division, which is slow enough to not be completely dwarfed by the `sqrt`. (div and sqrt have about the same throughput, actually, and compete for port 0.) – Peter Cordes Jul 08 '15 at 20:05
  • @PeterCordes, good idea using `vunpck`! I updated my answer based on your suggestion. But `vunpck` mixes the values in a different order (e.g. x0x4x1x5... instead of x0x1x2x3...). That may not be a problem. I provided a solution which keeps the order. Your comment about `hypot` though I think falls under my statement "if you really need hypot". – Z boson Jul 09 '15 at 08:00
  • @bricerebsamen, I updated my answer with a solution using an array of structs. – Z boson Jul 09 '15 at 08:04
  • @Zboson: right, 256b `unpck` does two in-lane unpacks, which is fine if you're going to repack with something like `packusdw`, which can put the lo and hi parts back into order. Otherwise it seems a shuffle is typically required to put elements in order. – Peter Cordes Jul 09 '15 at 08:56
  • @PeterCordes, yeah, the blending/permuting/lookup functions in the VCL are one of the few things I'm a bit sceptical of. It's impressive what Agner has done with template meta programming but nevertheless I'm not sure it's always so efficient (but I have not checked it carefully). In almost every other case though the VCL produces the exact instructions I want. – Z boson Jul 09 '15 at 09:15
  • @Zboson One thing that may or may not be relevant to OP, is the difference in licensing. Eigen (>= 3.1.1) is MPL2 (or LGPL3+ for older versions), whereas VCL is GPLv3. – Avi Ginsburg Jul 09 '15 at 11:44
  • @AviGinsburg, true, but Eigen and VCL serve different purposes. Eigen is a BLAS library whereas the VCL is a SIMD library. The VCL does not include functions for matrix multiplication for example. There is some overlap in functionality for both libraries but the VCL is much more low level. I would use it to build a BLAS library. – Z boson Jul 09 '15 at 13:57
  • @Zboson I didn't mean that they were comparable, just that OP might deem VCL a non-option. I know I have discarded libraries based on the license in the past (FFTW comes to mind) when relevant – Avi Ginsburg Jul 09 '15 at 14:44
  • @AviGinsburg, okay, point taken, I guess I should have mentioned the licensing of SVML, AMD LibM, and VCL in my answer. I honestly don't know what the others offer (I just know they are closed source). – Z boson Jul 09 '15 at 15:09
  • thanks this is very helpful. I'm still doing some experiments and trying to find out why the Eigen Array solution does not vectorize well. If I give up on Eigen as a SIMD wrapper, I will use VCL – brice rebsamen Jul 09 '15 at 19:50
  • @PeterCordes, I looked into `_mm256_unpacklo(hi)_ps` and I don't think they will work as you say. It gets converted to `dx = x0, x4, y0, y4, x2, x6, y2, y6` and `dy = x1, x5, y1, y5, x3, x7, y3, y7`. If you compare across indices you see that it won't work. E.g. `dx0 = x0` and `dy0 = x1`. That's clearly not what you want. If `dx0 = x0` then `dy0` needs to equal `y0`. They don't even contain only `x`es or `y`es as you claim. That would not necessarily be a problem if they at least matched with the correct x and y values, but they don't even do that. – Z boson Jul 10 '15 at 09:49
  • Hmm, yeah, they do the opposite operation. Given `x0 x1 x2 ...` and `y0 y1 y2 ...`, `unpck` would produce `x0 y0 x1 y1 x2 y2 ...` Silly me. It's really more of a *pack* than an *unpack*, IMO, even for the integer versions which describe it as combining two halves into a whole (for various element sizes). – Peter Cordes Jul 10 '15 at 09:57

The copy takes a lot of time, as long as or longer than the calculation itself. You don't have to copy the data like that; it's verbose and possibly slower. You can use a map instead, or even compute directly on the map. I wrote up a quick demo:

int sz = 15000000;
std::vector<Point> points(sz);

Eigen::Map<ArrayXd, Unaligned, InnerStride<2>> mapX(&(points[0].x), sz);
Eigen::Map<ArrayXd, Unaligned, InnerStride<2>> mapY(&(points[0].y), sz);

mapX = ArrayXd::Random(sz);
mapY = ArrayXd::Random(sz);

auto cpstart = std::chrono::high_resolution_clock::now();
ArrayXd x = mapX;
ArrayXd y = mapY;
ArrayXd d;
auto cpend = std::chrono::high_resolution_clock::now();

auto mpSumstart = std::chrono::high_resolution_clock::now();

d = (mapX.square() + mapY.square()).sqrt().eval();

auto mpSumend = std::chrono::high_resolution_clock::now();

std::cout << d.mean() << "\n";

auto arStart = std::chrono::high_resolution_clock::now();

d = (x.square() + y.square()).sqrt().eval();

auto arEnd = std::chrono::high_resolution_clock::now();

std::cout << d.mean() << "\n";

auto elapsed = cpend - cpstart;
std::cout << "Copy: " <<  elapsed.count() << '\n';
std::cout << "Map: " <<  (mpSumend - mpSumstart).count() << '\n';
std::cout << "Array: " <<  (arEnd - arStart).count() << '\n';

The times that I get are for 100 times the length of your array (I was too lazy to write a loop for a better test). The copy takes about 90ms on my system (VS2012, /Ox, Release, -DNDEBUG), the mapped version 185ms, and the copied arrays also about 90ms. The factor of roughly two makes sense for SIMD operations, as the mapped version skips every other double. If you had a struct of arrays instead of an array of structs, the performance of the map should be comparable to that of the copied arrays.

EDIT: I defined EIGEN_DONT_VECTORIZE and the copied array (almost) doubled its time, as expected. The map, however, stayed the same. Curious. It may have to do with the map being Unaligned, or simply with the fact that each packet only has room for two doubles and every other one belongs to the other map.

EDIT 2: A dumb thought hit me regarding the specific problem posed in the question. You can treat the x, y pairs as a std::complex<double>, and then the data loads as a single block without memory copies:

Eigen::Map<ArrayXcd> mapC((std::complex<double>*)(&(points[0].x)), sz);
//...
cd = mapC.cwiseAbs2().sqrt().eval();

The timing is only slightly longer than the pre-copied arrays on my computer. You can also subtract the origin as a complex number doing

cd = (mapC - std::complex<double>(ox, oy)).cwiseAbs2().sqrt().eval();
Avi Ginsburg
  • I tried the map. No more copy, which gains some time, but the next operation (diff with origin) is no longer vectorized (since the map is unaligned), which makes it slower. Overall it's a bit faster though. – brice rebsamen Jul 07 '15 at 18:14
  • timings: copy 35ms, diff 20ms. no copy: diff 36ms – brice rebsamen Jul 07 '15 at 18:28
  • Take advantage of Eigen's lazy evaluation and write it all in one line. I'm not sure, but I think that the `const Eigen::...` *may* be causing eager evaluation. Try `Eigen::ArrayXi didx = (((px-ox).square() + (py-oy).square()).sqrt() * r_res_mult).cast<int>();`. – Avi Ginsburg Jul 07 '15 at 18:48
  • it's about the same, and still much slower than the SSE equivalent. – brice rebsamen Jul 17 '15 at 19:55