I am processing point cloud data (150k points per cloud). I would like, for each (x,y) point, to compute the distance to a reference point O, and azimuth:
for each point p in points
dx = p.x - ox
dy = p.y - oy
d = hypot(dx, dy)
az = atan2(dy, dx)
I have a manual SSE implementation. I was hoping to make the code clearer using eigen:
ArrayXf x(points.size()), y(points.size());
for(unsigned i=0; i<points.size(); ++i) {
x[i] = points[i].x;
y[i] = points[i].y;
}
const ArrayXf d = (dx.square() + dy.square()).sqrt();
// implement a polynomial approximation to atan (same as the SSE)
However, from my timing experiments, this does not seem to vectorize at all as the time is the same as with the baseline implementation. And I know that SSE2 is enabled since I'm compiling some SSE2 code in the same file.
However, according to the doc, Eigen does take advantage of SSE2 when supported (and AVX in 3.3). Is it only for vectors and matrices operations?
EDIT: I studied the produced assembly code and it does contain some SSE instructions. But it's still slow
EDIT: here is more timing information. I'm looping over 100 frames, about 150k points per frame.
- naive implementation without atan2: 150ms
- sse implementation (processing points 4 by 4 and discarding the last few that don't fill a full packet): 30ms
- eigen implementation using an eigen map: 90ms (diff: 36ms, hypot: 16ms, index: 17ms)
here is my eigen code:
const Eigen::Map<const Eigen::ArrayXf, Eigen::Unaligned, Eigen::InnerStride<4> > px(&(points[0].x), points.size());
const Eigen::Map<const Eigen::ArrayXf, Eigen::Unaligned, Eigen::InnerStride<4> > py(&(points[0].y), points.size());
// difference with the origin (ox and oy are floats)
const Eigen::ArrayXf dx = px - ox, dy = py - oy;
// distance and index
const Eigen::ArrayXf d = sqrt(dx.square() + dy.square());
static const float r_res_mult = 1.0f / r_res; //2x faster than div
const Eigen::ArrayXi didx = (d * r_res_mult).cast<int>();