
I'm currently working with point clouds a lot, and I have implemented a segmentation algorithm that clusters points within a given maximum distance into segments.

To optimize this, I've given each segment an axis-aligned bounding box, so I can check whether a given point could possibly match a segment before looking closer, iterating over the points and calculating distances (I actually use an octree for this, to prune away the majority of the points).

I've run my program through gnuprof, and this is the result:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 52.42      5.14     5.14 208995661     0.00     0.00  otree_node_out_of_bounds
 19.60      7.06     1.92 189594292     0.00     0.00  otree_has_point_in_range
 11.33      8.17     1.11   405834     0.00     0.00  otree_node_has_point_in_range
  9.29      9.08     0.91   352273     0.00     0.00  find_matching_segments
 [...]

As you can see, the majority of the computation time is spent in `otree_node_out_of_bounds`, which is implemented as follows:

int otree_node_out_of_bounds(struct otree_node *t, void *p)
{
    vec3 *_p = p;
    return (_p->x < t->_llf[0] - SEGMENTATION_DIST 
        || _p->x > t->_urb[0] + SEGMENTATION_DIST
        || _p->y < t->_llf[1] - SEGMENTATION_DIST 
        || _p->y > t->_urb[1] + SEGMENTATION_DIST
        || _p->z < t->_llf[2] - SEGMENTATION_DIST 
        || _p->z > t->_urb[2] + SEGMENTATION_DIST);
}

where `SEGMENTATION_DIST` is a compile-time constant, to allow gcc to do some constant folding. `_llf` and `_urb` are of type `float[3]` and represent the bounding box of the octree node.
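For reference, here is a minimal self-contained sketch of the types involved, plus a hypothetical variant that precomputes the inflated bounds once per node (the `_llf_pad`/`_urb_pad` fields and the `_pad` function names are made up for illustration, they are not in my code). The idea is that constant folding only removes the constant itself; the six runtime additions against `_llf`/`_urb` remain, and storing pre-widened bounds eliminates them from the hot path:

```c
#define SEGMENTATION_DIST 0.5f  /* placeholder value for illustration */

typedef struct { float x, y, z; } vec3;

/* _llf_pad/_urb_pad are hypothetical extra fields holding the
 * bounds already widened by SEGMENTATION_DIST */
struct otree_node {
    float _llf[3], _urb[3];
    float _llf_pad[3], _urb_pad[3];
};

/* Run once when the node's bounds are set, not per query. */
void otree_node_precompute_pad(struct otree_node *t)
{
    for (int i = 0; i < 3; i++) {
        t->_llf_pad[i] = t->_llf[i] - SEGMENTATION_DIST;
        t->_urb_pad[i] = t->_urb[i] + SEGMENTATION_DIST;
    }
}

/* Same test as before, but without the six additions per call. */
int otree_node_out_of_bounds_pad(const struct otree_node *t, const vec3 *p)
{
    return p->x < t->_llf_pad[0] || p->x > t->_urb_pad[0]
        || p->y < t->_llf_pad[1] || p->y > t->_urb_pad[1]
        || p->z < t->_llf_pad[2] || p->z > t->_urb_pad[2];
}
```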

So, my question basically is: is it possible to do some sneaky optimization on this function? Or, more generally, is there a more efficient way to do bounds checking on AABBs? To phrase it differently still: can I speed up the comparison somehow with some C/gcc magic?

If you need more information to answer this question, please let me know :)

Thanks, Andy.

Andreas Grapentin

2 Answers


This is a tiny leaf function that is called a huge number of times. Profiling results tend to over-represent the cost of such functions because the overhead of measuring each call is large relative to the cost of the function itself. With normal optimization, the cost of the entire operation (at the level of the outer loops that ultimately invoke this test) will be a lower percentage of the overall runtime. You may be able to observe this by getting that function to inline with profiling enabled (e.g. with `__attribute__((__always_inline__))`).
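A minimal sketch of what that attribute looks like in GCC (the function body here is a stand-in for illustration, not your actual test):

```c
/* With __always_inline__, GCC inlines the function even at -O0 or
 * with -pg instrumentation, so it no longer shows up as a call in
 * the profile. */
static inline __attribute__((__always_inline__))
int in_range(float v, float lo, float hi)
{
    return v >= lo && v <= hi;
}
```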

Your function looks fine as written. I doubt you could optimize an individual test like that further than you have (or if you could, it would not be dramatic). If you want to optimize the whole operation you need to do it at a higher level:

  • You could try another structure (e.g. kd-tree instead of octree) or an entirely new algorithm that takes advantage of some pattern in your data.
  • You could invert the loop from "for each point check otrees" to "for each otree check points", which lets you re-use bounds data over and over.
  • You can ensure you're accessing data (points, probably) in the most efficient way (i.e. sequentially rather than randomly jumping around).
  • With a restructured loop you could use SSE to execute multiple bounds tests in a single instruction (with no branching!).
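As a rough illustration of that last point, a branch-free single-box test with SSE intrinsics might look like the sketch below. Your struct layout and constant are assumed here, and in practice the real win comes from restructuring the data so you test four boxes (or four points) per instruction rather than one:

```c
#include <float.h>
#include <xmmintrin.h>

#define SEGMENTATION_DIST 0.5f  /* placeholder value for illustration */

typedef struct { float x, y, z; } vec3;

struct otree_node {
    float _llf[3], _urb[3];
};

int otree_node_out_of_bounds_sse(const struct otree_node *t, const vec3 *p)
{
    /* lanes 0..2 hold x/y/z; lane 3 is padded so it never tests out */
    __m128 pt = _mm_set_ps(0.0f, p->z, p->y, p->x);
    __m128 lo = _mm_set_ps(-FLT_MAX,
                           t->_llf[2] - SEGMENTATION_DIST,
                           t->_llf[1] - SEGMENTATION_DIST,
                           t->_llf[0] - SEGMENTATION_DIST);
    __m128 hi = _mm_set_ps(FLT_MAX,
                           t->_urb[2] + SEGMENTATION_DIST,
                           t->_urb[1] + SEGMENTATION_DIST,
                           t->_urb[0] + SEGMENTATION_DIST);
    /* a lane is all-ones where the point lies outside [lo, hi];
     * movemask collapses the lanes to a bitmask with no branches */
    __m128 out = _mm_or_ps(_mm_cmplt_ps(pt, lo), _mm_cmpgt_ps(pt, hi));
    return _mm_movemask_ps(out) != 0;
}
```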
Ben Jackson
  • Thanks for your suggestions. I've got a few notes to each of them: 1.) I thought about using kdtrees, in fact, I just failed at implementing them :) 2.) we're talking about potentially billions of points here, so I doubt reversing the loop will be faster. 3.) I do that, I guess 4.) I didn't get that, but it sounds interesting. can you provide more details? Sidenote: I don't think the profiler over-represents the cost of this function, as **over 50%** of my entire runtime is spent there :) – Andreas Grapentin Feb 04 '13 at 07:40
  • I've inlined and reprofiled the program as you suggested, and the performance impact of the function is indeed reduced, as could be expected because of the removed function calls and returns. However, there's still a considerable amount of time spent in that function. – Andreas Grapentin Feb 04 '13 at 07:48
  • @AndreasGrapentin: As I said in my answer, the most effective way to speed up your result is going to be to avoid as many of these "in bounds" comparisons as you can by using another data structure. Failing that, reorganizing the loop can have a huge effect on performance. Without a complete problem description (for algorithm suggestions) or code for the hierarchy above your function (for loop structure and SSE suggestions) it's hard to be more specific. – Ben Jackson Feb 04 '13 at 08:04
  • @AndreasGrapentin: Also, I've been *exactly* where you are with a problem where I hand-optimized an inner test and got a 10% speedup, and then realized the algorithm calling it was accidentally O(n) instead of O(log n) due to mistaken assumptions. When I fixed that, the entire program (including lots of other operations) finished *27 times* faster. – Ben Jackson Feb 04 '13 at 08:06
  • I've been thinking long and hard about the algorithm in question, and the current state of the program is indeed much faster than my initial approach, But I think I'm hitting a wall there. I can't think of an intuitive approach to improving the general computation. However, I appreciate your pointers and will look into the things you suggested. Thanks again :) – Andreas Grapentin Feb 04 '13 at 08:13
  • Also, I'll reward you the bounty as soon as it's unlocked. – Andreas Grapentin Feb 04 '13 at 08:15

It looks good to me. The only micro-optimisation I can think of is declaring `_p` as `static`.

Vorsprung
  • Thanks for the suggestion. However, when I declare _p as `static vec3 *_p`, gcc tells me `otree.c: In function 'otree_node_out_of_bounds': otree.c:74:2: error: initializer element is not constant` – Andreas Grapentin Jan 29 '13 at 09:45
  • Sorry, my bad. your suggestion works as expected, I'll check the profiler :) – Andreas Grapentin Jan 29 '13 at 09:50