
In my CPU ray tracer (well, path tracer), the majority of CPU time is spent in the BVH traversal function. According to my profiler, 75% of raytracing time is spent in this function and the functions it calls, with 35% of time in the function itself. The other 40% is in the various intersection tests it calls.

Basically, the code does a DFS traversal through all the bounding boxes and triangles the ray intersects. It uses a statically allocated array on the stack to hold the nodes to be explored (BVHSTACKSIZE is set to 32; it has never needed most of that space) so that no memory is dynamically allocated. However, it seems crazy to me that 35% of the time is spent here. I've spent a while optimizing the code and it's currently the fastest I've been able to make it, but it's still the largest single bottleneck in my program.

Does anyone have tips for optimizing this even more? I already have a decent BVH construction algorithm so I don't think I'd get any speedup by using a different BVH. Does anyone have tips on how to best do line-by-line profiling on a Mac?

For reference, on an example scene this code takes anywhere from under 1 microsecond to 40 microseconds per call depending on the number of intersections, and the while loop runs for 1 to ~400 iterations (also depending on the number of intersections).

Thanks!

bool BVHAccel::intersect(Ray& ray) const {
  bool hit = false;

  BVHNode* to_intersect[BVHSTACKSIZE];
  int head = 0;
  to_intersect[head++] = root;

  while (head != 0) {
    assert(head < BVHSTACKSIZE);
    BVHNode* cur = to_intersect[--head];

    if (cur->bb.intersect(ray)) { // Does not modify the ray
      if (cur->isLeaf()) {
        for (const auto& primitive : cur->primitives) {
          hit |= primitive->intersect(ray); // Modifies the ray!
        }
      } else {
        to_intersect[head++] = cur->r;
        to_intersect[head++] = cur->l;
      }
    }
  }

  return hit;
}

bool BBox::intersect(const Ray& r) const {
  double txmin = (min.x - r.o.x) * r.inv_d.x;
  double txmax = (max.x - r.o.x) * r.inv_d.x;
  double tymin = (min.y - r.o.y) * r.inv_d.y;
  double tymax = (max.y - r.o.y) * r.inv_d.y;
  double tzmin = (min.z - r.o.z) * r.inv_d.z;
  double tzmax = (max.z - r.o.z) * r.inv_d.z;

  ascending(txmin, txmax);
  ascending(tymin, tymax);
  ascending(tzmin, tzmax);

  double t0 = std::max(txmin, std::max(tymin, tzmin));
  double t1 = std::min(txmax, std::min(tymax, tzmax));

  if (t1 < t0 || t0 > r.max_t || t1 < r.min_t) {
    return false;
  }

  return true;
}

void ascending(double& a, double& b) {
  if (a > b) {
    std::swap(a, b);
  }
}
Leo Adberg
  • Can you post a complete, running piece of code? I can gather what your code does (more or less), but it's considerably easier to tinker with something we can copy-paste and run. – Nelewout Apr 20 '20 at 20:16
  • I'll expand the example a bit but unfortunately I can't post a self-contained example as I didn't write a lot of the surrounding code and can't release it. – Leo Adberg Apr 20 '20 at 20:17
  • But think of the BVH as a binary search tree that it traverses depth-first. The tree is around 10 deep and each tree node traversed requires a BBox intersection while leaf nodes require a specialized intersection (e.g. triangle, sphere). The vast majority (~70%) of time is spent in the DFS traversal and BBox intersections, with DFS traversal taking the largest amount of time. – Leo Adberg Apr 20 '20 at 20:44
  • I'll follow this question and think about it a bit more tomorrow. Thanks for the clarification on your intersect methods! – Nelewout Apr 20 '20 at 21:07
  • No problem, thanks for helping! – Leo Adberg Apr 20 '20 at 21:08
  • Does your primitive class have a bounding sphere? You should be able to do a quick check of your ray with the sphere before doing the bounding rectangle check. – 1201ProgramAlarm Apr 21 '20 at 15:04

3 Answers


There seems to be at least one problem with your code: making a copy of a primitive could possibly be an expensive operation.

bool BVHAccel::intersect(Ray ray) const {
  bool hit = false;

  BVHNode* to_intersect[BVHSTACKSIZE];
  int head = 0;
  to_intersect[head++] = root;

  while (head != 0) {
    assert(head < BVHSTACKSIZE);
    BVHNode* cur = to_intersect[--head];

    if (cur->bb.intersect(ray)) { // Does not modify the ray
      if (cur->isLeaf()) {
        for (const auto& primitive : cur->primitives) { // this code made a copy of primitives on every call!
          hit |= primitive->intersect(ray); // Modifies the ray!
        }
      } else {
        to_intersect[head++] = cur->r;
        to_intersect[head++] = cur->l;
      }
    }
  }

  return hit;
}

Why does the ray need to be modified? If the caller does not need the changes, you could pass it by value, as in the signature above.
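To illustrate that point: if the mutation is only there to track the closest hit, one option is to keep the Ray const during traversal and carry the hit in a separate record. This is a sketch, not the OP's API; `HitRecord` and `recordHit` are hypothetical names.

```cpp
#include <limits>

// Sketch: track the closest hit outside the ray, so the Ray can stay const
// during traversal. HitRecord and recordHit are illustrative names only.
struct HitRecord {
    double t    = std::numeric_limits<double>::infinity(); // closest hit distance so far
    int    prim = -1;                                      // index of the primitive hit, -1 if none
    bool   valid() const { return prim >= 0; }
};

// Each primitive test reports its candidate distance here instead of
// shortening ray.max_t in place.
void recordHit(HitRecord& rec, double t, int prim) {
    if (t < rec.t) { rec.t = t; rec.prim = prim; }
}
```
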

Edit 1: Can we assume that BVHNode looks like this?

constexpr auto BVHSTACKSIZE = 32;

struct Primitive;

struct BVHNode {
    std::vector<Primitive> primitives;
    AABB        bb;   
    BVHNode*    r = nullptr;
    BVHNode*    l = nullptr;

    bool isLeaf() const { return r == nullptr && l == nullptr; }
};
Remo
  • You are totally right on both counts and it turns out I messed up when trying to make my code more readable on SO (in reality it uses an existing iterator to iterate over primitives). However, in the real code it doesn't make any copies. Thanks for finding that though! – Leo Adberg Apr 20 '20 at 23:30

I see three improvements you could make.

The first big problem (and a hard one) is that you have many conditional branches in your code, which will surely slow down your CPU as it cannot predict the code path well. For example, I see that you intersect first and then test whether the node is a leaf before intersecting all the primitives. Could you test first whether it's a leaf and then do the appropriate intersection? That would slightly reduce branching.

Secondly, what is the memory layout of your BVH? Could you optimize it to make it friendly to your traversal? You could look at the number of cache misses that happen during traversal, which is a good indication of whether your memory layout is sound. Although not directly related, it is good to know your platform and the underlying hardware. I recommend reading this.
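To illustrate the layout idea, here is a minimal sketch of a pooled, index-based node: all nodes live in one contiguous vector and children are referenced by index instead of pointer. The field names are assumptions for the sketch, not the OP's actual `BVHNode`.

```cpp
#include <cstddef>
#include <vector>

// Sketch: nodes stored in one contiguous pool, children referenced by index.
// Field names are illustrative, not the OP's actual layout.
struct FlatBVHNode {
    float bb_min[3] = {}, bb_max[3] = {}; // bounding box kept inline for cache locality
    int   left  = -1;                     // index of left child in the pool, -1 if leaf
    int   right = -1;                     // index of right child in the pool, -1 if leaf
    int   first_prim = 0;                 // primitive range for leaves
    int   prim_count = 0;
    bool  isLeaf() const { return left < 0 && right < 0; }
};

struct FlatBVH {
    std::vector<FlatBVHNode> nodes; // node 0 is the root; one allocation, no pointer chasing

    int addNode(const FlatBVHNode& n) {
        nodes.push_back(n);
        return static_cast<int>(nodes.size()) - 1;
    }
};
```

Traversal then indexes into `nodes` instead of dereferencing individually malloced children, which tends to be kinder to the cache.
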

Finally, and this is where you're going to have the biggest impact on performance: use SSE/AVX! With a bit of refactoring in your intersection code, you could intersect four bounding boxes at once and get a nice boost in your application. You could have a look at what embree (the Intel ray tracer) does, especially in its math library.

Also, I just saw that you are using double. Is there a reason for that? Our path tracer doesn't use double at all, as there aren't really any cases where you need that kind of precision for rendering.

Hope it helps!

EDIT: I made an SSE version of your bbox intersection if you want to try it out. It is partially based on our code, but I'm not really sure it's going to work; you should benchmark and test it!

#include <xmmintrin.h>
#include <emmintrin.h>
#include <smmintrin.h>

#include <cmath>
#include <limits>

constexpr float pos_inf = std::numeric_limits<float>::max();
constexpr float neg_inf = std::numeric_limits<float>::lowest(); // lowest(), not min(): min() is the smallest positive float

size_t __bsf(size_t v)
{
  size_t r = 0; asm ("bsf %1,%0" : "=r"(r) : "r"(v));
  return r;
}

__m128 mini(const __m128 a, const __m128 b)
{
  return _mm_castsi128_ps(_mm_min_epi32(_mm_castps_si128(a),_mm_castps_si128(b)));
}

__m128 maxi(const __m128 a, const __m128 b)
{
  return _mm_castsi128_ps(_mm_max_epi32(_mm_castps_si128(a),_mm_castps_si128(b)));
}

__m128 abs(const __m128 a)
{
  return _mm_andnot_ps(_mm_set1_ps(-0.0f), a);
}

__m128 select(const __m128 mask, const __m128 t, const __m128 f)
{ 
  return _mm_blendv_ps(f, t, mask); 
}

template<size_t i0, size_t i1, size_t i2, size_t i3>
__m128 shuffle(const __m128 b)
{
  return _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(b), _MM_SHUFFLE(i3, i2, i1, i0)));
}

__m128 min(const __m128 a, const __m128 b) { return _mm_min_ps(a, b); }

__m128 max(const __m128 a, const __m128 b) { return _mm_max_ps(a, b); }

__m128 vreduce_min(const __m128 v)
{ 
  __m128 h = min(shuffle<1,0,3,2>(v),v);
  return min(shuffle<2,3,0,1>(h),h);
}

__m128 vreduce_max(const __m128 v)
{ 
  __m128 h = max(shuffle<1,0,3,2>(v),v);
  return max(shuffle<2,3,0,1>(h),h);
}

size_t select_min(const __m128 valid, const __m128 v)
{
  const __m128 a = select(valid, v, _mm_set_ps1(pos_inf));
  return __bsf(_mm_movemask_ps(_mm_and_ps(valid, _mm_cmpeq_ps(a, vreduce_min(a)))));
}

size_t select_max(const __m128 valid, const __m128 v)
{ 
  const __m128 a = select(valid, v, _mm_set_ps1(neg_inf));
  return __bsf(_mm_movemask_ps(_mm_and_ps(valid, _mm_cmpeq_ps(a, vreduce_max(a)))));
}

struct Ray
{
  vec3 o, inv_d;

  float min_t, max_t;
};

struct BBox
{
  vec3 min, max;

  bool intersect(const Ray& r) const;
};

bool BBox::intersect(const Ray& r) const
{
  const __m128 lowerSlab = _mm_mul_ps(_mm_sub_ps(max.m128, r.o.m128), r.inv_d.m128);
  const __m128 upperSlab = _mm_mul_ps(_mm_sub_ps(min.m128, r.o.m128), r.inv_d.m128);

  __m128 tmin = mini(lowerSlab, upperSlab);
  __m128 tmax = maxi(lowerSlab, upperSlab);

  reinterpret_cast<float*>(&tmin)[3] = r.min_t;
  reinterpret_cast<float*>(&tmax)[3] = r.max_t;

  // comparing a value with itself is false only for NaN lanes
  const __m128 maskmin = _mm_cmpeq_ps(tmin, tmin);
  const __m128 maskmax = _mm_cmpeq_ps(tmax, tmax);

  const float tNear = std::fabs(reinterpret_cast<const float*>(&tmin)[select_max(maskmin, tmin)]); // select the max non-NaN value and ensure the result is positive
  const float tFar  =           reinterpret_cast<const float*>(&tmax)[select_min(maskmax, tmax)];  // select the min non-NaN value
  return tNear <= tFar;
}
Thomas Caissard
  • Thanks! These are good suggestions. I don't really see how to reduce the number of branches, and for isLeaf() I still have to do the intersection either way, so I don't see what I would change (also, it's false on most iterations so it'll probably be predicted correctly). The memory layout of the BVH is a bit random: currently each node is malloced on its own, but I'll experiment with storing them in an array. Lastly, I've been told that doubles are as fast as floats (and only slower in regard to memory bandwidth, of which there isn't much here), but I'll try and check it myself. – Leo Adberg Apr 25 '20 at 01:13
  • 1
    Sticking all the nodes in a contiguous array gave around a 2% speedup on average over a bunch of runs, thanks! – Leo Adberg Apr 25 '20 at 02:23
  • You should definitely keep your nodes in an array or use some kind of memory pool; this will indeed improve your performance w.r.t. the cache lines. Regarding the speed of float versus double, it's really a tricky question and depends on what you are doing. If you are missing a lot of L1 cache then double will cost you more than float, as the hardware has to load more data into the registers. If you decide to use SSE then float is a real advantage, because the CPU can crunch 4 operations for float where you only get 2 for double. – Thomas Caissard Apr 25 '20 at 06:36
  • Also this makes me think: are you using any mathematical function from the std into your intersection code? If so, you should really consider changing them to less robust but faster version of them (I'm mostly thinking of exp and sqrt). – Thomas Caissard Apr 25 '20 at 06:39
  • By far the most common intersection is the BBox intersection, which really only uses *, -, +, <, >. The other intersection functions (triangle and sphere) do use more complicated operations (division, sqrt) but only make up 7% of execution time combined, so I haven't bothered to optimize them yet. – Leo Adberg Apr 25 '20 at 06:59
  • Wow, thanks for the SSE code! I should also mention that after implementing CygnusX1's algorithmic suggestion it was easier to implement your suggestion of testing multiple bounding boxes at once. I wrote code to do two at once (both the left and right children) and clang auto-vectorized it for me: https://pastebin.com/DKsQ4tFH – Leo Adberg Apr 25 '20 at 19:53

I think there are algorithmic improvements that you can make, before you start thinking about your hardware and squeezing every byte and every cycle out of it.

What I believe you are interested in is the first hit along your ray. You are not doing any accumulation from multiple hits along the ray, right? (as if primitives were half-transparent or something). Yes?

So, if the above is true, I see a problem with your traversal order. If I am reading your code correctly, you are unconditionally traversing cur->l first, and then cur->r. This is fine if your ray hits the left box first. But it could be coming from the other side of your scene; then you are performing way more intersections than needed, essentially traversing back to front along your ray.

Instead, you want to traverse front to back along your ray, hoping that you hit something early, which lets you skip most of the later traversal tests.

Fortunately, the distance to your bounding box is already computed in your intersection function, and you can easily check which box is intersected first. The function just needs to return a float distance to the intersection rather than a bool (returning, for example, +infinity on failure). And you need to call it before pushing things onto your to_intersect stack.

So, in pseudocode, I would traverse like this:

while stack not empty:
    cur = pop from top of stack;
    //we already know that we want to enter this node!
    if cur is leaf:
        intersect primitives
    else:
        t_left = intersect bbox of cur->l
        t_right = intersect bbox of cur->r
        if both intersected:
            if t_left < t_right:
                push cur->r, cur->l in that order (so that cur->l will be on top)
            else:
                push cur->l, cur->r in that order (so that cur->r will be on top)
        else if one intersected:
            push only that one
        else:
            push nothing

Once you implement the above, you will probably notice there is one more improvement to be made. The condition "we already know that we want to enter this node" is not necessarily true if you hit some primitive between the push and the pop and your ray.max_t was reduced. To account for that, you may want to store a pair (t_enter, node) in your stack. Then, when you pop, you re-check t_enter against your current ray.max_t. This may save you up to h traversal checks, where h is the height of your tree.

Possible pitfall (you probably already know this, but just in case): just because you traverse front to back does not mean you can terminate the traversal as soon as you find your first hit. BVH nodes can overlap. That's why the comparison between t_enter and ray.max_t is the way to go.
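Putting the pseudocode and the (t_enter, node) refinement together, here is a compact self-contained sketch. All types are simplified stand-ins for the OP's, and the "primitive" is a hypothetical box whose entry distance shortens max_t, just so the example runs on its own; only the traversal structure is the point.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

constexpr double INF = std::numeric_limits<double>::infinity();

struct Vec3 { double x, y, z; };
struct Ray  { Vec3 o, inv_d; double min_t, max_t; };

// Slab test returning the entry distance t0, or +infinity on a miss.
struct BBox {
    Vec3 min, max;
    double intersectT(const Ray& r) const {
        double tx0 = (min.x - r.o.x) * r.inv_d.x, tx1 = (max.x - r.o.x) * r.inv_d.x;
        double ty0 = (min.y - r.o.y) * r.inv_d.y, ty1 = (max.y - r.o.y) * r.inv_d.y;
        double tz0 = (min.z - r.o.z) * r.inv_d.z, tz1 = (max.z - r.o.z) * r.inv_d.z;
        if (tx0 > tx1) std::swap(tx0, tx1);
        if (ty0 > ty1) std::swap(ty0, ty1);
        if (tz0 > tz1) std::swap(tz0, tz1);
        const double t0 = std::max(tx0, std::max(ty0, tz0));
        const double t1 = std::min(tx1, std::min(ty1, tz1));
        if (t1 < t0 || t0 > r.max_t || t1 < r.min_t) return INF;
        return t0;
    }
};

// Hypothetical primitive: a box whose entry distance becomes the new max_t.
struct Primitive {
    BBox bb;
    bool intersect(Ray& r) const {
        const double t = bb.intersectT(r);
        if (t == INF) return false;
        r.max_t = t; // record the closest hit so far
        return true;
    }
};

struct BVHNode {
    BBox bb;
    BVHNode *l = nullptr, *r = nullptr;
    std::vector<int> primitives; // indices into the primitive array
    bool isLeaf() const { return !l && !r; }
};

constexpr int BVHSTACKSIZE = 64;

bool intersect(const BVHNode* root, const std::vector<Primitive>& prims, Ray& ray) {
    // Two parallel stacks: nodes, and the entry distance recorded at push time.
    const BVHNode* node_stack[BVHSTACKSIZE];
    double t_stack[BVHSTACKSIZE];
    int head = 0;

    const double t_root = root->bb.intersectT(ray);
    if (t_root == INF) return false;
    node_stack[head] = root; t_stack[head++] = t_root;

    bool hit = false;
    while (head != 0) {
        --head;
        const BVHNode* cur = node_stack[head];
        if (t_stack[head] > ray.max_t) continue; // a closer hit appeared since the push

        if (cur->isLeaf()) {
            for (int i : cur->primitives) hit |= prims[i].intersect(ray);
        } else {
            const double tl = cur->l->bb.intersectT(ray);
            const double tr = cur->r->bb.intersectT(ray);
            const BVHNode* near_c = tl <= tr ? cur->l : cur->r;
            const BVHNode* far_c  = tl <= tr ? cur->r : cur->l;
            const double t_near = std::min(tl, tr), t_far = std::max(tl, tr);
            // Push the far child first so the near child is popped first.
            if (t_far  != INF) { node_stack[head] = far_c;  t_stack[head++] = t_far;  }
            if (t_near != INF) { node_stack[head] = near_c; t_stack[head++] = t_near; }
        }
    }
    return hit;
}
```

The OP's follow-up comment about using two stacks instead of a stack of pairs is reflected here: `node_stack` and `t_stack` are kept separate so no pair construction happens on push.
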

CygnusX1
  • This is a great idea I didn't think of! I'll try it later and let you know the speedup. – Leo Adberg Apr 24 '20 at 00:15
  • Just finished testing it and it gives a ~15% speedup! Thanks, and you'll get the bounty unless someone else has something better up their sleeve. – Leo Adberg Apr 24 '20 at 08:39
  • Frankly I was hoping for more than 15%, but I guess this depends heavily on the scene. – CygnusX1 Apr 24 '20 at 12:41
  • Yep, the scenes are relatively dense and most bounding boxes overlap a bit. – Leo Adberg Apr 24 '20 at 19:17
  • Also, 15% is still noteworthy! I'd already done a bunch of optimizations beforehand that were each smaller than 15%. – Leo Adberg Apr 25 '20 at 01:14
  • For whatever reason pair initialization was taking a significant portion of the time in the function. Since I didn't need any initial values, I just switched it to two stacks (one for nodes, one for distances) and it was even faster! – Leo Adberg Apr 25 '20 at 20:16