Parallel octree construction on the CPU

Question

I did a simple implementation of an octree. Now I'm trying to make the construction of the tree parallel on the CPU.

First I tried to make the step of adding points to the children of the tree parallel (being the most costly step in the construction), but due to having to lock the vector/list each time I add a point it didn't gain any performance benefits.

Now I'm trying to make the construction of each node in the tree parallel. The idea is simple and should be straight forward as there is no intersection between the nodes. I simply need to assign each thread a node to work on. The issue is that it is a nested top-down implementation so I'm not sure what is the best way to implement this.

I'm using C++ and OpenMP. I tried writing this inside the build function:

 omp_set_nested(1);
#pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < child_count; i++) {
        _child[i]->build(threshold, maximumDepth, currentDepth + 1);
    }

But the performance became way worse than the sequential one.

Then I tried to parallelize just the top 8 nodes (of the root node). This gave me a performance gain of X2-X3. However it depends heavily on the scene. if my scene is way too unbalanced the parallelism will have very few benefits as 7 of the top 8 nodes could be almost empty and one node has all the other points.

Any thoughts on how to do this correctly?

Without seeing more code it is hard to tell. It seems that nested parallelism is not advantageous for you. It may be worth trying `tasks` instead of top node `parallel for` to have more bits of work. Also use `if` or `final` clause to avoid oversubscription. — Laci, Jun 29 '21 at 06:16

Laci · Accepted Answer · 2021-06-29T07:00:21.253

0

Use tasks instead of nested parallelism. Try this code:

#pragma omp parallel
#pragma omp single nowait
{       
            \\ first call to build function       
}

Inside build function use the following code. Note that currentDepth> XXX is a condition to stop creating more tasks.

for (int i = 0; i < child_count; i++) 
{
   #pragma omp task final( currentDepth> XXX ) mergeable \
   default(none) firstprivate(threshold, maximumDepth, currentDepth)
     _child[i]->build(threshold, maximumDepth, currentDepth + 1);
}

If child_count > 1 the following code may be faster:

for (int i = 0; i < child_count-1; i++) 
{
   #pragma omp task final( currentDepth> XXX ) mergeable \
   default(none) firstprivate(threshold, maximumDepth, currentDepth)
     _child[i]->build(threshold, maximumDepth, currentDepth + 1);
}
_child[child_count-1]->build(threshold, maximumDepth, currentDepth + 1);

edited Jun 29 '21 at 07:00

answered Jun 29 '21 at 06:49

Laci

2,738
1
13
22

Thank you for your answer and sorry for being late. Turns out the task feature in OpenMP is not in OMP 2.0 which is the latest version supported by VS2019. My work is tied up to Cmake and other stuff so I don't want to switch software right now. Do you think there is an equivelant way to do this with TBB that is also as simple as this one? I know there are tasks there but I'm not sure if they would work with a nested design like the one I have. – Maher Raess Jul 04 '21 at 19:02
I do not have experience with TBB. How about integrating Intel compiler in VS2019? It has OpenMP 5.0 support. Another alternative is /openmp:llvm option, but it is experimental only at the moment, but has some support for tasks. – Laci Jul 04 '21 at 19:31
looks like parallel_for in tbb fully supports nested parallelism so simply writing: tbb::parallel_for(0, child_count, [&](int i) instead of the original for did the trick and gave me very good results! Thanks for the suggestions tho :) – Maher Raess Jul 04 '21 at 19:46

Parallel octree construction on the CPU

1 Answers1