
I am generating class objects and putting them into a std::vector. Before adding a new one, I need to check whether it intersects with the already generated objects. As I plan to have millions of them, I need to parallelize this function, because it takes a lot of time (the function must check each new object against all previously generated ones).

Unfortunately, the speed increase is not significant. The profiler also shows very low efficiency (all overhead). Any advice would be appreciated.

bool
Generator::_check_cube (std::vector<Cube> &cubes, const Cube &cube)
{
    auto ptr_cube = &cube;
    auto npol = cubes.size();
    auto ptr_cubes = cubes.data();

    const auto nthreads = omp_get_max_threads();

    bool check = false;

#pragma omp parallel shared (ptr_cube, ptr_cubes, npol, check)
    {
#pragma omp single nowait
        {
            const auto batch_size = npol / nthreads;
            for (int32_t i = 0; i < nthreads; i++)
            {
                const auto bstart = batch_size * i;
                const auto bend = ((bstart + batch_size) > npol) ? npol  : bstart + batch_size;

#pragma omp task firstprivate(i, bstart, bend) shared (check)
                {
                    struct bd bd1{}, bd2{};
                    bd1 = allocate_bd();
                    bd2 = allocate_bd();

                    for (auto j = bstart; j < bend; j++)
                    {
                        bool loc_check;
#pragma omp atomic read
                        loc_check = check;
                        if (loc_check) break;

                        if (ptr_cube->cube_intersecting(ptr_cubes[j], &bd1, &bd2))
                        {
#pragma omp atomic write
                            check = true;
                            break;
                        }
                    }
                    free_bd(&bd1);
                    free_bd(&bd2);
                }
            }
        }
    }
    return check;
}

UPDATE: The Cube is actually made of smaller objects, Cuboids; each of them has a size (L, W, H), position coordinates, and a rotation. The intersect function:

bool
Cube::cube_intersecting(Cube &other, struct bd *bd1, struct bd *bd2) const
{
    const auto nom = number_of_cuboids();
    const auto onom = other.number_of_cuboids();

    for (int32_t i = 0; i < nom; i++)
    {
        get_mcoord(i, bd1);

        for (int32_t j = 0; j < onom; j++)
        {
            other.get_mcoord(j, bd2);
            if (check_gjk_intersection(bd1, bd2))
            {
                return true;
            }
        }
    }
    return false;
}

//get_mcoord calculates vertices of the cuboids

void
Cube::get_mcoord(int32_t index, struct bd *bd) const
{
    for (int32_t i = 0; i < 8; i++)
    {
        for (int32_t j = 0; j < 3; j++)
        {
            bd->coord[i][j] = _cuboids[index].get_coord(i)[j];
        }
    }
}

inline struct bd
allocate_bd()
{
    struct bd bd{};

    bd.numpoints = 8;

    bd.coord = (double **) malloc(8 * sizeof(double *));

    for (int32_t i = 0; i < 8; i++)
    {
        bd.coord[i] = (double *) malloc(3 * sizeof(double));
    }
    return bd;
}

Typical values: npol > 1 million, 32 threads, and each of the npol Cubes consists of 1-3 smaller cuboids, which are checked directly against each other for intersection.

(profiler screenshot)

  • Of course the gains are small: every worker will still run its intersection checks regardless, and you also have an atomic write. Can you provide more details in your post about the cubes and what you know about them, and about the intersection function? It seems to me that you need to revise the algorithm (perhaps find an alternative to the intersection check) in order to decrease its complexity, and also seek to decrease the dependencies so you may parallelize more. – Soleil Mar 02 '22 at 13:39
  • Also, with your profiler did you spot where/how is the time spent ? Are you cpu bound or write bound ? – Soleil Mar 02 '22 at 13:42
  • If I understand correctly, I am write bound, i.e. I have problems with accessing memory, i.e. all time is spent on that. – Alexander S Mar 02 '22 at 15:21
  • Also, memory allocation and free can be expensive (they seem to be allocated on the heap, not the stack! What are their sizes?). Can you share more about `nthreads` and the other details? *all time is spent on that*, i.e., `#pragma omp atomic write check = true;`? If not, what else? What is happening with `bd1` and `bd2`? I don't see what you are doing with them outside this code and with `cube_intersecting`; are they really useful? – Soleil Mar 02 '22 at 15:39
  • I believe there is a typo in your code after `if (loc_check) break;` with a wrong `}`. And please share the `struct bd` definition. – Soleil Mar 02 '22 at 15:41
  • yes, sorry, mistype. I have updated the topic with additional functions. The size of the max box I am trying to count is <100 Mb. No, atomic write does not take too much time. If I understand correctly, the problem is with vector container and threads are waiting to access it. But my understanding is limited. Thank you for your attention. – Alexander S Mar 02 '22 at 16:32
  • Note that your OpenMP code is not correct: you have to use `#pragma omp atomic read` before `loc_check = check;`. I also suggest you use the `default(none)` clause and explicitly specify the sharing attributes of all your variables. – Laci Mar 02 '22 at 18:42
  • What are the typical values of `npol` and `nthreads`? – Laci Mar 02 '22 at 19:22
  • Do you call `Generator::_check_cube` once or many times? – Laci Mar 02 '22 at 19:23
  • I corrected the mistake with the atomic read. Typical npol is up to one million, nthreads 32. The generator is called very often, after each single cube is generated: a random cube is generated, and check_cube is called to see if it intersects with the existing cubes. – Alexander S Mar 02 '22 at 21:11
  • Please update your post with the new details, so new readers don't have to search the comments; also please add numbers wherever you can (time, speed increase, values, desired time to compute), and a screenshot of your profiling. – Soleil Mar 03 '22 at 13:56
  • updated with profiler screen – Alexander S Mar 04 '22 at 11:03

3 Answers


The problem with your search is that OpenMP really likes static loops, where the number of iterations is predetermined. Thus, maybe one task will break early, but all the others will go through their full search.

With recent versions of OpenMP (5, I think) there is a solution for that.

  1. (Not sure about this one: Make your tasks much more fine-grained, for instance one for each intersection test);
  2. Spawn your tasks in a taskloop;
  3. Once you find your intersection (or any condition that causes you to break), do cancel taskloop.
  4. Small problem: cancelling is disabled by default. Set the environment variable OMP_CANCELLATION to true.
Victor Eijkhout
  • Thank you very much! But unfortunately I must use OpenMP 4.0. Is your solution suitable for older versions as well? – Alexander S Mar 02 '22 at 18:30
  • The cancel construct is available in OpenMP 4.0, but @AlexanderS uses the shared `check` variable to cancel the search (note however that the OP's code is not correct: reading the variable should be done with an atomic read). IMO, using the cancel construct will not increase the speed at all. Note also that the `taskloop` construct is only available in OpenMP 4.5, but I do not think it would increase the performance. – Laci Mar 02 '22 at 18:59
  • Do I understand correctly that if I change the if...break to the cancel construct, no improvement will be achieved? – Alexander S Mar 02 '22 at 21:06
  • I agree with @Laci that you already have a global break. You may wonder whether all those atomic reads will be a bottleneck. At the very least it causes a lot of coherence traffic. An approach based on cancelling does not require all that checking. But predictions are always hard. It's your code, so you should have the motivation to try it out both ways. – Victor Eijkhout Mar 02 '22 at 22:25
  • I tried. Atomic writes have a minor impact. So yes, nothing changes. – Alexander S Mar 02 '22 at 22:28
  • Next suggestion: maybe it's all that allocation that kills you. Give each thread a `threadprivate` copy of `bd1, bd2`. – Victor Eijkhout Mar 02 '22 at 23:31
  • I tried, without much success. – Alexander S Mar 04 '22 at 11:12

Do you have more intersections that come out true or more that come out false? If most are true, you're flooding your hardware with requests to write to a shared resource, and what you are doing is essentially sequential. One way to address this is to avoid the shared resource entirely: with no mutex, you let all threads run and take the decision at the end from the per-thread results. This will likely run faster, but the benefit also depends on a few parameters (e.g., nthreads, ncuboids).

It is possible that on another architecture (e.g., a GPU) your algorithm works well as it is. It may be worth benchmarking it on a GPU to see whether you would benefit from that migration, given the production sizes (millions of cuboids, 24 dimensions).

You also have a complexity problem: for every new cuboid you compare against up to the whole set of existing cuboids. One way to address this is to gather all the cuboid extents (ranges) by dimension, keep them ordered, and insert each new cuboid's ranges in order. If there is an intersection in one dimension, you test the next one, and so on. You can also run the dimensions in parallel. Before running through the ranges, test whether you hit inside the global range at all; if not, it's useless to test the intersections locally.

Here, and in general, you want to parallelize with a minimum of dependencies (shared resources, mutexes), so you want to find a point of view from which that is possible. Parallelizing over dimensions over ordered ranges (segments) might be better than parallelizing over cuboids.

Algorithms and the benefits of parallelism also depend on the values of your objects. This does not mean that complexity predictions are irrelevant, but that one may find a smarter approach given those values.

Soleil
  • Appreciate your attention. Most of the intersections are false. As soon as any thread finds an intersection to be true, I want the search to be cancelled and true to be returned. Unfortunately I can't think of a smarter approach than the one I have now without completely changing a lot of code and rewriting many functions that solve other complicated problems. – Alexander S Mar 02 '22 at 19:09
  • @AlexanderS If you are in this situation, I think you have an engineering-debt problem; adding features or proposing an alternative should not interfere with your current implementation. For the per-dimension approach, you copy the values into new arrays and write new functions, so it's independent. – Soleil Mar 03 '22 at 12:48
  • @AlexanderS I think the separated-dimensions approach on a GPU may take less than a millisecond to compute for millions of cubes of dimension 24. Most of the time will be spent on malloc and device-host movements. – Soleil Mar 03 '22 at 13:58
  • Thank you very much. Where should I start if I would like to run it on a GPU? Never tried before. – Alexander S Mar 04 '22 at 16:32
  • @AlexanderS download CUDA if you have Nvidia cards and study their samples. If not, oneAPI for Intel, OpenCL if you want to be agnostic, or DirectCompute if you use Windows but want to be GPU-vendor independent. The best in terms of support and capabilities is CUDA, especially on Linux. Even a small GPU like a 3070 or 3080 will give you results from another world. However, you can already propose alternative algorithms for the CPU and saturate your CPU before starting GPU programming; then your code will be kind of close. – Soleil Mar 05 '22 at 10:06

I think your code is memory bound, so its bottleneck is memory reads/writes, not calculations. This can be the main reason for the poor speed increase. As already mentioned by @Soleil, different hardware (a GPU) can be beneficial here.

You mentioned in the comments that Generator::_check_cube is called many times. To reduce OpenMP overheads, my suggestion is to move the parallel region out of this function; you can even put it in your main function:

int main() {
    #pragma omp parallel
    #pragma omp single nowait
    {
        //your code
    }
}

In this case you have to use #pragma omp taskwait to wait for the tasks to complete.

for (int32_t i = 0; i < nthreads; i++)
{
   #pragma omp task default(none) firstprivate(...) shared (..)
   {
       //your code comes here
   }   
}
#pragma omp taskwait

I also suggest using the default(none) clause on the #pragma omp task directive, so you have to explicitly specify the sharing attribute of every variable.

Do you really need the function get_mcoord? It seems like a redundant memory copy to me. It may be better to write a check_gjk_intersection function that takes _cuboids (or their indices) as parameters. That way you get rid of the many memory allocations/deallocations of bd1 and bd2, which can also be time consuming, as @Victor pointed out.

Laci