Effective parallelization of heavy tasks in while loop?

Question

Developed a serial version of particle simulation code, now I want to speed up a bit Only on the heaviest task during time-stepping. Basically 3 different tasks (A, B, C) performed during one time-step:

A: 1) update particles contained in a sub-domain (cell)
   2) then update particle's neighbors (particles)

B: 1) update potential contact pairs between particles (close enough)
   2) loop surface points (10-20k per particle) of each contact pair: find contact point

C: Integration: update each particle's position, velocity, etc.

The heaviest task is B.2: normally up to 50~70% CPU time.

So my first idea is to parallelize B.2 and let the rest do serial computation.

...
int N_every_neighbors = 1000;
int N_every_nodes = 100;

while (time())
{
  // update neighbors
  if (curr_steps % N_every_neighbors == 0)
  {
    A.update_cell_sub_rigids();  // light task
    A.update_neighbor_list();    // light task

    B.update_contact_pairs();    // moderate task
    B.update_node_neighbors(check_all);    // heaviest task!
  }

  if (curr_steps % N_every_nodes == 0) 
  {
    B.update_node_neighbors(not_check_all); // second heaviest
  }

  // update particle position, contact forces
  C.integration.initial_integrate();     // light task
  C.integration.update_contact_forces(); // moderate task
  C.integration.final_integrate();       // light task
}
...

The problem is that tasks A, B, C have to be executed sequentially for correct result, i.e. they are NOT independent tasks.

A.1 ---> A.2 ===> B.1 ---> B.2 ===> C.1 ---> C.2 ---> C.3

So what I want to do first is to make the heavy task B.2 B.update_node_neighbors() run in parallel, as there are nested loops in this function.

As I am quite new to OpenMP, so just did some simple optimization.

int N_threads = 8;
omp_set_num_threads(N);

#pragma omp parallel 
#pragma omp single

while (time())
{
  // do tasks A ---> B ---> C;
}


void B::update_node_neighbors (bool check_all)
{
  int All_contact_pairs = this->contact_pairs.size();

  #pragma omp for
  for (int i=0; i<All_contact_pairs; i++)
  {
    auto& particle_i_contacts = this->contact_pairs[i];
    int N_contacts_i = particle_i_contacts.size();
    
    // loop over all contacts for particel i
    for (int j=0; j<N_contacts_i; j++)
    {
      auto& pair_ij = particle_i_contacts[j];
      
      // really heavy computation here
      ...
    }
  }
}

By doing this, I found no significant performance increase. I would like to ask those who are experienced on parallel computation, is there any better way to make the function B.2 run in parallel at each time-step, and let the rest tasks run in serial fashion.

Update 1:

Did some simple test only on the heavy task B.2

while (time())
{
  if (condition_0)
  {
    A.1; 
    A.2

    B.1;
    B.2(true); // heavy task!
  }

  if (condition_1) 
  {
    B.2(false);  // second heaviest
  }

  C.1;
  C.2;
  C.3;
}

The actual content of B.2 is like:

void B::update_node_neighbors(bool check_all)
{
  ...
  
  int N_threads = 6;
  omp_set_num_threads(N_threads);
  
  #pragma omp parallel for schedule(static)
  for (int i=0; i<N_contacts; i++)
  {
    ...

    // particle-particle contacts
    for (int j=0; j<N_contacts_pp; j++)
    {
      for(int pt_id ...)
      {
        // check all particle_i's surface points to particle_j
        // do_the_actual_work
      }
    }
    
    // particle-wall contacts
    for (int k=0; k<N_contacts_pw; k++)
    {
      for(int pt_id ...)
      {
        // check all particle_i's surface points to wall_k
        // do_the_actual_work
      }
    }
}

Tried N_threads = 1,2,4,6,8,10,12; for constant time-steps, the CPU time is more or less the same. Why OpenMP parallel on the out-most loop in B.2 not working? could not figure out:(

Since the code is not a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example), I cannot be entirely sure. But to me it looks as if you are calling `update_node_neighbors()` from within a `#pragma omp single` region, meaning that only a single thread is running. Try removing the `#pragma omp parallel` and `#pragma omp single` statements, and replace the `#pragma omp for private(j)` with `#pragma omp parallel for`. Does this help? — Sedenion, Apr 11 '22 at 17:37
Putting `#pragma omp parallel for` inside the function leads to even worse performance. I double threads are created at each time-step in the while loop. we need to create threads only once for the entire time-stepping. — KOF, Apr 11 '22 at 17:53
Have you run a profiler? How much time is spent on `update_node_neighbors`? How many times it is called? — Laci, Apr 11 '22 at 18:02
@KOF Don't worry about thread creation. OMP creates them once and then just puts them to sleep. Starting thousands of parallel regions has almost no effect on runtime. Only worry about things you have measured to be a problem. https://stackoverflow.com/a/71812329/2044454 — Victor Eijkhout, Apr 11 '22 at 18:11
@KOF Have you actually measured my suggestion, or are you just making assumptions? — Sedenion, Apr 11 '22 at 18:24
The only OpenMP pragma you should need is a `#pragma omp parallel for` above the for-loop you're working with. The rest are unnecessary or actively bad. You should probably think about algorithms, though, if you really want good speed-ups. Are the interactions you're considering infinite-range (eg, gravity) or limited-range (eg, surface contact or high-power inverse forces)? — Richard, Apr 11 '22 at 19:02
@Richard Particles are rigid, and only surface contact forces and gravity are considered. — KOF, Apr 11 '22 at 19:06
@KOF please don't include [tags in titles](https://meta.stackexchange.com/questions/19190/should-questions-include-tags-in-their-titles). — user438383, Apr 11 '22 at 19:07
@KOF: You can calculate the surface contact forces in _O(N)_ time. If the particles are drawn to each other by gravity, that requires something more clever like [Barnes-Hut](https://en.wikipedia.org/wiki/Barnes%E2%80%93Hut_simulation). If there's no inter-particle gravity then you can do all force calculations in _O(N)_ time. — Richard, Apr 11 '22 at 19:08
@Richard Because each particle is composed of at least 10-20k surface points, so the heaviest task is to check the closest point for a pair of contacting particles every N steps. Within this N steps we just check the neighbors of last found contact point between a pair of particles. The rest tasks are not that expensive, because standard sub-domain (cell) is created for fast tracking of one particle's neighbors. — KOF, Apr 11 '22 at 19:14
@Richard The Barnes-Hut you mentioned is like Octree structure, while I simply divide the domain with uniform-sized cells, because the particle simulated are quite big (pharmaceutical tablets) compared to the domain, normally only 10k-100k particles are simulated. If > 1 million particles are to simulated, I will extend it to Octree in the future. — KOF, Apr 11 '22 at 19:20
@KOF: I'm pretty sure you can check for contacts without doing all-pairs. — Richard, Apr 11 '22 at 19:26
@Richard `A.update_cell_sub_rigids(); A.update_neighbor_list();` does the job for broad phase contact detection. We only check particles in the adjacent cells of one particle by meanings of MBS (minimum bounding sphere). Not check the rest particles in system. — KOF, Apr 11 '22 at 19:58
@Laci Did some test to parallelize on `B.2`, but it seems not working, does not matter how many threads assigned for the `for loop`. — KOF, Apr 12 '22 at 17:14
Without a [mre], we cannot answer your question in update 1. — Laci, Apr 12 '22 at 18:17
@Laci it takes some time to reproduce it with a minimum version, because the profile is from working code which is very complex. — KOF, Apr 12 '22 at 22:03

Effective parallelization of heavy tasks in while loop?

0 Answers0