I have developed a serial version of a particle simulation code and now want to speed it up a bit, but only for the heaviest task during time-stepping. Basically, 3 different tasks (A, B, C) are performed during one time step:
A: 1) update the particles contained in each sub-domain (cell)
2) then update each particle's neighbors (other particles)
B: 1) update potential contact pairs between particles (those close enough)
2) loop over the surface points (10-20k per particle) of each contact pair to find the contact point
C: Integration: update each particle's position, velocity, etc.
The heaviest task is B.2: normally up to 50~70% of the CPU time.
So my first idea is to parallelize B.2 and let the rest run serially.
...
int N_every_neighbors = 1000;
int N_every_nodes = 100;
while (time())
{
// update neighbors
if (curr_steps % N_every_neighbors == 0)
{
A.update_cell_sub_rigids(); // light task
A.update_neighbor_list(); // light task
B.update_contact_pairs(); // moderate task
B.update_node_neighbors(check_all); // heaviest task!
}
if (curr_steps % N_every_nodes == 0)
{
B.update_node_neighbors(not_check_all); // second heaviest
}
// update particle position, contact forces
C.integration.initial_integrate(); // light task
C.integration.update_contact_forces(); // moderate task
C.integration.final_integrate(); // light task
}
...
The problem is that tasks A, B, C have to be executed sequentially for a correct result, i.e. they are NOT independent tasks:
A.1 ---> A.2 ===> B.1 ---> B.2 ===> C.1 ---> C.2 ---> C.3
So what I want to do first is to make the heavy task B.2, B.update_node_neighbors(), run in parallel, as there are nested loops in this function.
As I am quite new to OpenMP, I have only tried a simple approach:
int N_threads = 8;
omp_set_num_threads(N_threads);
#pragma omp parallel
#pragma omp single
while (time())
{
// do tasks A ---> B ---> C;
}
void B::update_node_neighbors (bool check_all)
{
int All_contact_pairs = this->contact_pairs.size();
#pragma omp for
for (int i=0; i<All_contact_pairs; i++)
{
auto& particle_i_contacts = this->contact_pairs[i];
int N_contacts_i = particle_i_contacts.size();
// loop over all contacts for particle i
for (int j=0; j<N_contacts_i; j++)
{
auto& pair_ij = particle_i_contacts[j];
// really heavy computation here
...
}
}
}
With this, I found no significant performance increase. I would like to ask those experienced in parallel computing: is there a better way to make the function B.2 run in parallel at each time step while the remaining tasks run serially?
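What I am aiming for can be sketched as below. This is only a minimal, self-contained illustration, not my real code: `ContactPair`, `scan_surface_points` and the dummy arithmetic are hypothetical stand-ins for the heavy per-pair work. The parallel region is opened only around the outer loop of B.2, so everything else in the time loop would stay serial. `schedule(dynamic)` is an assumption on my side, chosen because the number of contacts per particle varies.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for one contact pair: only the number of
// surface points that have to be scanned for this pair.
struct ContactPair { int n_surface_points; };

// Hypothetical stand-in for the heavy per-pair work in B.2:
// scan all surface points and keep the smallest "distance".
double scan_surface_points(const ContactPair& pair)
{
    double closest = 1e300;
    for (int p = 0; p < pair.n_surface_points; ++p)
        closest = std::fmin(closest, std::fabs(std::sin(0.001 * p) - 0.5));
    return closest;
}

// B.2: the only part that runs in parallel. contact_pairs[i] holds all
// contacts of particle i; result[i] receives the smallest value found.
void update_node_neighbors(const std::vector<std::vector<ContactPair>>& contact_pairs,
                           std::vector<double>& result)
{
    assert(result.size() == contact_pairs.size());
    const int n = static_cast<int>(contact_pairs.size());

    // One parallel region per call; iterations are independent because
    // each one writes only to its own slot result[i].
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i)
    {
        double best = 1e300;
        for (const ContactPair& pair : contact_pairs[i])
            best = std::fmin(best, scan_surface_points(pair));
        result[i] = best;
    }
}
```

With GCC or Clang this needs `-fopenmp` to actually run in parallel; without that flag the pragma is ignored and the loop runs serially, which would also explain identical timings.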
Update 1:
I did a simple test on only the heavy task B.2:
while (time())
{
if (condition_0)
{
A.1;
A.2;
B.1;
B.2(true); // heavy task!
}
if (condition_1)
{
B.2(false); // second heaviest
}
C.1;
C.2;
C.3;
}
The actual content of B.2 is like:
void B::update_node_neighbors(bool check_all)
{
...
int N_threads = 6;
omp_set_num_threads(N_threads);
#pragma omp parallel for schedule(static)
for (int i=0; i<N_contacts; i++)
{
...
// particle-particle contacts
for (int j=0; j<N_contacts_pp; j++)
{
for(int pt_id ...)
{
// check all particle_i's surface points to particle_j
// do_the_actual_work
}
}
// particle-wall contacts
for (int k=0; k<N_contacts_pw; k++)
{
for(int pt_id ...)
{
// check all particle_i's surface points to wall_k
// do_the_actual_work
}
}
}
}
I tried N_threads = 1, 2, 4, 6, 8, 10, 12; for the same number of time steps, the CPU time is more or less the same. Why is the OpenMP parallel loop on the outermost loop in B.2 not working? I could not figure it out :(
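One thing I still need to rule out is how the time is measured. The numbers above are CPU time, and on many platforms `std::clock()` reports processor time summed over all threads, so it may stay roughly constant even when the wall-clock time drops. A minimal sketch that records both, assuming a dummy workload (`time_heavy_loop` and `Timing` are hypothetical names, not part of my code):

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <ctime>

// Wall-clock time, CPU time, and a checksum so the work is not optimized away.
struct Timing { double wall_seconds; double cpu_seconds; double checksum; };

// Hypothetical helper: runs a dummy nested loop shaped like B.2 and
// measures it with both clocks.
Timing time_heavy_loop(int n_outer, int n_inner)
{
    assert(n_outer >= 0 && n_inner >= 0);

    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    double checksum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : checksum)
    for (int i = 0; i < n_outer; ++i)
    {
        double local = 0.0;
        for (int p = 0; p < n_inner; ++p)
            local += std::sqrt(static_cast<double>(i) + p);
        checksum += local;
    }

    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();

    return { std::chrono::duration<double>(wall_end - wall_start).count(),
             static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC,
             checksum };
}
```

If `wall_seconds` shrinks with more threads while `cpu_seconds` stays flat, the loop is scaling and only the measurement was misleading; if both stay flat, the loop itself is not running in parallel.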