I have the following program:
#pragma omp parallel num_threads(cfg.max_parallel_threads) private(pos, row, info)
{
    int threadID = omp_get_thread_num();
    #pragma omp for
    for (pos = 0; pos < (end - start); pos++)
    {
        /* some code to define row, B, P, diffts, X, L_sim and work */
        dggglm_(&row, &col, &row, B[threadID], &row, P[threadID], &row,
                diffts[threadID], X[threadID], L_sim[threadID],
                work[threadID], &lwork, &info);
    }
}
dggglm_ is a LAPACK routine. The program occasionally hangs (possibly when CPU or memory usage is high, but not always). With the same data it sometimes runs fine and sometimes hangs on another run, seemingly at random. Here is the pstack output for the thread that seems to be the problem:
Thread 115 (Thread 0x7fcd81bdf700 (LWP 91874)):
#0 0x00007fcd8a078fa6 in ATL_dscal_xp1yp0aXbX () from /usr/lib64/atlas/libatlas.so.3
#1 0x00007fcd8ae22666 in dlarfp_ () from /usr/lib64/atlas/liblapack.so.3
#2 0x00007fcd8adb1a61 in dgeqr2_ () from /usr/lib64/atlas/liblapack.so.3
#3 0x00007fcd8adb1e06 in dgeqrf_ () from /usr/lib64/atlas/liblapack.so.3
#4 0x00007fcd8add9055 in dggqrf_ () from /usr/lib64/atlas/liblapack.so.3
#5 0x00007fcd8add71ae in dggglm_ () from /usr/lib64/atlas/liblapack.so.3
#6 0x000000000041d4ae in sbas::sbas_step2_sbas_linear_new ()
#7 0x0000003b0160e0d5 in ?? () from /usr/lib64/libgomp.so.1
#8 0x0000003af3207aa1 in start_thread () from /lib64/libpthread.so.0
#9 0x0000003af2ee8c4d in clone () from /lib64/libc.so.6
Here is the output of ps -eLo pid,lwp,pcpu | grep 91757:
91757 91869 10.7
91757 91870 9.7
91757 91871 9.0
91757 91872 12.0
91757 91873 9.2
91757 91874 41.6
91757 91875 17.7
91757 91876 9.2
91757 91877 8.7
91757 91878 12.5
91757 91880 9.0
You can see that thread (LWP) 91874 uses far more CPU than the others and is still running. It seems that LAPACK has gone into an endless loop. Can someone suggest a way to debug this?
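One check I could add before each call is to rule out bad input, since a NaN or Inf in the matrices is one possible (unconfirmed) reason for an iterative scaling step never to finish. This is only a minimal sketch: first_nonfinite is a hypothetical helper of my own, not part of LAPACK or ATLAS, and it assumes the arrays are contiguous arrays of double.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not a LAPACK/ATLAS function): returns the index of
   the first NaN or +/-Inf entry in a[0..n-1], or -1 if every entry is
   finite. */
static long first_nonfinite(const double *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!isfinite(a[i]))
            return (long)i;
    }
    return -1;
}
```

Inside the loop this could be called as first_nonfinite(B[threadID], (size_t)row * col) just before dggglm_, assuming B[threadID] really holds row*col doubles, and the value of pos logged whenever the result is not -1.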
Thanks.