I found some similar topics but no helpful solution was found. Since I have some more information to provide, I opened this issue.
My PyTorch script frequently gets stuck on a training server.
Htop shows that there is only one green
CPU bar while other active cores are almost 100% red
. According to the F1
explanation, red means kernel time.
Whenever this 100% red CPU bar occurs, the training gets stuck and GPU-util drops down to 0%. Wired thing is this only happens on two of the servers I use. It never happens on my PC (less powerful) and never happens on another powerful server.
The strace
command shows that when the problem occurs, there will be many
futex(0x55bbb0e82db0, FUTEX_WAKE_PRIVATE, 1) = 0
Any explanation on what the problem is and how to avoid this. Or any further information to provide?