I added an online eval function to my training process, and this bug, caused by dist.all_reduce, only occurs after the eval function has been called many times. If I delete the eval function, the code finishes the training process.

CUDA version: 11.4, NCCL version: 2.10.3. I checked their compatibility on NVIDIA's site.

I do not know how to solve this problem. Thanks in advance!