I am running run_t5_mlm_flax.py on 8 GPUs and get the following error (the script works with a single GPU):

NCCL operation ncclAllReduce(send_buffer, recv_buffer, element_count, dtype, reduce_op, comm, gpu_stream) failed: unhandled cuda error

Do you have any suggestions?