I am experimenting with UCX as a way to make containerized MPI applications more portable without giving up performance. I want to compare two approaches: the UCX replacement method, which mounts the system-built UCX into the container at runtime (along with the other interconnect libraries), and a variant that uses the embedded UCX and mounts in only the interconnect libraries. The latter hangs when running osu_allreduce and some (but not all) of the other OSU collective tests.
I tested point-to-point performance with osu_pt2pt_latency and it looks fine. However, when I run osu_allreduce with the variant that uses embedded UCX, the job I submit to Slurm hangs right after the test prints its results: the job state stays Running, but there is no further output. The same happens with osu_barrier, osu_bcast, osu_scatter, osu_gather, osu_reduce and osu_reduce_scatter, but not with osu_allgather or osu_alltoall. The debug messages from UCX (with UCX_LOG_LEVEL=debug set) do not show anything suspicious; they simply stop right after the endpoint is successfully disconnected. Has anyone run into the same problem, and are there any suggestions about possible causes and solutions?