NVIDIA's website claims that MXNet uses NCCL (https://developer.nvidia.com/nccl). However, I haven't found anything in MXNet's GitHub repository showing that it actually uses the NCCL library.
The Chainer blog also claims that Chainer achieves better performance than MXNet on 4 GPUs because Chainer uses the NCCL library (https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html).
In some older posts in the MXNet repository, I can see people discussing the difficulty of integrating the NCCL library into MXNet.
My first question: is there any version of MXNet that includes the NCCL library? Second, what might be the performance implications of using NCCL (e.g. lower memory usage, less communication overhead across multiple GPUs)?