I am trying to understand how PyTorch works and want to replicate a simple CNN training on CIFAR. The CNTK script gets to 0.76 accuracy after 168 seconds of training (10 epochs), which is similar to my MXNet script (0.75 accuracy after 153 seconds).
However, my PyTorch script lags well behind, at 0.71 accuracy after 354 seconds. I appreciate that accuracy will vary somewhat due to stochastic weight initialisation, etc. However, the gap between frameworks is much larger than the run-to-run variation I see within a single framework.
The reasons I can think of:
- The MXNet and CNTK weights are initialised with Xavier/Glorot uniform; I am not sure how to do this in PyTorch, so perhaps the weights are initialised to 0
- CNTK does gradient clipping by default; I am not sure whether PyTorch has an equivalent
- Perhaps the bias is dropped in PyTorch by default
- I use SGD with momentum; perhaps the PyTorch implementation of momentum is a bit different
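Most of these guesses can be checked directly on a single layer. The following is a minimal sketch, not the actual training script: the learning rate, momentum, `max_norm` and the random input batch are all assumed values chosen for illustration. It shows that a freshly constructed `nn.Conv2d` has non-zero weights and a bias term, and that gradient clipping in PyTorch only happens when `clip_grad_norm_` is called explicitly. On the momentum point, the `torch.optim.SGD` documentation notes that its update differs subtly from the formulation used in some other frameworks, so that one is worth reading rather than testing.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

conv = nn.Conv2d(3, 50, kernel_size=3, padding=1)

# Default weights are drawn from a scaled uniform distribution, not set to zero.
weights_are_nonzero = bool(conv.weight.abs().sum() > 0)

# A bias term is created unless bias=False is passed to the constructor.
has_bias = conv.bias is not None

# Assumed hyperparameters, for illustration only.
optimizer = optim.SGD(conv.parameters(), lr=0.01, momentum=0.9)

# Dummy CIFAR-shaped batch and a dummy loss, just to produce gradients.
x = torch.randn(4, 3, 32, 32)
loss = conv(x).pow(2).mean()

optimizer.zero_grad()
loss.backward()

# Clipping is opt-in: it goes between backward() and step().
total_norm = nn.utils.clip_grad_norm_(conv.parameters(), max_norm=1.0)
optimizer.step()

print(weights_are_nonzero, has_bias, float(total_norm))
```

If the other frameworks really do clip by default, adding the `clip_grad_norm_` call to the PyTorch loop (with a matching threshold) would be one way to make the comparison like-for-like.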
Edit:
I have tried specifying the weight initialisation explicitly, but it does not seem to make much difference:
self.conv1 = nn.Conv2d(3, 50, kernel_size=3, padding=1)
# In-place initialisers (trailing underscore); the names without it are deprecated
init.xavier_uniform_(self.conv1.weight, gain=np.sqrt(2.0))
init.constant_(self.conv1.bias, 0)
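To rule out a layer being missed, the same initialisation can be applied to every convolutional and linear layer in one pass. This is a minimal sketch under stated assumptions: `init_weights` is a hypothetical helper name, and the small `nn.Sequential` model is a stand-in for illustration, not the network from my script. It uses the in-place `init.xavier_uniform_` and `init.constant_` functions together with `Module.apply`, which calls the helper on every submodule.

```python
import math

import torch.nn as nn
from torch.nn import init


def init_weights(module):
    """Hypothetical helper: Xavier/Glorot uniform weights, zero biases."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        init.xavier_uniform_(module.weight, gain=math.sqrt(2.0))
        if module.bias is not None:
            init.constant_(module.bias, 0.0)


# Stand-in model for illustration only (CIFAR-sized 3x32x32 inputs assumed).
model = nn.Sequential(
    nn.Conv2d(3, 50, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(50 * 32 * 32, 10),
)

# apply() visits every submodule, so no layer is skipped by accident.
model.apply(init_weights)
```

Checking `model[0].bias` and the range of `model[0].weight` afterwards is a quick way to confirm the initialisation actually took effect.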