My PyTorch training program converges worse than the equivalent TensorFlow implementation. When I switch from Adam to SGD, the losses are identical; with Adam, the losses differ starting from the very first epoch. As far as I can tell, I'm using the same settings in both programs. Any thoughts on how to debug this would be helpful!
Losses using SGD
PyTorch
0.1504615843296051
0.10858417302370071
0.08603279292583466
TensorFlow
0.15046157
0.108584
0.08603277
Losses using Adam
PyTorch
0.0031117501202970743
0.0020642257295548916
0.0019268309697508812
0.0016333406092599034
0.0017334128497168422
0.0014430736191570759
0.0010424457723274827
0.0012145100627094507
0.0011195113183930516
0.0009501167223788798
0.0009987876983359456
0.0007953296881169081
0.00075263757025823
0.0008374055614694953
0.000735406531020999
TensorFlow
0.0036667113
0.0032563617
0.0021536187
0.0015266595
0.0013580231
0.0013878695
0.0011856346
0.0011136091
0.00091276
0.000890126
0.00088381825
0.0007283067
0.00081382995
0.0006670901
0.00046282331
Adam optimizer settings
TF 1.15.3:
adam_optimizer = tf.train.AdamOptimizer(learning_rate=5e-5)
# default parameters from the documentation at https://github.com/tensorflow/tensorflow/blob/v1.15.0/tensorflow/python/training/adam.py#L32-L235:
# learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, use_locking=False, name="Adam")
PyTorch:
torch.optim.Adam(params=model.parameters(), lr=5e-5, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0)
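Since the two configurations look identical on paper, one sanity check (a sketch, not something from my runs above) would be to drive each framework's Adam with the same hand-fixed gradient on a single scalar parameter and compare the first few updates directly. The PyTorch half could look like the following; the TF side would mirror it with a tf.Variable and adam_optimizer.apply_gradients.

import torch

# Sketch only: apply Adam to one scalar parameter with a fixed, hand-set
# gradient so the per-step updates can be compared against the TF optimizer
# outside of the full model.
p = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.Adam([p], lr=5e-5, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0)

for step in range(5):
    p.grad = torch.tensor([0.1])  # same constant gradient every step
    opt.step()
    print(step, p.item())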
Training
- I loaded identical weights from file to initialize the two models.
- I trained and tested on a single data sample, also loaded from file, using 1000 iterations for training, 1 iteration for testing, and a batch size of 1.
Prior debugging
- As above, I used identical parameters and data.
- I ran a single forward-backward pass with the Adam optimizer, saved the data and gradients at each layer, and plotted the results. Everything matched to within 1e-6 to 1e-10, and the loss was identical to within rounding error (a sketch of how that comparison could be scripted is below).
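For reference, here is roughly how the PyTorch side of that per-layer gradient comparison could be scripted; the tf_grads/ directory layout is just a placeholder for wherever the TF gradients get exported, not something that exists in my setup.

import numpy as np

# Sketch: after train_loss.backward(), diff each PyTorch gradient against an
# array exported from the TF run. The "tf_grads/<param name>.npy" layout is
# hypothetical.
def compare_grads(model, tf_grad_dir="tf_grads"):
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        pt_grad = param.grad.detach().cpu().numpy()
        tf_grad = np.load(f"{tf_grad_dir}/{name}.npy")
        print(f"{name}: max abs diff = {np.abs(pt_grad - tf_grad).max():.3e}")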
Saving and loading the PyTorch model
def train(...):
    ...
    checkpoint = torch.load(checkpoint_file, map_location=device)
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    ...
    counter = 0
    while run:
        counter += 1
        if counter > 1000:
            break
        # `in` is a reserved word in Python, so the input array is named inp here
        inp = np.load("debug_data/in.npy")
        out1 = np.load("debug_data/out1.npy")
        out2 = np.load("debug_data/out2.npy")
        # adjust from the TF array layout
        inp = inp.squeeze(3)
        inp = np.expand_dims(inp, axis=0)
        # ... do the same for out1 and out2
        inp, out1, out2 = \
            torch.from_numpy(inp).to(device), \
            torch.from_numpy(out1).to(device), \
            torch.from_numpy(out2).to(device)
        optimizer.zero_grad()
        out1_hat, out2_hat = model(inp)
        train_loss = loss_fn(out1_hat, out1) + loss_fn(out2_hat, out2)
        train_loss.backward()
        optimizer.step()
        save_checkpoint({'state_dict': model.state_dict(),
                         'optimizer': optimizer.state_dict()},
                        latest_filename=latest_checkpoint_path)
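If the divergence builds up in Adam's running moments even though a single step matches, it should show up in the optimizer state. A minimal sketch of dumping that state after each optimizer.step() (exp_avg and exp_avg_sq are torch.optim.Adam's equivalents of TF's m and v slots):

# Sketch: print Adam's internal state after each optimizer.step().
# exp_avg / exp_avg_sq correspond to TF's "m" / "v" slot variables.
def dump_adam_state(optimizer):
    for group in optimizer.param_groups:
        for i, p in enumerate(group['params']):
            state = optimizer.state.get(p, {})
            if not state:
                continue
            print(i, 'step:', state['step'],
                  'm mean:', state['exp_avg'].mean().item(),
                  'v mean:', state['exp_avg_sq'].mean().item())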
Saving and loading the TensorFlow model
sess.run(tf.global_variables_initializer())
writer = tf.summary.FileWriter(my_path, graph=sess.graph)
restorer = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
restorer.restore(sess, load_path)
saver = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
counter = 0
while run:
    counter += 1
    if counter > 1000:
        break
    inp = np.load("")    # same single debug_data sample as the PyTorch run (paths omitted here)
    out1 = np.load("")
    out2 = np.load("")
    out1 = out1[0, :, :, :]
    out1 = out1[:, :, :, np.newaxis]
    out2 = out2[0, :, :, :]
    out2 = out2[:, :, :, np.newaxis]
    inp = inp[0, :, :, :]
    inp = inp[:, :, :, np.newaxis]
    # in_ph / out1_ph / out2_ph are the graph's input and target placeholders,
    # named differently from the numpy arrays so the feed_dict is unambiguous
    _, _loss = sess.run([optimizer, loss],
                        feed_dict={in_ph: inp, out1_ph: out1, out2_ph: out2})
    save_path = saver.save(sess, my_save_path, global_step=int(_global_step))
sess.close()
tf.reset_default_graph()
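The rough TF 1.x counterpart would read the Adam slot variables before sess.close(), assuming adam_optimizer from the settings above is the instance whose train op is run in the loop:

# Sketch: read TF Adam's "m" / "v" slots for comparison with PyTorch's
# exp_avg / exp_avg_sq. Run this inside the loop, before sess.close().
for var in tf.trainable_variables():
    m = adam_optimizer.get_slot(var, 'm')
    v = adam_optimizer.get_slot(var, 'v')
    if m is not None and v is not None:
        m_val, v_val = sess.run([m, v])
        print(var.name, 'm mean:', m_val.mean(), 'v mean:', v_val.mean())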