My PyTorch training program converges worse than the equivalent TensorFlow implementation. When I switch from Adam to SGD, the losses are identical; with Adam, the losses differ starting from the very first epoch. As far as I can tell, I'm using the same settings in both programs. Any thoughts on how to debug this would be helpful!
Losses using SGD
PyTorch
0.1504615843296051
0.10858417302370071
0.08603279292583466
TensorFlow
0.15046157
0.108584
0.08603277
Losses using Adam
PyTorch
0.0031117501202970743
0.0020642257295548916
0.0019268309697508812
0.0016333406092599034
0.0017334128497168422
0.0014430736191570759
0.0010424457723274827
0.0012145100627094507
0.0011195113183930516
0.0009501167223788798
0.0009987876983359456
0.0007953296881169081
0.00075263757025823
0.0008374055614694953
0.000735406531020999
TensorFlow
0.0036667113
0.0032563617
0.0021536187
0.0015266595
0.0013580231
0.0013878695
0.0011856346
0.0011136091
0.00091276
0.000890126
0.00088381825
0.0007283067
0.00081382995
0.0006670901
0.00046282331
Adam optimizer settings
TF 1.15.3:
adam_optimizer = tf.train.AdamOptimizer(learning_rate=5e-5)
# default parameters from the documentation at https://github.com/tensorflow/tensorflow/blob/v1.15.0/tensorflow/python/training/adam.py#L32-L235:
# learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, use_locking=False, name="Adam")
PyTorch:
torch.optim.Adam(params=model.parameters(), lr=5e-5, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0)
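Since the two configurations look identical on paper, one sanity check (a sketch, not something from my runs above) would be to drive each framework's Adam with the same hand-fixed gradient on a single scalar parameter and compare the first few updates directly. The PyTorch half could look like the following; the TF side would mirror it with a tf.Variable and adam_optimizer.apply_gradients.

import torch

# Sketch only: apply Adam to one scalar parameter with a fixed, hand-set
# gradient so the per-step updates can be compared against the TF optimizer
# outside of the full model.
p = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.Adam([p], lr=5e-5, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0)

for step in range(5):
    p.grad = torch.tensor([0.1])  # same constant gradient every step
    opt.step()
    print(step, p.item())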
Training
- I loaded identical weights from file to initialize the two models.
- I trained and tested on a single data sample, also loaded from file, using 1000 iterations for training, 1 iteration for testing, and a batch size of 1.
Prior debugging
- As above, I used identical parameters and data.
- I ran a single forward-backward pass with the Adam optimizer, saved the data and gradients at each layer, and plotted the results. Everything matched to within 1e-6 to 1e-10, and the loss was identical to within rounding error (a sketch of how that comparison could be scripted is below).
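For reference, here is roughly how the PyTorch side of that per-layer gradient comparison could be scripted; the tf_grads/ directory layout is just a placeholder for wherever the TF gradients get exported, not something that exists in my setup.

import numpy as np

# Sketch: after train_loss.backward(), diff each PyTorch gradient against an
# array exported from the TF run. The "tf_grads/<param name>.npy" layout is
# hypothetical.
def compare_grads(model, tf_grad_dir="tf_grads"):
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        pt_grad = param.grad.detach().cpu().numpy()
        tf_grad = np.load(f"{tf_grad_dir}/{name}.npy")
        print(f"{name}: max abs diff = {np.abs(pt_grad - tf_grad).max():.3e}")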
Saving and loading the PyTorch model
def train(...):
    ...
    checkpoint = torch.load(checkpoint_file, map_location=device)
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    ...
    counter = 0
    while run:
        counter += 1
        if counter > 1000:
            break
        # `in` is a reserved word in Python, so the input array is named inp here
        inp = np.load("debug_data/in.npy")
        out1 = np.load("debug_data/out1.npy")
        out2 = np.load("debug_data/out2.npy")
        # adjust from the TF array layout
        inp = inp.squeeze(3)
        inp = np.expand_dims(inp, axis=0)
        # ... do the same for out1 and out2
        inp, out1, out2 = \
            torch.from_numpy(inp).to(device), \
            torch.from_numpy(out1).to(device), \
            torch.from_numpy(out2).to(device)
        optimizer.zero_grad()
        out1_hat, out2_hat = model(inp)
        train_loss = loss_fn(out1_hat, out1) + loss_fn(out2_hat, out2)
        train_loss.backward()
        optimizer.step()
        save_checkpoint({'state_dict': model.state_dict(),
                         'optimizer': optimizer.state_dict()},
                        latest_filename=latest_checkpoint_path)
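If the divergence builds up in Adam's running moments even though a single step matches, it should show up in the optimizer state. A minimal sketch of dumping that state after each optimizer.step() (exp_avg and exp_avg_sq are torch.optim.Adam's equivalents of TF's m and v slots):

# Sketch: print Adam's internal state after each optimizer.step().
# exp_avg / exp_avg_sq correspond to TF's "m" / "v" slot variables.
def dump_adam_state(optimizer):
    for group in optimizer.param_groups:
        for i, p in enumerate(group['params']):
            state = optimizer.state.get(p, {})
            if not state:
                continue
            print(i, 'step:', state['step'],
                  'm mean:', state['exp_avg'].mean().item(),
                  'v mean:', state['exp_avg_sq'].mean().item())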
Saving and loading the TensorFlow model
sess.run(tf.global_variables_initializer())
writer = tf.summary.FileWriter(my_path, graph=sess.graph)
restorer = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
restorer.restore(sess, load_path)
saver = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
counter = 0
while run:
    counter += 1
    if counter > 1000:
        break
    inp = np.load("")    # same single debug_data sample as the PyTorch run (paths omitted here)
    out1 = np.load("")
    out2 = np.load("")
    out1 = out1[0, :, :, :]
    out1 = out1[:, :, :, np.newaxis]
    out2 = out2[0, :, :, :]
    out2 = out2[:, :, :, np.newaxis]
    inp = inp[0, :, :, :]
    inp = inp[:, :, :, np.newaxis]
    # in_ph / out1_ph / out2_ph are the graph's input and target placeholders,
    # named differently from the numpy arrays so the feed_dict is unambiguous
    _, _loss = sess.run([optimizer, loss],
                        feed_dict={in_ph: inp, out1_ph: out1, out2_ph: out2})
    save_path = saver.save(sess, my_save_path, global_step=int(_global_step))
sess.close()
tf.reset_default_graph()
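The rough TF 1.x counterpart would read the Adam slot variables before sess.close(), assuming adam_optimizer from the settings above is the instance whose train op is run in the loop:

# Sketch: read TF Adam's "m" / "v" slots for comparison with PyTorch's
# exp_avg / exp_avg_sq. Run this inside the loop, before sess.close().
for var in tf.trainable_variables():
    m = adam_optimizer.get_slot(var, 'm')
    v = adam_optimizer.get_slot(var, 'v')
    if m is not None and v is not None:
        m_val, v_val = sess.run([m, v])
        print(var.name, 'm mean:', m_val.mean(), 'v mean:', v_val.mean())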