
I've been trying to investigate (e.g. by checking weights, gradients and activations during training) why SGD with a 0.001 learning rate trains successfully while Adam fails to. (Please see my previous post, "Why is my loss (binary cross entropy) converging on ~0.6? (Task: Natural Language Inference)".)

Note: I'm using the same model from my previous post here as well.


Using tf.keras, I trained the neural network with model.fit():

from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(x=ds,
          epochs=80,
          validation_data=ds_val)

This resulted in the epoch loss graphed below: within the 1st epoch it reached a train loss of 0.46, and it ultimately finished with a train_loss of 0.1241 and a val_loss of 0.2849.

*Figure: epoch loss with model.fit and SGD(0.001), training succeeds*

I would have used tf.keras.callbacks.TensorBoard(histogram_freq=1) to train the network with both SGD(0.001) and Adam and compare them, but it throws an InvalidArgumentError on Variable:0 that I can't decipher. So I tried to write a custom training loop with GradientTape and plot the values myself.
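For reference, the callback-based attempt looked roughly like this (a sketch; the log_dir path is just a placeholder):

tb_callback = tf.keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=1)

model.compile(optimizer=SGD(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

# histogram_freq=1 writes weight histograms to TensorBoard every epoch
model.fit(x=ds,
          epochs=80,
          validation_data=ds_val,
          callbacks=[tb_callback])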


Using tf.GradientTape(), I tried to reproduce the results with the exact same model and dataset. However, the epoch loss decreases incredibly slowly, reaching a train loss of only 0.676 after 15 epochs (see the graph below). Is there something wrong with my implementation? (Code below.)

*Figure: epoch loss with GradientTape and SGD(0.001), training fails*

from typing import Dict

import tensorflow as tf
from tensorflow.keras.losses import Loss

@tf.function
def compute_grads(train_batch: Dict[str, tf.Tensor], target_batch: tf.Tensor,
                  loss_fn: Loss, model: tf.keras.Model):
    with tf.GradientTape(persistent=False) as tape:
        # forward pass
        outputs = model(train_batch)
        # calculate loss
        loss = loss_fn(y_true=target_batch, y_pred=outputs)

    # calculate gradients for each param
    grads = tape.gradient(loss, model.trainable_variables)
    return grads, loss

from tensorflow.keras.losses import BinaryCrossentropy
from tensorflow.keras.optimizers import SGD
from tqdm import tqdm

BATCH_SIZE = 8
EPOCHS = 15

bce = BinaryCrossentropy()
optimizer = SGD(learning_rate=0.001)

for epoch in tqdm(range(EPOCHS), desc='epoch'):
    # - accumulators
    epoch_loss = 0.0

    for (i, (train_batch, target_dict)) in tqdm(enumerate(ds_train.shuffle(1024).batch(BATCH_SIZE)), desc='step'):

        (grads, loss) = compute_grads(train_batch, target_dict['target'], bce, model)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        epoch_loss += loss

    avg_epoch_loss = epoch_loss/(i+1)
    tensorboard_scalar(writer, name='epoch_loss', data=avg_epoch_loss, step=epoch)  # custom helper function
    print("Epoch {}: epoch_loss = {}".format(epoch, avg_epoch_loss))

Thanks in advance!

Jack-P

1 Answer


If you are shuffling your dataset, the problem may come from the shuffling done by the tf.data.Dataset method: .shuffle(buffer_size) only shuffles within one buffer of elements at a time, not across the whole dataset. Using Keras Model.fit probably yielded better results because it adds another level of shuffling on top. Pre-shuffling the data with numpy.random.shuffle may therefore improve the training performance. From this reference.
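You can see this buffer effect directly with a toy dataset (a sketch):

# with a buffer much smaller than the dataset, elements barely move:
ds = tf.data.Dataset.range(10).shuffle(buffer_size=2)
print(list(ds.as_numpy_iterator()))  # e.g. [1, 0, 2, 4, 3, 5, 7, 6, 8, 9]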

An example of applying it to the generation of the dataset:

numpy_data = np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1), index_data.reshape(-1, 1)])

np.random.shuffle(numpy_data)

indexes = np.array(numpy_data[:, :2], dtype=np.uint32)
labels = np.array(numpy_data[:, 2].reshape(-1, 1), dtype=np.float32)

train_ds = data.Dataset.from_tensor_slices(
    (indexes, labels)
).shuffle(100000).batch(batch_size, drop_remainder=True)

If this does not work, you may need to use Dataset.repeat(epochs_number) and .shuffle(..., reshuffle_each_iteration=True):

train_ds = data.Dataset.from_tensor_slices(
    (np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
    ).shuffle(100000, reshuffle_each_iteration=True
    ).batch(batch_size, drop_remainder=True
    ).repeat(epochs_number)

for ix, (examples, labels) in train_ds.enumerate():
    train_step(examples, labels)
    current_epoch = ix // (len(index_data) // batch_size)

This workaround is neither beautiful nor natural, but for the moment you can use it to shuffle each epoch. It's a known issue that will be fixed; in the future you will be able to use for epoch in range(epochs_number) instead of .repeat().
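For illustration, that future version would look roughly like this (a sketch, assuming reshuffle_each_iteration=True re-shuffles on every fresh pass over the dataset):

train_ds = data.Dataset.from_tensor_slices(
    (np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
    ).shuffle(100000, reshuffle_each_iteration=True
    ).batch(batch_size, drop_remainder=True)

for epoch in range(epochs_number):
    # each new pass over train_ds triggers a fresh shuffle
    for examples, labels in train_ds:
        train_step(examples, labels)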

The solution provided here may also help a lot. You might want to check it out.

If this is not the case, you may want to speed up the TF 2.0 GradientTape loop itself. TensorFlow 2.0 introduces the concept of functions, which translate eager code into graph code, and this can be the solution.

The usage is pretty straightforward. The only change needed is that all relevant functions (like compute_loss and apply_gradients) have to be annotated with @tf.function.
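A minimal sketch of a fully graph-compiled train step, assuming the model, bce loss and optimizer from your loop:

@tf.function  # traces the eager code into a graph on the first call
def train_step(train_batch, target_batch):
    with tf.GradientTape() as tape:
        outputs = model(train_batch)
        loss = bce(y_true=target_batch, y_pred=outputs)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss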

TF_Support