
I am currently trying to train a model using tf.GradientTape, since model.fit(...) from Keras will not be able to handle my data input in the future. However, while a test run with model.fit(...) and my model works perfectly, training with tf.GradientTape does not.

During training, the loss in the tf.GradientTape custom workflow first decreases slightly, but then gets stuck and does not improve any further, no matter how many epochs I run. The chosen metric also stops changing after the first few batches. Additionally, the loss per batch is unstable and jumps between nearly zero and very large values. The running loss is more stable, but it also shows that the model is not improving. All of this is in contrast to model.fit(...), where loss and metrics improve immediately.

My code:

import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import (Input, Embedding, Concatenate,
                                     Bidirectional, LSTM, Activation, Dense)
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2


def build_model(kernel_regularizer=l2(0.0001), dropout=0.001, recurrent_dropout=0.):
    x1 = Input(62)       # token sequence
    x2 = Input((62, 3))  # additional per-timestep features

    x = Embedding(30, 100, mask_zero=True)(x1)
    x = Concatenate()([x, x2])

    x = Bidirectional(LSTM(500,
                           return_sequences=True,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)

    x = Bidirectional(LSTM(500,
                           return_sequences=False,
                           kernel_regularizer=kernel_regularizer,
                           dropout=dropout,
                           recurrent_dropout=recurrent_dropout))(x)

    x = Activation('softmax')(x)

    x = Dense(1000)(x)
    x = Dense(500)(x)
    x = Dense(250)(x)
    x = Dense(1, bias_initializer='ones')(x)

    x = tf.math.abs(x)
    return Model(inputs=[x1, x2], outputs=x)


optimizer = Adam(learning_rate=0.0001)

model = build_model()
model.compile(optimizer=optimizer, loss='mse', metrics=['mse'])

options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
dat_train = tf.data.Dataset.from_generator(
    generator=lambda: <load_function()>,  # placeholder for my data-loading generator
    output_types=((tf.int32, tf.float32), tf.float32)
)
dat_train = dat_train.with_options(options)

# keras training
model.fit(dat_train, epochs=50)


# custom training
for epoch in range(50):
    for (x1, x2), y in dat_train:
        with tf.GradientTape() as tape:
            y_pred = model((x1, x2), training=True)
            loss = model.loss(y, y_pred)
        grads = tape.gradient(loss, model.trainable_variables)
        model.optimizer.apply_gradients(zip(grads, model.trainable_variables))
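
For context, the loss/metric tracking I left out above looks roughly like this (the metric objects and the print are only an illustrative sketch, not my exact logging code):

loss_fn = tf.keras.losses.MeanSquaredError()
running_loss = tf.keras.metrics.Mean()             # stable running average of the loss
running_mse = tf.keras.metrics.MeanSquaredError()  # metric over the epoch so far

for epoch in range(50):
    running_loss.reset_state()
    running_mse.reset_state()
    for (x1, x2), y in dat_train:
        with tf.GradientTape() as tape:
            y_pred = model((x1, x2), training=True)
            loss = loss_fn(y, y_pred)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        running_loss.update_state(loss)
        running_mse.update_state(y, y_pred)
    print(f"epoch {epoch}: "
          f"loss={float(running_loss.result()):.4f}, "
          f"mse={float(running_mse.result()):.4f}")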

I could use relu at the output layer; however, I found abs to be more robust, and changing it does not change the outcome. The input x1 of the model is a sequence, and x2 contains some additional features that are later concatenated to the embedded x1 sequence. For my actual approach I'm not using MSE, but the behavior is the same either way.

I could provide some data; however, my dataset is quite large, so I would need to extract a small portion of it.

All in all, my problem seems to be similar to: Keras model doesn't train when using GradientTape


Edit 1

The softmax activation is currently not necessary, but is relevant for my future goal of splitting the model. Additionally, some things I noticed:

  1. The custom training takes roughly twice as long as model.fit(...).
  2. The gradients in the custom training seem very small and range from ±1e-3 to ±1e-9 inside the model. I don't know if that is normal, and I don't know how to compare them to the gradients used by model.fit(...) (a sketch of one way to inspect them follows this list).
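
Regarding point 2, this is roughly how the gradient magnitudes can be checked inside the loop (the helper report_gradient_norms is purely illustrative and not part of my actual code):

def report_gradient_norms(grads, variables):
    # Global norm across all gradients, plus one norm per variable.
    nonempty = [g for g in grads if g is not None]
    print("global gradient norm:", float(tf.linalg.global_norm(nonempty)))
    for var, grad in zip(variables, grads):
        if grad is not None:
            print(f"{var.name}: {float(tf.norm(grad)):.2e}")

# inside the training loop, right after tape.gradient(...):
# report_gradient_norms(grads, model.trainable_variables)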

Edit 2

I've added a Google Colab notebook to reproduce the issue:

https://colab.research.google.com/drive/1pk66rbiux5vHZcav9VNSBhdWWIhQM-nF?usp=sharing

The loss and MSE for 20 epochs are shown in the two plots below:

[plot: custom training]

[plot: keras training]

While I only used a portion of my data in the notebook, it will still run for a very long time. For the custom training run, the loss for each batch is simply stored in `losses`, and it matches the behavior shown in the custom training plot. So far, I've noticed two ways of improving the performance of the custom training:

  1. The usage of custom layer initialization
  2. Using MSE as a loss function

Using MSE instead of my own loss function actually improves the custom training performance. Still, even with MSE and/or different initialization, the custom training doesn't come close to the performance of Keras fit.
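
For illustration, tweak 1 looks roughly like this (HeNormal is just an example choice, not necessarily the initialization I actually used); tweak 2 is simply swapping my own loss function for tf.keras.losses.MeanSquaredError() in the custom loop, as in the loop sketch above:

from tensorflow.keras.initializers import HeNormal

# Example: explicit initialization on the Dense head inside build_model()
x = Dense(1000, kernel_initializer=HeNormal())(x)
x = Dense(500, kernel_initializer=HeNormal())(x)
x = Dense(250, kernel_initializer=HeNormal())(x)
x = Dense(1, bias_initializer='ones')(x)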

  • Have you tried `optimizer.apply_gradients(zip(grads, model.trainable_variables))`? – AloneTogether Aug 22 '22 at 06:32
  • @AloneTogether Yes, sadly it doesn't work as well. – Matteo Pilz Aug 22 '22 at 06:33
  • and use a separate loss to your model..not the loss you defined in `model.compile`.. `model.compile` does not have to be called at all – AloneTogether Aug 22 '22 at 06:35
  • @AloneTogether, you are right and that is how I do it when I'm not using `model.fit(...)`. I leave out compile and call the loss directly. I just left it here to make it concise. – Matteo Pilz Aug 22 '22 at 06:38
  • @AloneTogether, I've added it to my post. – Matteo Pilz Aug 22 '22 at 06:49
  • @AloneTogether, I'm not providing a batch argument. My data is already loaded in as a batch/group. The actual loss is also calculated using the ratios between the entries inside this group. However, using the MSE also somewhat works. – Matteo Pilz Aug 22 '22 at 07:09
  • The code you provided first trains with fit, _then_ with a custom loop. The custom loop also doesn't print any loss or metric values. So either there are some major issues with your code, or the code you provided is not what you are actually using for training. – xdurch0 Aug 22 '22 at 07:31
  • @xdurch0, it is not the exact code I'm using for my training. I also don't run fit first and then the custom training; in general I comment one of them out when running the code. I left out the parts I think are not relevant, such as printing, writing the summary, saving the model, etc. – Matteo Pilz Aug 22 '22 at 07:41
  • You are iterating over `(x1, x2)` but are putting `x` into the model... – xdurch0 Aug 22 '22 at 08:13
  • @xdurch0, my mistake, in my code I iterate over x, y, so this was a fragment from that. – Matteo Pilz Aug 22 '22 at 08:19
  • Can you post a plot of the two trainings? Also, can you provide a Colab link with a reproducible example? – Alberto Sinigaglia Aug 22 '22 at 11:18

1 Answer


I have found the solution: it was a simple shape mismatch, which was somehow not picked up by any error check and occurred both with my custom loss function and with MSE. Using `x = Reshape(())(x)` as the final layer did the trick.
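
For reference, the fix in build_model looks roughly like this; the comment about broadcasting is my reading of why the mismatch went unnoticed rather than a verified explanation:

from tensorflow.keras.layers import Reshape

# ...only the tail of build_model() changes:
x = Dense(1, bias_initializer='ones')(x)
x = tf.math.abs(x)
# Without the Reshape, y_pred has shape (batch, 1) while my targets come as
# (batch,). Calling the loss directly then seems to broadcast silently instead
# of raising an error, whereas model.fit() reconciles the shapes internally.
x = Reshape(())(x)  # per-sample scalar output, shape (batch,)
return Model(inputs=[x1, x2], outputs=x)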