
I am currently trying to get the hang of the TF 2.0 API, but when I compared GradientTape to a regular keras.Model.fit I noticed that:

  1. It ran slower (probably due to eager execution).

  2. It converged much more slowly (and I am not sure why).

+--------+--------------+--------------+------------------+
|  Epoch | GradientTape | GradientTape | keras.Model.fit  |
|        |              |  shuffling   |                  |
+--------+--------------+--------------+------------------+
|    1   |     0.905    |     0.918    |      0.8793      |
+--------+--------------+--------------+------------------+
|    2   |     0.352    |     0.634    |      0.2226      |
+--------+--------------+--------------+------------------+
|    3   |     0.285    |     0.518    |      0.1192      |
+--------+--------------+--------------+------------------+
|    4   |     0.282    |     0.458    |      0.1029      |
+--------+--------------+--------------+------------------+
|    5   |     0.275    |     0.421    |      0.0940      |
+--------+--------------+--------------+------------------+

Here is the training loop I used with the GradientTape:


import tensorflow as tf
from tensorflow import keras
from tqdm import tqdm

optimizer = keras.optimizers.Adam()
glove_model = GloveModel(vocab_size=len(labels))
train_loss = keras.metrics.Mean(name='train_loss')

@tf.function
def train_step(examples, labels):
    with tf.GradientTape() as tape:
        predictions = glove_model(examples)
        loss = glove_model.glove_loss(labels, predictions)

    gradients = tape.gradient(loss, glove_model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, glove_model.trainable_variables))

    train_loss(loss)



total_step = 0
for epoch in range(epochs_number):

    pbar = tqdm(train_ds.enumerate(), total=int(len(index_data) / batch_size) + 1)

    for ix, (examples, labels) in pbar:

        train_step(examples, labels)


    print(f"Epoch {epoch + 1}, Loss {train_loss.result()}")

    # Reset the metrics for the next epoch
    train_loss.reset_states()

And here is the keras.Model.fit training:

glove_model.compile(optimizer, glove_model.glove_loss)
glove_model.fit(train_ds, epochs=epochs_number)

Here is the tf.data.Dataset source:

import numpy as np
from tensorflow import data

train_ds = data.Dataset.from_tensor_slices(
    (np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
).shuffle(100000).batch(batch_size, drop_remainder=True)

And here is the model:

from tensorflow import keras
from tensorflow.keras import layers


class GloveModel(keras.Model):

    def __init__(self, vocab_size, dim=100, a=3/4, x_max=100):
        super(GloveModel, self).__init__()

        self.vocab_size = vocab_size
        self.dim = dim
        self.a = a
        self.x_max = x_max

        self.target_embedding = layers.Embedding(
            input_dim=self.vocab_size, output_dim=self.dim, input_length=1, name="target_embedding"
        )
        self.target_bias = layers.Embedding(
            input_dim=self.vocab_size, output_dim=1, input_length=1, name="target_bias"
        )

        self.context_embedding = layers.Embedding(
            input_dim=self.vocab_size, output_dim=self.dim, input_length=1, name="context_embedding"
        )
        self.context_bias = layers.Embedding(
            input_dim=self.vocab_size, output_dim=1, input_length=1, name="context_bias"
        )

        self.dot_product = layers.Dot(axes=-1, name="dot")

        self.prediction = layers.Add(name="add")
        self.step = 0

    def call(self, inputs):

        target_ix = inputs[:, 0]
        context_ix = inputs[:, 1]

        target_embedding = self.target_embedding(target_ix)
        target_bias = self.target_bias(target_ix)

        context_embedding = self.context_embedding(context_ix)
        context_bias = self.context_bias(context_ix)

        dot_product = self.dot_product([target_embedding, context_embedding])
        prediction = self.prediction([dot_product, target_bias, context_bias])

        return prediction

    def glove_loss(self, y_true, y_pred):

        weight = tf.math.minimum(
            tf.math.pow(y_true/self.x_max, self.a), 1.0
        )
        loss_value = tf.math.reduce_mean(weight * tf.math.pow(y_pred - tf.math.log(y_true), 2.0))

        return loss_value
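
For reference, glove_loss corresponds to the GloVe weighted least-squares objective, written here with a mean rather than the paper's sum to match the reduce_mean in the code (a = 3/4 and x_max = 100 are the constructor defaults):

\text{loss} = \frac{1}{N} \sum_{k=1}^{N} \min\!\left(\left(\frac{y^{\text{true}}_k}{x_{\max}}\right)^{a}, 1\right) \left(y^{\text{pred}}_k - \log y^{\text{true}}_k\right)^{2}

where y_true holds the co-occurrence counts and y_pred is the dot product of the two embeddings plus the two bias terms.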



I tried multiple configurations and optimizers, but nothing seems to change the convergence rate.

Benjamin Breton
    One thing to look at is data shuffling before each epoch. – THN Oct 30 '19 at 07:18
  • I have exactly the same shuffling between the fit method and GradientTape because I use the tf.data API. – Benjamin Breton Oct 30 '19 at 08:34
  • I think they are not exactly the same. Can you show the code of your `tfds`? Note that Keras `.fit` defaults to shuffling before each epoch. You can test by turning off shuffling in Keras and comparing the convergence rates. – THN Oct 30 '19 at 14:28
  • @THN I will send it to you, but I already perform a shuffle with the tf.data.Dataset API, so it shouldn't change anything, right? – Benjamin Breton Oct 30 '19 at 15:24
  • @THN I added the tf.data.Dataset – Benjamin Breton Oct 30 '19 at 19:21
  • OK, thanks for the code, I will add an answer. – THN Oct 31 '19 at 02:09
  • It is absolutely amazing that you observe such a difference in training with and without global shuffling. What is your dataset? Is it small and/or correlated to begin with? – P-Gn Dec 17 '19 at 09:59
  • It is French Wikipedia, so not small (~1.6B tokens), but it is highly correlated; I built a co-occurrence matrix of tokens to train GloVe embeddings. – Benjamin Breton Dec 17 '19 at 17:28

2 Answers


Dataset.shuffle() only shuffles within its buffer, so each epoch sees essentially the same order. Keras .fit() uses some magic to shuffle the whole dataset before each epoch. To do this in TF, you need to use Dataset.repeat(epochs_number) and .shuffle(..., reshuffle_each_iteration=True):

train_ds = data.Dataset.from_tensor_slices(
    (np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
    ).shuffle(100000, reshuffle_each_iteration=True
    ).batch(batch_size, drop_remainder=True
    ).repeat(epochs_number)

for ix, (examples, labels) in train_ds.enumerate():
    train_step(examples, labels)
    current_epoch = ix // (len(index_data) // batch_size)

This workaround is neither beautiful nor natural, but for the moment you can use it to shuffle each epoch. It's a known issue that will be fixed; in the future you will be able to use for epoch in range(epochs_number) instead of .repeat(), as in the sketch below.
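
A rough sketch of that simpler pattern (assuming a TF version where shuffle(..., reshuffle_each_iteration=True) reshuffles on every new pass over the dataset, and reusing the names from the question):

train_ds = data.Dataset.from_tensor_slices(
    (np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1)]), index_data)
).shuffle(100000, reshuffle_each_iteration=True).batch(batch_size, drop_remainder=True)

for epoch in range(epochs_number):
    # Each pass over train_ds builds a fresh iterator, which reshuffles the buffer
    for examples, labels in train_ds:
        train_step(examples, labels)
    print(f"Epoch {epoch + 1}, Loss {train_loss.result()}")
    train_loss.reset_states()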

THN
  • I am sorry, I added your code, but the convergence is even slower. I added the results in the column GradientTape shuffling. It doesn't make sense to me... – Benjamin Breton Oct 31 '19 at 22:49
  • @BenjaminBreton At this point, I suspect there are other errors lurking in your code. Maybe it is best to link to your repo to show the full code. If you are sure your experiments are correctly conducted, you should open an issue on the tensorflow repo. – THN Nov 01 '19 at 05:32
  • Thank you so much for your help @THN I posted the issue on the TF2.0 repo https://github.com/tensorflow/tensorflow/issues/33898. I will try to reproduce the error with a different model. – Benjamin Breton Nov 01 '19 at 11:19
  • Turns out you were right, @THN. I shuffled using numpy and it solved the problem. I will post a comprehensive answer. – Benjamin Breton Nov 03 '19 at 20:49

The problem came from the shuffling done with the tf.data.Dataset method: it only shuffled the dataset one buffer at a time. Using keras.Model.fit yielded better results because it probably adds another shuffle.
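
Roughly speaking, Dataset.shuffle() only draws samples from a sliding buffer of buffer_size elements, so when the buffer is much smaller than a correlated dataset the global order is largely preserved. A toy sketch just to illustrate the behaviour (the exact output is random):

import tensorflow as tf

# With buffer_size far smaller than the dataset, samples are drawn from a small
# sliding window, so a sorted/correlated dataset comes out roughly in order.
ds = tf.data.Dataset.range(10).shuffle(buffer_size=3)
print([int(x) for x in ds])
# e.g. [1, 0, 3, 2, 5, 4, 7, 6, 9, 8] -- still nearly sorted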

I added a shuffle with numpy.random.shuffle and it improved the performance with both training methods:

The generation of the dataset is now:

numpy_data = np.hstack([index_rows.reshape(-1, 1), index_cols.reshape(-1, 1), index_data.reshape(-1, 1)])

np.random.shuffle(numpy_data)

indexes = np.array(numpy_data[:, :2], dtype=np.uint32)
labels = np.array(numpy_data[:, 2].reshape(-1, 1), dtype=np.float32)

train_ds = data.Dataset.from_tensor_slices(
    (indexes, labels)
).shuffle(100000).batch(batch_size, drop_remainder=True)

And the results are:

+--------+--------------+------------------+
|  Epoch | GradientTape |  keras.Model.fit |
+--------+--------------+------------------+
|    1   |     0.294    |      0.294       |
+--------+--------------+------------------+
|    2   |     0.111    |      0.110       |
+--------+--------------+------------------+
|    3   |     0.089    |      0.089       |
+--------+--------------+------------------+
|    4   |     0.074    |      0.075       |
+--------+--------------+------------------+
|    5   |     0.063    |      0.063       |
+--------+--------------+------------------+

The training time is now roughly the same for both methods, at about 2 minutes per epoch.

Benjamin Breton