I am trying to train a seq2seq model for language translation, using code copied from this Kaggle Notebook into Google Colab. The code works fine on CPU and GPU, but it raises an error when training on a TPU. The same question has already been asked here.

Here is my code:

    strategy = tf.distribute.experimental.TPUStrategy(resolver)
    
    with strategy.scope():
      model = create_model()
      model.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy')
    
    model.fit_generator(generator = generate_batch(X_train, y_train, batch_size = batch_size),
                        steps_per_epoch = train_samples // batch_size,
                        epochs = epochs,
                        validation_data = generate_batch(X_test, y_test, batch_size = batch_size),
                        validation_steps = val_samples // batch_size)

Traceback:

Epoch 1/2
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-60-940fe0ee3c8b> in <module>()
      3                     epochs = epochs,
      4                     validation_data = generate_batch(X_test, y_test, batch_size = batch_size),
----> 5                     validation_steps = val_samples // batch_size)

10 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
    992           except Exception as e:  # pylint:disable=broad-except
    993             if hasattr(e, "ag_error_metadata"):
--> 994               raise e.ag_error_metadata.to_exception(e)
    995             else:
    996               raise

ValueError: in user code:
    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:853 train_function  *
    return step_function(self, iterator)
    /usr/local/lib/python3.7/dist-packages/keras/engine/training.py:842 step_function  **
    outputs = model.distribute_strategy.run(run_step, args=(data,))
...
ValueError: None values not supported.

I can't figure out the error, but I suspect it comes from this generate_batch function:

X, y = lines['english_sentence'], lines['hindi_sentence']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 34)

def generate_batch(X = X_train, y = y_train, batch_size = 128):
    while True:
        for j in range(0, len(X), batch_size):
 
            encoder_input_data = np.zeros((batch_size, max_length_src), dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_length_tar), dtype='float32')
            decoder_target_data = np.zeros((batch_size, max_length_tar, num_decoder_tokens), dtype='float32')
            
            for i, (input_text, target_text) in enumerate(zip(X[j:j + batch_size], y[j:j + batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i, t] = input_token_index[word]
                for t, word in enumerate(target_text.split()):
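                    if t < len(target_text.split()) - 1:
                        # Decoder input: every token except the last one
                        decoder_input_data[i, t] = target_token_index[word]
                    if t > 0:
                        # Decoder target: one step ahead of the decoder input, one-hot encoded
                        decoder_target_data[i, t - 1, target_token_index[word]] = 1.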
            yield([encoder_input_data, decoder_input_data], decoder_target_data)

My Colab notebook - here
Kaggle dataset - here
TensorFlow version - 2.6

Edit - Please don't tell me to downgrade TensorFlow/Keras to 1.x. I can downgrade to TensorFlow 2.0, 2.1, or 2.3, but not to 1.x. I don't understand TensorFlow 1.x, and there is no point in using a three-year-old version.

Adarsh Wase
  • Does your dataset have None/null values? What version of TensorFlow are you using? It should work fine with TF 2.5+. – Gagik Nov 05 '21 at 03:29
  • @Gagik, I'm on TF 2.6. Yes, it had some NaN values; I dropped them, but it still gives me the same error. – Adarsh Wase Nov 05 '21 at 06:00
  • I looked at your code, and you don't add an OOV token to your vocabulary (+1). What will happen if an unseen word enters the model? – bitbang Nov 06 '21 at 21:53
  • Can you try the model with dummy data like this, please? xx = [np.random.rand(20,30), np.random.rand(20,30)]; yy = np.random.rand(20,30); model.fit(x=xx, y=yy, epochs=2). I can't figure out your model's input and output shapes. If you try with dummy data on the TPU, maybe we can work out where the bug is. – bitbang Nov 06 '21 at 21:57
  • I can't fit that dummy data: ValueError: Shapes (None, None) and (None, 30, 81978) are incompatible – Adarsh Wase Nov 07 '21 at 06:37
  • Oh OK, I thought you designed this model. CPU and GPU training happens on one processor, but for TPU training the model mirrors itself on every TPU core, so you need to be a little more careful about re-distributing the losses and other metrics. – bitbang Nov 07 '21 at 07:27

3 Answers


As stated in the answer you linked, the tensorflow.data API works better with TPUs. To adapt it to your case, use return instead of yield in the generate_batch function, so that it builds the arrays for the whole dataset once and returns them directly instead of yielding one batch at a time:

def generate_batch(X = X_train, y = y_train, batch_size = 128):
    ...
    return encoder_input_data, decoder_input_data, decoder_target_data

encoder_input_data, decoder_input_data, decoder_target_data = generate_batch(X_train, y_train, batch_size=128)
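
For reference, here is a minimal sketch of what such a non-generator version might look like. The name encode_dataset, its explicit parameters, and the toy vocabulary are assumptions made for illustration; the loop body mirrors the encoding logic from the question, applied to all samples instead of one batch.

```python
import numpy as np

def encode_dataset(X, y, input_token_index, target_token_index,
                   max_length_src, max_length_tar, num_decoder_tokens):
    """Encode every sentence pair at once (hypothetical non-generator variant)."""
    n = len(X)
    encoder_input_data = np.zeros((n, max_length_src), dtype="float32")
    decoder_input_data = np.zeros((n, max_length_tar), dtype="float32")
    decoder_target_data = np.zeros((n, max_length_tar, num_decoder_tokens), dtype="float32")

    for i, (input_text, target_text) in enumerate(zip(X, y)):
        for t, word in enumerate(input_text.split()):
            encoder_input_data[i, t] = input_token_index[word]
        target_words = target_text.split()
        for t, word in enumerate(target_words):
            if t < len(target_words) - 1:
                # Decoder input: every token except the last one
                decoder_input_data[i, t] = target_token_index[word]
            if t > 0:
                # Decoder target: one step ahead of the input, one-hot encoded
                decoder_target_data[i, t - 1, target_token_index[word]] = 1.0
    return encoder_input_data, decoder_input_data, decoder_target_data

# Toy data and vocabularies, made up for this example only (index 0 is padding)
X_toy = ["hello world", "good morning"]
y_toy = ["START namaste duniya END", "START suprabhat END"]
toy_input_index = {"hello": 1, "world": 2, "good": 3, "morning": 4}
toy_target_index = {"START": 1, "namaste": 2, "duniya": 3, "suprabhat": 4, "END": 5}

enc, dec_in, dec_tgt = encode_dataset(
    X_toy, y_toy, toy_input_index, toy_target_index,
    max_length_src=3, max_length_tar=4, num_decoder_tokens=6,
)
```

Keep in mind that the one-hot target array holds n × max_length_tar × num_decoder_tokens floats, so with a large vocabulary it may not fit in memory for the full training set.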

And then use tensorflow.data to structure your data:

from tensorflow.data import Dataset

encoder_input_data = Dataset.from_tensor_slices(encoder_input_data)
decoder_input_data = Dataset.from_tensor_slices(decoder_input_data)
decoder_target_data = Dataset.from_tensor_slices(decoder_target_data)
ds = Dataset.zip((encoder_input_data, decoder_input_data, decoder_target_data)).map(map_fn).batch(1024)

where map_fn is defined by:

def map_fn(encoder_input, decoder_input, decoder_target):
    # Regroup each element into the (inputs, targets) structure that Model.fit expects
    return (encoder_input, decoder_input), decoder_target

And finally use Model.fit instead of Model.fit_generator:

model.fit(x=ds, epochs=epochs)
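
As a variation, the same pipeline can be sketched more compactly by passing the nested (inputs, targets) structure straight to from_tensor_slices, which makes the zip and map steps unnecessary. The array sizes and batch_size below are placeholder assumptions, and the zero-filled arrays only stand in for the real encoded data. drop_remainder=True is included because TPUs compile with XLA, which generally wants static batch shapes; whether that alone resolves the error in the question is not confirmed.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in arrays with the same layout as the encoded data
num_samples, max_length_src, max_length_tar, num_decoder_tokens = 10, 5, 6, 7
encoder_input_data = np.zeros((num_samples, max_length_src), dtype="float32")
decoder_input_data = np.zeros((num_samples, max_length_tar), dtype="float32")
decoder_target_data = np.zeros(
    (num_samples, max_length_tar, num_decoder_tokens), dtype="float32"
)

batch_size = 4  # assumed value, chosen only for this example

ds = (
    tf.data.Dataset.from_tensor_slices(
        ((encoder_input_data, decoder_input_data), decoder_target_data)
    )
    # Drop the last partial batch so every batch has the same static shape
    .batch(batch_size, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)

# Collect the shape of every batch to check the structure
batch_shapes = [
    (tuple(inputs[0].shape), tuple(inputs[1].shape), tuple(targets.shape))
    for inputs, targets in ds
]
```

A dataset built this way can be passed to model.fit in the same way as ds above.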
R. Marolahy

You need to downgrade to Keras 1.0.2. If that works, great; otherwise I will suggest another solution.

Faisal Shahbaz
  • No, I want to use TensorFlow and Keras 2.x. I don't understand the 1.x version. Can you suggest a solution that uses 2.x? – Adarsh Wase Nov 08 '21 at 20:03

You need to update Keras, and your problem will be fixed.

  • I am using TensorFlow 2.6 (Keras included), and 2.6 is the latest version. **I can't update it any further.** – Adarsh Wase Nov 09 '21 at 05:54