I'm trying to train a model on Google Colab using a TPU for a college project, with TensorFlow 1.15.0. As I understand from the TPU examples, I'm converting the tf.keras.models.Model instance to a TPU-compatible one by building it under an appropriate distribution strategy (code below).

TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu=TPU_WORKER)
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)
print('Running on TPU ', resolver.cluster_spec().as_dict()['worker'])

This is followed by the model creation and training calls (code below):

with strategy.scope():
  model = define_generator()
  adam = tf.train.AdamOptimizer(learning_rate=0.0002, beta1=0.5, beta2=0.999)
  model.compile(optimizer=adam, loss='mean_absolute_error', metrics=['accuracy'])
  model.summary()
  model.fit(X_train, Y_train, steps_per_epoch=1451, epochs=64, batch_size=8, callbacks=[term])

Where the define_generator() function is as follows:

# define an encoder block
def define_encoder_block(layer_in, n_filters, batchnorm=True):
  # weight initialization
  init = tf.keras.initializers.RandomNormal(stddev=0.02)
  # add downsampling layer
  g = tf.keras.layers.Conv2D(int(n_filters/2), (3,3), padding='same', kernel_initializer=init)(layer_in)
  g = tf.keras.layers.Conv2D(int(n_filters/2), (3,3), strides=(2,2), padding='same', kernel_initializer=init)(g)
  g = tf.keras.layers.Conv2D(n_filters, (3,3), padding='same', kernel_initializer=init)(g)
  # conditionally add batch normalization
  if batchnorm:
      g = tf.keras.layers.BatchNormalization()(g, training=True)
  # elu activation
  g = tf.keras.activations.elu(g)
  return g


# define a decoder block
def decoder_block(layer_in, skip_in, n_filters, dropout=True):
  # weight initialization
  init = tf.keras.initializers.RandomNormal(stddev=0.02)
  # add upsampling layer
  g = tf.keras.layers.Conv2D(int(n_filters/2), (3,3), padding='same', kernel_initializer=init)(layer_in)
  g = tf.keras.layers.Conv2D(int(n_filters/2), (3,3), padding='same', kernel_initializer=init)(layer_in)
  g = tf.keras.layers.Conv2DTranspose(n_filters, (3,3), strides=(2,2), padding='same', kernel_initializer=init)(g)
  # add batch normalization
  g = tf.keras.layers.BatchNormalization()(g, training=True)
  # conditionally add dropout
  if dropout:
      g = tf.keras.layers.Dropout(0.5)(g, training=True)
  # merge with skip connection
  g = tf.keras.layers.Concatenate()([g, skip_in])
  # elu activation
  g = tf.keras.activations.elu(g)
  return g

# define complete model
def define_generator(image_shape=(256,256,3)):
  # weight initialization
  init = tf.keras.initializers.RandomNormal(stddev=0.02)
  # image input
  in_image = tf.keras.layers.Input(shape=image_shape)
  # encoder model: C64-C128-C256-C512-C512-C512-C512-C512
  e1 = define_encoder_block(in_image, 64, batchnorm=False)
  e2 = define_encoder_block(e1, 128)
  e3 = define_encoder_block(e2, 256)
  e4 = define_encoder_block(e3, 512)
  e5 = define_encoder_block(e4, 512)
  e6 = define_encoder_block(e5, 512)
  e7 = define_encoder_block(e6, 512)
  # bottleneck, no batch norm; elu activation
  b = tf.keras.layers.Conv2D(512, (3,3), strides=(2,2), padding='same', kernel_initializer=init)(e7)
  b = tf.keras.activations.elu(b)
  # decoder model: CD512-CD1024-CD1024-C1024-C1024-C512-C256-C128
  d1 = decoder_block(b, e7, 512)
  d2 = decoder_block(d1, e6, 512)
  d3 = decoder_block(d2, e5, 512)
  d4 = decoder_block(d3, e4, 512, dropout=False)
  d5 = decoder_block(d4, e3, 256, dropout=False)
  d6 = decoder_block(d5, e2, 128, dropout=False)
  d7 = decoder_block(d6, e1, 64, dropout=False)
  # output
  g = tf.keras.layers.Conv2DTranspose(3, (3,3), strides=(2,2), padding='same', kernel_initializer=init)(d7)
  out_image = tf.keras.activations.tanh(g)
  # define model
  model = tf.keras.models.Model(in_image, out_image)
  return model

However, I get InternalError: Failed to serialize message, which traces back to the model.fit() call. I have searched everywhere for a solution but was unable to find one. Can somebody please help me out?

Here's the link to my Colab notebook where the full trace can be found:

https://colab.research.google.com/drive/1bA1UlSMGuqH8Ph5PuLfslM2f71SaEtd-


1 Answer

Support for Keras models on TPU has improved significantly in recent releases. I've gone ahead and updated your code sample for TF 2.2. Most of the changes are simple renames; the largest change is that I set up your input as a tf.data.Dataset. For best results on TPU, we always recommend using tf.data.Dataset instead of passing numpy arrays directly to model.fit. If you already have your data in numpy, you can create a dataset with tf.data.Dataset.from_tensor_slices((X_train, Y_train)), although you may get better results using TFRecords. I don't have access to your original dataset, so I substituted random tensors instead.
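
For reference, here's a minimal sketch of how paired image arrays could be written to and read back from a TFRecord file. This is not part of your original code: the file name train.tfrecord and the feature keys 'x'/'y' are arbitrary placeholders, and the (256, 256, 3) shape matches your generator's input.

import tensorflow as tf

def write_tfrecord(path, X, Y):
  # serialize each (input, target) pair as a tf.train.Example with two bytes features
  with tf.io.TFRecordWriter(path) as writer:
    for x, y in zip(X, Y):
      example = tf.train.Example(features=tf.train.Features(feature={
          'x': tf.train.Feature(bytes_list=tf.train.BytesList(
              value=[tf.io.serialize_tensor(tf.constant(x, tf.float32)).numpy()])),
          'y': tf.train.Feature(bytes_list=tf.train.BytesList(
              value=[tf.io.serialize_tensor(tf.constant(y, tf.float32)).numpy()])),
      }))
      writer.write(example.SerializeToString())

def parse_example(serialized):
  # decode one Example back into a pair of float32 image tensors
  features = tf.io.parse_single_example(serialized, {
      'x': tf.io.FixedLenFeature([], tf.string),
      'y': tf.io.FixedLenFeature([], tf.string),
  })
  x = tf.reshape(tf.io.parse_tensor(features['x'], tf.float32), (256, 256, 3))
  y = tf.reshape(tf.io.parse_tensor(features['y'], tf.float32), (256, 256, 3))
  return x, y

# write_tfrecord('train.tfrecord', X_train, Y_train)
# dataset = (tf.data.TFRecordDataset('train.tfrecord')
#            .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
#            .batch(8, drop_remainder=True)
#            .prefetch(16))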

Here's the updated code:

%tensorflow_version 2.x
import os
import tensorflow as tf
import numpy as np

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

# define an encoder block
def define_encoder_block(layer_in, n_filters, batchnorm=True):
  # weight initialization
  init = tf.keras.initializers.RandomNormal(stddev=0.02)
  # add downsampling layer    
  g = tf.keras.layers.Conv2D(int(n_filters/2), (3,3), padding='same', kernel_initializer=init)(layer_in)
  g = tf.keras.layers.Conv2D(int(n_filters/2), (3,3), strides=(2,2), padding='same', kernel_initializer=init)(g)    
  g = tf.keras.layers.Conv2D(n_filters, (3,3), padding='same', kernel_initializer=init)(g)  
  # conditionally add batch normalization
  if batchnorm:
      g = tf.keras.layers.BatchNormalization()(g, training=True)
  # elu activation
  g = tf.keras.activations.elu(g)
  return g


# define a decoder block
def decoder_block(layer_in, skip_in, n_filters, dropout=True):
  # weight initialization
  init = tf.keras.initializers.RandomNormal(stddev=0.02)
  # add upsampling layer
  g = tf.keras.layers.Conv2D(int(n_filters/2), (3,3),  padding='same', kernel_initializer=init)(layer_in)
  g = tf.keras.layers.Conv2D(int(n_filters/2), (3,3),  padding='same', kernel_initializer=init)(layer_in)    
  g = tf.keras.layers.Conv2DTranspose(n_filters, (3,3), strides=(2,2),  padding='same', kernel_initializer=init)(g)    
  # add batch normalization
  g = tf.keras.layers.BatchNormalization()(g, training=True)
  # conditionally add dropout
  if dropout:
      g = tf.keras.layers.Dropout(0.5)(g, training=True)
  # merge with skip connection
  g = tf.keras.layers.Concatenate()([g, skip_in])
  # elu activation
  g = tf.keras.activations.elu(g)
  return g

# define complete model
def define_generator(image_shape=(256,256,3)):
  # weight initialization
  init = tf.keras.initializers.RandomNormal(stddev=0.02)
  # image input
  in_image = tf.keras.layers.Input(shape=image_shape)
  # encoder model: C64-C128-C256-C512-C512-C512-C512-C512
  e1 = define_encoder_block(in_image, 64, batchnorm=False)
  e2 = define_encoder_block(e1, 128)
  e3 = define_encoder_block(e2, 256)
  e4 = define_encoder_block(e3, 512)
  e5 = define_encoder_block(e4, 512)
  e6 = define_encoder_block(e5, 512)
  e7 = define_encoder_block(e6, 512)
  # bottleneck, no batch norm; elu activation
  b = tf.keras.layers.Conv2D(512, (3,3), strides=(2,2), padding='same', kernel_initializer=init)(e7)
  b = tf.keras.activations.elu(b)
  # decoder model: CD512-CD1024-CD1024-C1024-C1024-C512-C256-C128
  d1 = decoder_block(b, e7, 512)
  d2 = decoder_block(d1, e6, 512)
  d3 = decoder_block(d2, e5, 512)
  d4 = decoder_block(d3, e4, 512, dropout=False)
  d5 = decoder_block(d4, e3, 256, dropout=False)
  d6 = decoder_block(d5, e2, 128, dropout=False)
  d7 = decoder_block(d6, e1, 64, dropout=False)
  # output
  g = tf.keras.layers.Conv2DTranspose(3, (3,3), strides=(2,2), padding='same', kernel_initializer=init)(d7)
  out_image = tf.keras.activations.tanh(g)
  # define model
  model = tf.keras.models.Model(in_image, out_image)
  return model

# Values from original notebook
# shape = (11612,256,256,3) # this caused my notebook to OOM since it's huge
shape = (256,256,256,3)
batch_size = 8
epochs = 64

# Create fake random dataset
X_train = np.random.rand(*shape)
Y_train = np.random.rand(*shape)
dataset = (tf.data.Dataset.from_tensor_slices((X_train, Y_train))
    .repeat(epochs)
    .batch(batch_size, drop_remainder=True)
    .prefetch(16))

with strategy.scope():
  model = define_generator()
  adam = tf.keras.optimizers.Adam(learning_rate=0.0002, beta_1=0.5, beta_2=0.999)
  model.compile(optimizer=adam, loss='mean_absolute_error', metrics=['accuracy'])
  model.summary()

model.fit(dataset)
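
One note on the pipeline above: because the dataset is repeated epochs times up front, model.fit(dataset) makes a single pass over the repeated data rather than reporting 64 separate Keras epochs. If you'd rather see per-epoch progress, you could drop the .repeat(epochs) call from the pipeline and pass the epoch count to fit instead, e.g.:

model.fit(dataset, epochs=epochs)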