
I am a beginner trying to work with TPUs using TensorFlow in Kaggle Kernels. I previously trained a U-Net model on a dataset using a GPU, and now I am trying to run it on a TPU. I wrote the dataset images and masks into TFRecords, and the TFRecord pipeline returns an image and a mask. When I train on the TPU, the loss is always NaN, even though the accuracy metric looks normal. Since this is the same model and loss I used on the GPU, I am guessing the problem is in the TFRecords or in loading the dataset. The code for loading the data is given below:

import re
import numpy as np
import tensorflow as tf

AUTO = tf.data.experimental.AUTOTUNE  # assumed definition of AUTO used below
# IMAGE_SIZE, BATCH_SIZE and the *_FILENAMES lists are defined earlier in the notebook

def decode_image(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)
    image = tf.cast(image, tf.float32) / 255.0  # convert image to floats in [0, 1] range
    image = tf.reshape(image, [*IMAGE_SIZE, 3]) # explicit size needed for TPU
    return image

def decode_image_mask(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)
    image = tf.cast(image, tf.float64) / 255.0  # convert mask to floats in [0, 1] range
    image = tf.reshape(image, [*IMAGE_SIZE, 3]) # explicit size needed for TPU
    image = tf.image.rgb_to_grayscale(image)
    image = tf.math.round(image)  # binarize the mask
    return image

def read_tfrecord(example):
    TFREC_FORMAT = {
        "image": tf.io.FixedLenFeature([], tf.string), # tf.string means bytestring
        "mask": tf.io.FixedLenFeature([], tf.string),  # shape [] means single element
    }
    example = tf.io.parse_single_example(example, TFREC_FORMAT)
    image = decode_image(example['image'])
    mask = decode_image_mask(example['mask'])
    return image, mask



def load_dataset(filenames, ordered=False):
    # Read from TFRecords. For optimal performance, reading from multiple files at once and
    # disregarding data order. Order does not matter since we will be shuffling the data anyway.

    ignore_order = tf.data.Options()
    if not ordered:
        ignore_order.experimental_deterministic = False # disable order, increase speed

    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO) # automatically interleaves reads from multiple files
    dataset = dataset.with_options(ignore_order) # uses data as soon as it streams in, rather than in its original order
    dataset = dataset.map(read_tfrecord, num_parallel_calls=AUTO)
    return dataset



def get_training_dataset():
    dataset = load_dataset(TRAINING_FILENAMES)
    dataset = dataset.repeat() # the training dataset must repeat for several epochs
    dataset = dataset.shuffle(2048)
    dataset = dataset.batch(BATCH_SIZE,drop_remainder=True)
    dataset = dataset.prefetch(AUTO) # prefetch next batch while training (autotune prefetch buffer size)
    return dataset

def get_validation_dataset(ordered=False):
    dataset = load_dataset(VALIDATION_FILENAMES, ordered=ordered)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.cache()
    dataset = dataset.prefetch(AUTO) # prefetch next batch while training (autotune prefetch buffer size)
    return dataset


def count_data_items(filenames):
    # the number of data items is written in the name of the .tfrec files, i.e. flowers00-230.tfrec = 230 data items
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    return np.sum(n)
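
For reference, a count like this is typically fed into model.fit as steps_per_epoch, since the repeated training dataset is infinite. A minimal sketch of that usage, assuming the hypothetical names NUM_TRAINING_IMAGES, STEPS_PER_EPOCH and EPOCHS (they do not appear in the code above):

NUM_TRAINING_IMAGES = count_data_items(TRAINING_FILENAMES)
STEPS_PER_EPOCH = NUM_TRAINING_IMAGES // BATCH_SIZE  # needed because get_training_dataset() repeats forever

model.fit(get_training_dataset(),
          steps_per_epoch=STEPS_PER_EPOCH,
          validation_data=get_validation_dataset(),
          epochs=EPOCHS)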

So, what am I doing wrong?

Here are a few suggestions: try decreasing the learning rate (it matters on TPU), check where in your network the gradient might be exploding or vanishing, and try casting your input to bfloat16. Let me know if this helps – Proko Jun 16 '20 at 08:18
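
For the bfloat16 suggestion, a minimal sketch of what casting the decoded input could look like (this variant of decode_image is illustrative, not part of the question's code):

def decode_image_bf16(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)
    image = tf.cast(image, tf.bfloat16) / 255.0  # bfloat16 is the TPU-native 16-bit float format
    image = tf.reshape(image, [*IMAGE_SIZE, 3])  # explicit size needed for TPU
    return image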

1 Answer


It turns out the problem was that I was unbatching the dataset and re-batching it to 20 so I could view the images and masks in matplotlib, and this was corrupting how data was being fed to the model, hence the NaN loss. Making a separate copy of the dataset for viewing images, while sending the original to training, solved the problem.
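
A minimal sketch of that fix (the viz_dataset name and the batch size of 20 for viewing are illustrative):

train_dataset = get_training_dataset()         # passed to model.fit untouched
viz_dataset = get_training_dataset()           # a separate copy, safe to re-batch
viz_dataset = viz_dataset.unbatch().batch(20)  # only the copy is re-batched for matplotlib

image_batch, mask_batch = next(iter(viz_dataset))  # inspect a batch without disturbing the training input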
