
I am training a model with Keras that consists of a Huggingface RoBERTa model as a backbone with a downstream task of span prediction and binary prediction for text.

I have been training the model regularly with datasets under 2 GB in size, which has worked fine. The dataset has grown in recent weeks and is now around 2.3 GB, which puts it over the 2 GB Google protobuf hard limit. Since TensorFlow uses protobuf to buffer the tensors for the TPUs, this makes it impossible to train the model with Keras from plain NumPy tensors without a generator: trying to serve all the data at once fails. With a dataset under 2 GB, everything works fine. TPUs don't support Keras generators yet, so I was looking into using the tf.data.Dataset API instead.

After seeing this question I adapted code from this gist to try to get this to work, resulting in the following code:

import tensorflow as tf

def tfdata_generator(x, y, is_training, batch_size=384):
    # Build a tf.data pipeline from in-memory NumPy arrays.
    dataset = tf.data.Dataset.from_tensor_slices((x, y))

    if is_training:
        dataset = dataset.shuffle(1000)
    dataset = dataset.map(map_fn)  # map_fn is defined elsewhere in the notebook
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat()
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

    return dataset

The model is created and compiled for TPU use as before, which has never caused any problems. I then create the generator and call the fit function:

train_gen = tfdata_generator(x_train, y_train, is_training=True)

model.fit(
  train_gen,
  steps_per_epoch=10000,
  epochs=1,
)

This results in the following error:

FetchOutputs node : not found [Op:AutoShardDataset]
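For reference, the model creation and compilation (not shown above) follow the standard TPUStrategy pattern; a rough sketch is below, where create_model() is just a placeholder for the actual RoBERTa-backbone build and compile step:

import tensorflow as tf

# Connect to the Colab TPU and build/compile the model inside the strategy scope.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)  # tf.distribute.experimental.TPUStrategy on TF < 2.4

with strategy.scope():
    model = create_model()  # placeholder: builds and compiles the RoBERTa backbone + heads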

edit: here is a Colab with bare-minimum code and a dummy dataset. Unfortunately, because of Colab RAM restrictions, building a dummy dataset exceeding 2 GB in size crashes the notebook, but it still shows code that runs and works on CPU/TPU with a smaller dataset.

This code does, however, work on a CPU. I can't find any further information on this error online, and I haven't been able to find more detailed information on how to feed Keras training data to TPUs using generators. I have looked into TFRecords a bit, but documentation on using them with TPUs also seems to be missing. All help appreciated!

st0ne

1 Answer


For NumPy tensors, 2 GB seems to be a hard limit for TPU training (as of now). I see two workarounds that you could use.

  1. Write your data to a GCS bucket as TFRecord/CSV files using TFRecordWriter and let the TPU read the training data from that bucket (see the sketch after this list).
  2. Use the tf.data service for your input pipeline. It is a relatively new service that lets you run your data pipeline on separate workers. For details on how to run it, please see running_the_tfdata_service.
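Here is a rough sketch of workaround 1, assuming x_train/y_train are plain numeric arrays; the bucket path, NUM_FEATURES and NUM_LABELS are placeholders, and you would adapt the features to your actual span/binary labels (e.g. tf.train.Int64List for integer token ids):

import numpy as np
import tensorflow as tf

def serialize_example(x_row, y_row):
    # Pack one (features, labels) pair into a tf.train.Example.
    feature = {
        "x": tf.train.Feature(float_list=tf.train.FloatList(value=np.ravel(x_row).tolist())),
        "y": tf.train.Feature(float_list=tf.train.FloatList(value=np.ravel(y_row).tolist())),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

# 1. Write the dataset to the bucket (shard into several files for large datasets).
with tf.io.TFRecordWriter("gs://my_bucket/training/data-00000-of-00001.tfrecord") as writer:
    for x_row, y_row in zip(x_train, y_train):
        writer.write(serialize_example(x_row, y_row))

# 2. Read it back as a tf.data pipeline that the TPU consumes directly from GCS.
feature_spec = {
    "x": tf.io.FixedLenFeature([NUM_FEATURES], tf.float32),
    "y": tf.io.FixedLenFeature([NUM_LABELS], tf.float32),
}

def parse_example(record):
    parsed = tf.io.parse_single_example(record, feature_spec)
    return parsed["x"], parsed["y"]

files = tf.data.Dataset.list_files("gs://my_bucket/training/*.tfrecord")
dataset = (
    tf.data.TFRecordDataset(files)
    .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .shuffle(1000)
    .batch(384, drop_remainder=True)  # fixed batch size keeps TPU shapes static
    .repeat()
    .prefetch(tf.data.experimental.AUTOTUNE)
)

model.fit(dataset, steps_per_epoch=10000, epochs=1)

With the records in GCS, the TPU workers read the data themselves, so the 2 GB protobuf limit on in-memory tensors no longer applies.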
Gagik
  • I did manage to reduce the size of the dataset by casting the data to uint16, but I will need to be able to train larger datasets eventually. I'll report back after I've tried using TFRecords. Any suggestions on how to feed the data as a TFRecord / tf.data to a Keras model from GCS? – st0ne Jan 26 '21 at 19:48
  • Here is an example code: https://github.com/tensorflow/models/blob/master/official/recommendation/ncf_input_pipeline.py#L33 where input_file_pattern can point to your training/test files, e.g. "gs://my_bucket/training/*" if TFRecords are stored in the directory "gs://my_bucket/training/" – Gagik Jan 26 '21 at 20:45