
I am currently training a neural network on a TPU. I changed the runtime type and initialized the TPU, but training does not seem to be any faster. I followed https://www.tensorflow.org/guide/tpu. Did I do something wrong?

import os
import tensorflow as tf

# TPU initialization
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))

.
.
.
# experimental_steps_per_execution = 50
model.compile(optimizer=Adam(lr=learning_rate), loss='binary_crossentropy', metrics=['accuracy'], experimental_steps_per_execution = 50)

The summary of my model: [image: model summary]

Is there anything I still have to consider or adjust?

Andrey

1 Answer


You need to create a TPU strategy:

strategy = tf.distribute.TPUStrategy(resolver)

Then build and compile the model inside the strategy's scope:

with strategy.scope():
  model = create_model()
  model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['sparse_categorical_accuracy'])
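
Putting the question's initialization and the strategy together, a minimal end-to-end sketch could look like the following. It assumes a Colab runtime that exposes `COLAB_TPU_ADDR` (as in the question) and falls back to the default strategy when that variable is missing, so the same script also runs without a TPU. `make_strategy` and the tiny `create_model` are hypothetical stand-ins, not the asker's actual code.

```python
import os

import tensorflow as tf


def make_strategy():
    """Return a TPUStrategy on a Colab TPU runtime, else the default strategy."""
    tpu_addr = os.environ.get("COLAB_TPU_ADDR")
    if tpu_addr is None:
        # Assumption: no TPU runtime attached, so fall back to the default
        # (single-device) strategy to keep the sketch runnable anywhere.
        return tf.distribute.get_strategy()
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
        tpu="grpc://" + tpu_addr
    )
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.TPUStrategy(resolver)


def create_model():
    # Hypothetical placeholder model; substitute the real architecture here.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(16,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])


strategy = make_strategy()

# Variables must be created inside the scope so they are placed on the replicas.
with strategy.scope():
    model = create_model()
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])

print("Replicas in sync:", strategy.num_replicas_in_sync)
```

On a TPU runtime the last line should report the number of TPU cores; if it reports 1, the model is probably not running on the TPU.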
Andrey
  • Thank you very much for the answer! How do I create the TPU strategy? Do you have a code snippet? –  Nov 05 '20 at 08:04
  • how do you handle this error `ResourceExhaustedError: 9 root error(s) found. (0) Resource exhausted: {{function_node __inference_train_function_14917}} Compilation failure: Ran out of memory in memory space hbm. Used 8.29G of 7.48G hbm. Exceeded hbm capacity by 825.64M.` ? –  Nov 05 '20 at 08:51
  • Your model is huge. Try decreasing batch_size to 8. – Andrey Nov 05 '20 at 08:55
  • Sorry for the trouble. I tried batch_size = 8. Unfortunately, the error keeps recurring. –  Nov 05 '20 at 18:13
  • try batch_size = 1 – Andrey Nov 05 '20 at 18:43
  • you have no choice other than simplifying your model :( – Andrey Nov 05 '20 at 18:58
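
The batch-size suggestions in the comments are easier to judge with the arithmetic written out. Under a distribution strategy, the batch size given to Keras is the global batch, which is split across the replicas, and the HBM limit in the error message is presumably per core. The sketch below assumes 8 replicas (check `strategy.num_replicas_in_sync` for the real value) and, for simplicity, requires the global batch to divide evenly; `per_replica_batch_size` is a hypothetical helper, not a TensorFlow API.

```python
def per_replica_batch_size(global_batch_size: int, num_replicas: int = 8) -> int:
    """Return how many examples each replica processes per step."""
    if global_batch_size % num_replicas != 0:
        # Simplifying assumption of this sketch: only even splits are allowed.
        raise ValueError("global batch size should be a multiple of num_replicas")
    return global_batch_size // num_replicas


for global_batch in (64, 8):
    print(global_batch, "->", per_replica_batch_size(global_batch))
```

With 8 replicas, a global batch of 8 already means one example per core. If memory is still exhausted at that point, shrinking the batch further cannot help much, which is consistent with the advice to simplify the model.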