I'm having problems fitting my model on a TPU on Kaggle.

The TPU is already initialized:

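import tensorflow as tf

# Detect the TPU; TPUClusterResolver raises ValueError when none is available.
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print(f'Running on TPU {tpu.master()}')
except ValueError:
    tpu = None

if tpu:
    # Connect to the TPU cluster and initialize it before building the strategy.
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Fall back to the default strategy (CPU or single GPU).
    strategy = tf.distribute.get_strategy()

AUTO = tf.data.experimental.AUTOTUNE
REPLICAS = strategy.num_replicas_in_sync
print(f'REPLICAS: {REPLICAS}')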

But when I try to fit my model, this error is raised:

{{function_node __inference_train_function_64094}} failed to connect to all addresses
GRPC error information:{"created":"@1609444822.190891136","description":"Failed to pick
subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc",
"file_line":3959,"referenced_errors": [{"created":"@1609444822.190889693"
,"description":"failed to connect to all addresses", […] 
[[{{node MultiDeviceIteratorGetNextFromShard}}]] [[RemoteCall][[IteratorGetNextAsOptional]]
rdn

1 Answer

You have to create and compile your model (which also creates the optimizer) inside the strategy scope, so that its variables are placed on the TPU replicas:

with strategy.scope():
    # create_model() is your own function that builds the Keras model.
    model = create_model()
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['sparse_categorical_accuracy'])
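
For reference, here is a minimal, self-contained sketch of the same pattern that can be tried without a TPU. The create_model() body, the (28, 28) input shape, the 10 output classes and the build_compiled_model() helper are all assumptions made up for illustration, not taken from the question; substitute your own model. It falls back to the default strategy, so the scope logic can be checked on CPU before moving to the TPU.

```python
import tensorflow as tf


def create_model(num_classes: int = 10) -> tf.keras.Model:
    """Hypothetical stand-in for your own model; it outputs raw logits."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),  # assumed input shape
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        # No softmax here because the loss below uses from_logits=True.
        tf.keras.layers.Dense(num_classes),
    ])


def build_compiled_model(strategy: tf.distribute.Strategy) -> tf.keras.Model:
    """Create and compile the model inside the strategy scope."""
    with strategy.scope():
        model = create_model()
        model.compile(
            optimizer=tf.keras.optimizers.Adam(),
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["sparse_categorical_accuracy"],
        )
    return model


# Default strategy is used here; on Kaggle, pass your TPU strategy instead.
strategy = tf.distribute.get_strategy()
model = build_compiled_model(strategy)
```

Only model creation and compile() need to be inside the scope; model.fit() can then be called outside it as usual.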
Andrey