I'm new to ML. While trying to do digit recognition on a TPU, I ran into the following problems.

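import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (InputLayer, Dropout, Conv2D, LeakyReLU,
                                     BatchNormalization, MaxPooling2D,
                                     Flatten, Dense)

# Connect to the TPU cluster and initialize the TPU system
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)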
with strategy.scope():
    Model = Sequential([

        InputLayer((28, 28, 1)),
        Dropout(0.1),
        Conv2D(128, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Conv2D(64, 3, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        MaxPooling2D(2, 2),
        Flatten(),
        Dense(128, use_bias=False),
        LeakyReLU(0.05),
        BatchNormalization(),
        Dense(10, activation='softmax')

    ])

with strategy.scope():
    Model.compile(optimizer='adam',
                  loss='categorical_crossentropy', metrics=['accuracy'])

Training then fails with:

CancelledError: 4 root error(s) found.
  (0) Cancelled:  Operation was cancelled
     [[node IteratorGetNextAsOptional_1 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
  (1) Cancelled:  Iterator was cancelled
     [[node IteratorGetNextAsOptional_6 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
  (2) Cancelled:  Operation was cancelled
     [[node IteratorGetNextAsOptional_3 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
  (3) Cancelled:  Iterator was cancelled
     [[node IteratorGetNextAsOptional_5 (defined at <ipython-input-31-44edcf0f3ea7>:3) ]]
0 successful operations.
5 derived errors ignored. [Op:__inference_train_function_23675]

Function call stack:
train_function -> train_function -> train_function -> train_function

Then I ran it again and got:

UnavailableError: 9 root error(s) found.
  (0) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[cond_11/switch_pred/_107/_78]]
  (1) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[TPUReplicate/_compile/_7290104207349758044/_4/_178]]
  (2) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[tpu_compile_succeeded_assert/_13543899577889784813/_5/_281]]
  (3) Unavailable:  failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629436055.354219684","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629436055.354217763","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[strided_slice_37 ... [truncated] [Op:__inference_train_function_6939]

Function call stack:
train_function -> train_function -> train_function -> train_function

I must be missing a strategy.scope() somewhere.

I succeeded in other notebooks, but they all use tf.data.Dataset.

Still, I can't figure this one out.

Full code is at https://www.kaggle.com/dacianpeng/digit-hello-world?scriptVersionId=72464286

Version 6 is the TPU version; it differs from Version 5 only by the code above.

Dacian Peng

2 Answers

It looks like you are storing your training data locally, which is causing the issue: TPUs read training data exclusively from GCS (Google Cloud Storage); see details here.

You can also check this Stack Overflow post: Colab TPU Error when calling model.fit() : UnimplementedError.

Gagik

I fixed the problem by switching the inputs to tf.data.Dataset (without GCS).

Calling fit() with a local tf.data.Dataset works. But it fails with "Unavailable: failed to connect to all addresses" once ImageDataGenerator() is used.

# Fixed by switching to tf.data.Dataset
# (prefetch(-1) is the same as prefetch(tf.data.AUTOTUNE))

ds1 = tf.data.Dataset.from_tensor_slices((DS1, L1)).batch(128).prefetch(-1)
ds2 = tf.data.Dataset.from_tensor_slices((DS2, L2)).batch(128).prefetch(-1)

...
...


History = Model.fit(ds1, epochs=Epochs,validation_data=ds2,
                    callbacks=[ReduceLR, Stop], verbose=1)

# Epoch time is not stable: sometimes faster, sometimes slower,
# but most of the time it is about the same as on GPU
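
For reference, the working input pipeline can be sketched end to end as below. This is a minimal sketch under stated assumptions, not the notebook's actual code: the random arrays only stand in for DS1/L1, make_dataset is a hypothetical helper, and the shuffle and drop_remainder=True steps are additions not present in the original (fixed batch shapes are generally friendlier to TPU compilation).

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-ins for DS1 / L1 (assumption: float32 images, one-hot labels).
DS1 = np.random.rand(1000, 28, 28, 1).astype("float32")
L1 = tf.keras.utils.to_categorical(np.random.randint(0, 10, size=1000), 10)


def make_dataset(images, labels, batch_size=128, training=True):
    """Hypothetical helper: build an in-memory tf.data pipeline."""
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    if training:
        # Shuffle over the whole (small) dataset each epoch.
        ds = ds.shuffle(len(images))
    # Dropping the last partial batch keeps every training batch the same shape.
    ds = ds.batch(batch_size, drop_remainder=training)
    return ds.prefetch(tf.data.AUTOTUNE)


ds1 = make_dataset(DS1, L1)
print(ds1.element_spec)
```

The resulting ds1 can be passed to Model.fit() exactly as in the call above.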

It fails once ImageDataGenerator() is used:

# Fails again with ImageDataGenerator()

ds1=tf.data.Dataset.from_generator(lambda:ImageModifier.flow(DS1,L1),output_signature=(
    tf.TensorSpec(shape=(28,28,1), dtype=tf.float32),
    tf.TensorSpec(shape=(10), dtype=tf.float32))
).batch(128).prefetch(-1)

History = Model.fit(ds1, epochs=Epochs, verbose=1)
---------------------------------------------------------------------------
UnavailableError                          Traceback (most recent call last)
<ipython-input-107-149f17c4776c> in <module>
      1 Epochs = 15
----> 2 History = Model.fit(ds1, epochs=Epochs, verbose=1)

/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1100               tmp_logs = self.train_function(iterator)
   1101               if data_handler.should_sync:
-> 1102                 context.async_wait()
   1103               logs = tmp_logs  # No error, now safe to assign to logs.
   1104               end_step = step + data_handler.step_increment

/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in async_wait()
   2328   an error state.
   2329   """
-> 2330   context().sync_executors()
   2331 
   2332 

/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/context.py in sync_executors(self)
    643     """
    644     if self._context_handle:
--> 645       pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
    646     else:
    647       raise ValueError("Context is not initialized.")

UnavailableError: 4 root error(s) found.
  (0) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[Pad_2/paddings/_130]]
  (1) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[strided_slice_36/_238]]
  (2) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[IteratorGetNextAsOptional_3/_35]]
  (3) Unavailable: {{function_node __inference_train_function_369954}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1629445773.854930794","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":4143,"referenced_errors":[{"created":"@1629445773.854928997","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
0 successful operations.
5 derived errors ignored.
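
A possible workaround, which I have not verified on the TPU here, is to keep the augmentation inside the tf.data pipeline instead of going through from_generator(). from_generator() runs the Python generator in the local Python process, which may be why the TPU input pipeline cannot reach it. The sketch below is an assumption-laden stand-in: the ImageModifier settings are not shown above, so the random shift here (pad to 32x32, then random-crop back to 28x28) is only an illustrative augmentation, and augment / make_augmented_dataset are hypothetical names.

```python
import numpy as np
import tensorflow as tf


def augment(image, label):
    """Illustrative augmentation: random shift of up to 2 pixels."""
    # Zero-pad 28x28 -> 32x32, then crop a random 28x28 window.
    image = tf.image.resize_with_crop_or_pad(image, 32, 32)
    image = tf.image.random_crop(image, size=[28, 28, 1])
    return image, label


def make_augmented_dataset(images, labels, batch_size=128):
    """Hypothetical helper: in-memory dataset with TF-native augmentation."""
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.shuffle(len(images))
    ds = ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds.prefetch(tf.data.AUTOTUNE)


# Synthetic stand-ins for DS1 / L1.
images = np.random.rand(256, 28, 28, 1).astype("float32")
labels = np.zeros((256, 10), dtype="float32")
aug_ds = make_augmented_dataset(images, labels)
x, y = next(iter(aug_ds))
print(x.shape, y.shape)
```

Because the augmentation uses only TensorFlow ops, the pipeline contains no Python callback, unlike the from_generator() version.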
Dacian Peng