
I am trying to use a TPU on GCP with TensorFlow 2.1 and the Keras API. Unfortunately, I am stuck after creating the TPU node. My VM seems to "see" the TPU, but it cannot connect to it.

The code I am using:

import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(TPU_name)
print('Running on TPU ', resolver.master())
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

The code hangs at `tf.config.experimental_connect_to_cluster(resolver)`: I receive a few log messages and then nothing more, so I do not know what the issue could be. I therefore suspect a connection problem between the VM and the TPU.

The messages:

2020-04-22 15:46:25.383775: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-04-22 15:46:25.992977: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2020-04-22 15:46:26.042269: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5636e4947610 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-22 15:46:26.042403: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-04-22 15:46:26.080879: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
E0422 15:46:26.263937297 2263 socket_utils_common_posix.cc:198] check for SO_REUSEPORT: {"created":"@1587570386.263923266","description":"SO_REUSEPORT unavailable on compiling system","file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":166}
2020-04-22 15:46:26.269134: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.163.38.90:8470}
2020-04-22 15:46:26.269192: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:32263}
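To test raw TCP reachability of the TPU's gRPC endpoint, a quick socket probe like the sketch below can help (the IP and port are taken from the GrpcChannelCache line in the log above; substitute your own TPU address):

import socket

# Probe the TPU's gRPC endpoint; the address below comes from the log
# above and must be replaced with your own TPU's IP and port.
TPU_ADDR = ("10.163.38.90", 8470)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5)
try:
    s.connect(TPU_ADDR)
    print("TCP connection to the TPU endpoint succeeded")
except OSError as e:
    print("Could not reach the TPU endpoint:", e)
finally:
    s.close()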

Moreover, I am using the "Deep Learning" image from GCP, so I should not need to install anything, right?

Does anyone have the same issue with TF 2.1?

P.S.: the same code works fine on Kaggle and Colab.

Shiro

2 Answers


Trying to reproduce this, I used `ctpu up --zone=europe-west4-a --disk-size-gb=50 --machine-type=n1-standard-8 --tf-version=2.1` to create the VM and TPU. I then ran your code, and it succeeded.

taylanbil@taylanbil:~$ python3 run.py 
Running on TPU  grpc://10.240.1.2:8470
2020-04-28 19:18:32.597556: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-28 19:18:32.627669: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000189999 Hz
2020-04-28 19:18:32.630719: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x471b980 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-28 19:18:32.630759: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-28 19:18:32.665388: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2020-04-28 19:18:32.665439: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33355}
2020-04-28 19:18:32.683216: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2020-04-28 19:18:32.683268: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33355}
2020-04-28 19:18:32.690405: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:33355
taylanbil@taylanbil:~$ cat run.py 
import tensorflow as tf
TPU_name='taylanbil'
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(TPU_name)
print('Running on TPU ', resolver.master())
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
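For completeness, once the strategy exists, model building typically goes inside its scope; a minimal sketch continuing the run.py above (the tiny Dense model is only an illustration, not part of the original script):

# Illustration only: Keras model construction goes inside the strategy's
# scope so that its variables are placed on the TPU workers.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")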

How did you create your TPU resources? Can you double-check that there is no version mismatch between the VM and the TPU?
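For instance, a minimal check on the VM side (the version printed should match the `--tf-version` the TPU node was created with, e.g. 2.1 here):

import tensorflow as tf

# The VM's TF version should match the --tf-version used when creating
# the TPU node; a mismatch can make the connection hang.
print(tf.__version__)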

taylanbil
  • Hi, thanks for your answer. It seems my issue probably comes from the subnetwork I used when creating the VM: it has some firewall rules that do not allow communication with the TPU, I think. I need to wait for the firewall rule to be updated (I do not have the permission to do that where I work). – Shiro Apr 28 '20 at 20:13
  • OK, that makes sense. In that case, if you need further assistance, I suggest you file a bug via the Google Cloud UI. – taylanbil Apr 29 '20 at 17:34
  • @Shiro can you provide more detail on how you solved the subnetwork problem? I'm having a similar issue: building the TPU+VM with `ctpu up` does not help, and creating a firewall rule allowing everything does not work either... – Astariul Jun 22 '20 at 07:19
  • @Astariul Hi, are you using a subnetwork when creating your VM? If so, the easiest fix is to create a VM without a subnetwork; then you should not have any issue connecting to the TPU. – Shiro Jun 22 '20 at 08:52
  • @Shiro thanks for the fast answer! I'm not aware of using any subnetwork. My command is: `ctpu up --zone=europe-west4-a --disk-size-gb=50 --machine-type=n1-standard-2 --tf-version=2.2 --tpu-size v3-8 --name mytpu` – Astariul Jun 22 '20 at 12:26
  • @Astariul I have never used the command line; until now I was only using the UI. But from what I understand, you are creating a VM with TF preinstalled and attaching a TPU, so there should not be any issue. Can you double-check that the IP address and port of the TPU are not misspelled in your Python script? – Shiro Jun 22 '20 at 13:46
  • @Shiro thanks for your help. I double-checked and the TPU IP is fine. I can see the TPU from the VM using `nmap -Pn -p8470 TPUIP`, but I simply cannot connect to it... – Astariul Jun 22 '20 at 23:38
  • Can you show a small piece of your script, just in case? I do not understand why it is not working; it seems very strange. – Shiro Jun 23 '20 at 16:13

I created my VM + TPU with `ctpu up --zone=europe-west4-a --disk-size-gb=50 --machine-type=n1-standard-2 --tf-version=2.2 --tpu-size v3-8 --name cola-tpu`.

But I still couldn't access the TPU; it hung just as the OP described.

I opened an issue with Google support and got this answer there:

This is a known issue that occurs sometimes, and the product team is currently trying to fix it.

In the meantime, let me propose some troubleshooting steps:

1- Disable and then re-enable the TPU API

If this does not work:

2.1- Go to VPC network > VPC network peering

2.2- Check whether `cp-to-tp-peeringdefault[somenumbers]` has INACTIVE status.

2.3- If it does, delete it and create a TPU node again.

Please let us know if any of this worked for you, so that we can close this ticket (in case it did) or keep providing support (in case it did not).

For me, deleting `cp-to-tp-peeringdefault` and recreating the VM + TPU worked.
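To check step 2.2 without going through the Cloud Console, the peerings can also be listed via the gcloud CLI; a small sketch, assuming gcloud is installed and authenticated on the VM and that the TPU peering lives on the default network:

import subprocess

# List VPC network peerings so that an INACTIVE cp-to-tp-peering entry
# stands out (assumes gcloud is available and the "default" network).
result = subprocess.run(
    ["gcloud", "compute", "networks", "peerings", "list", "--network=default"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)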

Astariul