
I can open a ctpu session and pull the code I need from my git repository, but when I run my TensorFlow code from the Cloud Shell, I get a message saying that no TPU can be found, and my program crashes. Here is the error message:

adrien_doerig@adrien-doerig:~/capser$ python TPU_playground.py
(unset)
INFO:tensorflow:Querying Tensorflow master () for TPU system metadata.
2018-07-16 09:45:49.951310: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:Failed to find TPU: _TPUSystemMetadata(num_cores=0, num_hosts=0, num_of_cores_per_host=0, topology=None, devices=[_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456)])
Traceback (most recent call last):
File "TPU_playground.py", line 79, in <module>
capser.train(input_fn=train_input_fn_tpu, steps=n_steps)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 363, in train
hooks.extend(self._convert_train_steps_to_hooks(steps, max_steps))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2068, in _convert_train_steps_to_hooks
if ctx.is_running_on_cpu():
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_context.py", line 339, in is_running_on_cpu
self._validate_tpu_configuration()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_context.py", line 525, in _validate_tpu_configuration
'are {}.'.format(tpu_system_metadata.devices))
RuntimeError: Cannot find any TPU cores in the system. Please double check Tensorflow master address and TPU worker(s). Available devices are [_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU

When I open another shell and enter "ctpu status", I see that my TPU cluster is running, but I get the following panic error:

adrien_doerig@capser-210106:~$ ctpu status

Your cluster is running!

    Compute Engine VM:  RUNNING
    Cloud TPU:          RUNNING

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x671b7e]
goroutine 1 [running]:
github.com/tensorflow/tpu/tools/ctpu/commands.(*statusCmd).Execute(0xc4200639e0, 0x770040, 0xc4200160d0, 0xc4200568a0, 0x0, 0x0, 0x0, 0x6dddc0)
    /tmp/ctpu-release/src/github.com/tensorflow/tpu/tools/ctpu/commands/status.go:214 +0x5ce
github.com/google/subcommands.(*Commander).Execute(0xc420070000, 0x770040, 0xc4200160d0, 0x0, 0x0, 0x0, 0x5)
    /tmp/ctpu-release/src/github.com/google/subcommands/subcommands.go:141 +0x29f
github.com/google/subcommands.Execute(0x770040, 0xc4200160d0, 0x0, 0x0, 0x0, 0xc420052700)
    /tmp/ctpu-release/src/github.com/google/subcommands/subcommands.go:385 +0x5f
main.main()
    /tmp/ctpu-release/src/github.com/tensorflow/tpu/tools/ctpu/main.go:87 +0xd5e

I tried the troubleshooting steps suggested here: https://cloud.google.com/tpu/docs/troubleshooting, but they did not help, because everything looks normal when I enter

gcloud compute tpus list

I have also tried creating a whole new project, and even using a different Google account, but that didn't solve the problem either. I haven't found any similar errors regarding Cloud TPUs. Am I missing something obvious?

Thank you for your help!

liamdalton

2 Answers


Ok, I figured it out:

I needed to add a master=... parameter to my RunConfig, as shown below (the master=... line in the following code):

import os
import tensorflow as tf
from tensorflow.contrib.cluster_resolver import TPUClusterResolver
from tensorflow.contrib.tpu.python.tpu import tpu_config

my_tpu_run_config = tpu_config.RunConfig(
    # Resolve the gRPC address of the TPU named in $TPU_NAME.
    master=TPUClusterResolver(tpu=[os.environ['TPU_NAME']]).get_master(),
    model_dir=FLAGS.model_dir,
    save_checkpoints_secs=FLAGS.save_checkpoints_secs,
    save_summary_steps=FLAGS.save_summary_steps,
    session_config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True),
    tpu_config=tpu_config.TPUConfig(iterations_per_loop=FLAGS.iterations, num_shards=FLAGS.num_shards))
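
For context, here is a minimal sketch of how such a RunConfig gets wired into a TPUEstimator. The model_fn and batch-size flag below are hypothetical placeholders, not from my actual code; train_input_fn_tpu and n_steps are the names that appear in the traceback above.

from tensorflow.contrib.tpu.python.tpu import tpu_estimator

# Sketch only: my_model_fn and FLAGS.batch_size are hypothetical placeholders.
capser = tpu_estimator.TPUEstimator(
    model_fn=my_model_fn,        # hypothetical model function
    config=my_tpu_run_config,    # the RunConfig defined above
    use_tpu=True,
    train_batch_size=FLAGS.batch_size)

capser.train(input_fn=train_input_fn_tpu, steps=n_steps)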

Now, the panic error still comes up when I enter 'ctpu status' (from another shell, where the virtual machine is not running), but I can run my code on the cloud TPUs anyway, i.e., the first error message from my original post no longer occurs. So the master=... parameter lets me run my programs; I am still not sure what the panic error means, but it may be unimportant.
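
A quick way to sanity-check the resolver before building the RunConfig is to print the address it returns (a sketch, assuming TPU_NAME is set in the environment):

import os
from tensorflow.contrib.cluster_resolver import TPUClusterResolver

# If this prints a grpc://... address, the resolver can see the TPU;
# if it raises a KeyError, TPU_NAME is not set in this shell.
resolver = TPUClusterResolver(tpu=[os.environ['TPU_NAME']])
print(resolver.get_master())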

mlu

The panic in ctpu can be ignored for now; it is caused by a failure to check whether the SchedulingConfig field in the TPU Node object returned from the Cloud TPU REST API is populated (and therefore not nil). This is fixed by this PR:

https://github.com/tensorflow/tpu/pull/148

and the noise will go away once this PR is incorporated into Google Cloud Shell.

liamdalton