I had been using the AI Platform training API with tensor2tensor and the Cloud TPU backend successfully until a few days ago, but something seems to have changed and I haven't been able to get it to work since last week.
The only difference I can see in the logs between the working and non-working runs is the '_master' and '_evaluation_master' fields of the run config.
The log from the last successful training run shows something like this:
Using config: {
'_model_dir':...,
....,
'_master': 'grpc://10.228.38.186:8470',
'_evaluation_master': 'grpc://10.228.38.186:8470',
...
'_cluster': None, 'use_tpu': True
}
However, the logs I have been seeing since last week look like this:
Using config: {
'_model_dir': ...,
'_master': 'cmle-training-2190487948974557758-tpu',
'_evaluation_master': 'cmle-training-2190487948974557758-tpu',
...,
'_cluster': None, 'use_tpu': True
}
TensorFlow then tries to connect to the TPU by that host name, which eventually fails, and the process stops with:
Not found: No session factory registered for the given session options:
{
target: "cmle-training-4208055151697798232-tpu"
config: operation_timeout_in_ms: 300000
}
Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
The same code is used for both runs.
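For reference, this is roughly how I understand the bare TPU name is supposed to be turned into a grpc:// master address before a session is created. It is only a minimal sketch using the TF 1.x contrib APIs (TPUClusterResolver and tf.contrib.tpu.RunConfig); the tpu_name and model_dir values below are placeholders, not taken from my actual job.

# Minimal sketch (TF 1.x contrib APIs): resolve a Cloud TPU name to a
# grpc:// master address instead of handing the bare name to the session.
# The tpu_name and model_dir values are placeholders.
import tensorflow as tf

tpu_name = 'cmle-training-2190487948974557758-tpu'  # placeholder

# Without an explicit project/zone this relies on the training VM's
# GCE metadata server and credentials to look up the TPU node.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu=[tpu_name])
master = resolver.get_master()  # e.g. 'grpc://10.228.38.186:8470'

run_config = tf.contrib.tpu.RunConfig(
    master=master,
    evaluation_master=master,
    model_dir='gs://my-bucket/model',  # placeholder
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100),
)
print(run_config.master)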
If anybody has faced a similar issue, please guide me through it. Thanks!