I had been using the AI Platform training API with tensor2tensor and the Cloud TPU backend successfully until a few days ago, but something seems to have changed and I haven't been able to get it to work since last week.
The only difference I can see in the logs between the working and non-working runs is the '_master' and '_evaluation_master' fields of the run config.
The log from the last successful training run shows something like this:
Using config: {
'_model_dir':...,
....,
'_master': 'grpc://10.228.38.186:8470',
'_evaluation_master': 'grpc://10.228.38.186:8470',
...
'_cluster': None, 'use_tpu': True
}
However, the logs I have been seeing since last week look like this:
Using config: {
'_model_dir': ...,
'_master': 'cmle-training-2190487948974557758-tpu',
'_evaluation_master': 'cmle-training-2190487948974557758-tpu',
...,
'_cluster': None, 'use_tpu': True
}
TensorFlow then tries to connect to the TPU by that host name, which eventually fails, and the process stops with:
Not found: No session factory registered for the given session options:
{
target: "cmle-training-4208055151697798232-tpu"
config: operation_timeout_in_ms: 300000
}
Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
The same code is used for both runs.
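For reference, this is roughly how I understand the bare TPU name is supposed to be turned into a grpc:// master address before a session is created. It is only a minimal sketch using the TF 1.x contrib APIs (TPUClusterResolver and tf.contrib.tpu.RunConfig); the tpu_name and model_dir values below are placeholders, not taken from my actual job.

# Minimal sketch (TF 1.x contrib APIs): resolve a Cloud TPU name to a
# grpc:// master address instead of handing the bare name to the session.
# The tpu_name and model_dir values are placeholders.
import tensorflow as tf

tpu_name = 'cmle-training-2190487948974557758-tpu'  # placeholder

# Without an explicit project/zone this relies on the training VM's
# GCE metadata server and credentials to look up the TPU node.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu=[tpu_name])
master = resolver.get_master()  # e.g. 'grpc://10.228.38.186:8470'

run_config = tf.contrib.tpu.RunConfig(
    master=master,
    evaluation_master=master,
    model_dir='gs://my-bucket/model',  # placeholder
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100),
)
print(run_config.master)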
If anybody has faced a similar issue, please guide me through it. Thanks!