
TensorFlow Estimator is easy to use for distributed training with the parameter server strategy, but I cannot get prediction to work with that strategy, and I cannot find any resource that covers this part.

Prediction sample code:

    run_config = tf.estimator.RunConfig()
    model = tf.estimator.Estimator(
        model_fn=self.model_fn,
        model_dir=self._config.model_path,
        config=run_config,
        params=self.params())
    results = model.predict(
        input_fn=lambda: test_data.build(
            batch_size=self._config.eval_batch_size,
            num_epochs=1))

TF_CONFIG:

    {'task': {'index': '0', 'type': 'ps'}, 'cluster': {'chief': ['127.0.0.1:2320'], 'ps': ['127.0.0.1:2220', '127.0.0.1:2221']}}
    {'task': {'index': '1', 'type': 'ps'}, 'cluster': {'chief': ['127.0.0.1:2320'], 'ps': ['127.0.0.1:2220', '127.0.0.1:2221']}}
    {'task': {'index': '0', 'type': 'chief'}, 'cluster': {'chief': ['127.0.0.1:2320'], 'ps': ['127.0.0.1:2220', '127.0.0.1:2221']}}
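
For reference, each of these is exported by its own process before the Estimator is created, roughly like this (the cluster dict is the one above):

    import json
    import os

    cluster = {'chief': ['127.0.0.1:2320'],
               'ps': ['127.0.0.1:2220', '127.0.0.1:2221']}

    # e.g. for the first PS process; the chief and the second PS
    # each set their own task dict
    os.environ['TF_CONFIG'] = json.dumps(
        {'task': {'index': '0', 'type': 'ps'}, 'cluster': cluster})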

Result: both the PS and the worker processes ran prediction.

Any suggestion? Thanks a lot.

nolan liou

1 Answer


In Estimator's predict, every PS and worker uses a MonitoredSession to start a standalone node that restores from an existing checkpoint, which is why every one of your processes ran its own prediction. To do a distributed prediction instead, you can follow what Estimator does for training:

  • Start the PS servers and have them block on join().
  • On each worker, create a MonitoredTrainingSession instead of a MonitoredSession (see the sketch after this list).
    • Remember to start the worker's server first.
  • Note that estimator.predict takes the path of a checkpoint file, while MonitoredTrainingSession takes a checkpoint directory.
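
A minimal sketch of these steps, assuming a standard Estimator-style model_fn (one that returns an EstimatorSpec with predictions) and an input_fn that runs for one epoch; distributed_predict is a hypothetical helper name:

    import json
    import os

    import tensorflow as tf

    def distributed_predict(model_fn, input_fn, checkpoint_dir, params):
        tf_config = json.loads(os.environ["TF_CONFIG"])
        cluster = tf.train.ClusterSpec(tf_config["cluster"])
        task_type = tf_config["task"]["type"]
        task_index = int(tf_config["task"]["index"])

        # Start this task's gRPC server (for PS and worker/chief alike).
        server = tf.train.Server(cluster, job_name=task_type,
                                 task_index=task_index)

        if task_type == "ps":
            server.join()  # PS tasks only serve variables; block here.
            return

        # Worker/chief: place variables on the PS tasks, ops on this worker.
        with tf.device(tf.train.replica_device_setter(cluster=cluster)):
            features = input_fn()
            spec = model_fn(features=features, labels=None,
                            mode=tf.estimator.ModeKeys.PREDICT, params=params)
            tf.train.get_or_create_global_step()  # default hooks expect it

        # Unlike estimator.predict (checkpoint file path), this takes the
        # checkpoint *directory* and restores the latest checkpoint in it.
        with tf.train.MonitoredTrainingSession(
                master=server.target,
                is_chief=(task_type == "chief"),
                checkpoint_dir=checkpoint_dir,
                save_checkpoint_secs=None,   # don't write new checkpoints
                save_summaries_steps=None,
                save_summaries_secs=None) as sess:
            while not sess.should_stop():    # ends when the input is exhausted
                print(sess.run(spec.predictions))

Disabling the saver and summary hooks keeps prediction from touching the existing checkpoints; the session exits cleanly once the one-epoch input pipeline raises OutOfRangeError.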

With that in place you can start all the servers and run a distributed prediction, although there will be warnings, such as the global step not increasing.

Detailed code on GitHub

chunyang.wen