I'm carrying out a federated learning process and use the function tff.learning.build_federated_averaging_process to create an iterative process for federated learning. As mentioned in the TFF tutorial, this function has two arguments called client_optimizer_fn and server_optimizer_fn, which, in my opinion, represent the optimizers for the client and the server, respectively. But in the FedAvg paper, it seems that only the clients carry out optimization, while the server only does the averaging operation. So what exactly is server_optimizer_fn doing, and what does its learning rate mean?
In McMahan et al., 2017, the clients communicate the model weights after local training to the server, which are then averaged and re-broadcast for the next round. No server optimizer is needed; the averaging step itself updates the global/server model.
tff.learning.build_federated_averaging_process takes a slightly different approach: the delta between the model weights the client received and the model weights after local training is sent back to the server. This delta can be thought of as a pseudo-gradient, allowing the server to apply it to the global model using standard optimization techniques. Reddi et al., 2020 delves into this formulation and how adaptive optimizers (Adagrad, Adam, Yogi) on the server can greatly improve convergence rates. Using SGD without momentum as the server optimizer, with a learning rate of 1.0, exactly recovers the method described in McMahan et al., 2017.
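A minimal sketch of how the two optimizers are wired together (the toy Keras model and input_spec below are placeholder assumptions, not part of the question):

```python
import tensorflow as tf
import tensorflow_federated as tff

def model_fn():
  # Placeholder model: a single dense layer over 784 features, 10 classes.
  keras_model = tf.keras.Sequential(
      [tf.keras.layers.Dense(10, input_shape=(784,))])
  return tff.learning.from_keras_model(
      keras_model,
      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
      input_spec=(tf.TensorSpec([None, 784], tf.float32),
                  tf.TensorSpec([None, 1], tf.int32)))

# Clients take local SGD steps; the server treats the averaged model delta as
# a pseudo-gradient. Server SGD with learning rate 1.0 reproduces plain FedAvg.
iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0))
```

Swapping server_optimizer_fn for an adaptive optimizer such as tf.keras.optimizers.Adam gives the adaptive server variants discussed in Reddi et al., 2020.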

Thank you for your answer, Zachary. In McMahan et al., 2017, two ways to implement federated learning are introduced: either computing the average gradients on the clients and sending them to the server for the averaging operation, or applying those gradients to each client model locally and sending the client models to the server for the averaging operation. Algorithm 1 of McMahan et al., 2017 uses the second way, while TFF uses the first way according to your reply. What confuses me is that, in my opinion, there should be only one learning rate no matter which way TFF uses: for the first way there should be only a server lr and no client lr, and for the second way there should be only a client lr and no server lr. As mentioned in McMahan et al., 2017, there is only one symbol, Eta, to represent the lr, not an Eta_client or Eta_server.

Perhaps the two methods being referred to in McMahan are the FedSGD algorithm and the FedAvg algorithm? FedSGD computes gradients without updating the client model, while FedAvg takes many SGD steps locally (updating the client model) before sending back a new model (or model delta). You're quite right that the former only has a server learning rate. In the paper, the latter effectively has a server learning rate of `1.0`: no scaling of the client updates (also the default of `tff.learning.build_federated_averaging_process`). – Zachary Garrett Aug 23 '20 at 03:27
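To make the "server learning rate of 1.0" point concrete, here is a small NumPy illustration (not TFF code; the weight values and the delta sign convention are made up for the example) showing that a server SGD step with learning rate 1.0 on the averaged pseudo-gradient is exactly plain weight averaging:

```python
import numpy as np

server_weights = np.array([0.5, -1.0, 2.0])
# Hypothetical post-training client weights for one round.
client_weights = [np.array([0.4, -0.8, 2.2]),
                  np.array([0.6, -1.2, 1.8])]

# Each client reports delta = initial_weights - trained_weights
# (sign convention chosen so the delta acts directly as a pseudo-gradient).
deltas = [server_weights - w for w in client_weights]
avg_delta = np.mean(deltas, axis=0)

# One SGD step on the server with learning rate 1.0.
server_lr = 1.0
updated = server_weights - server_lr * avg_delta

# With lr = 1.0 this is exactly the simple weight average of McMahan et al., 2017.
assert np.allclose(updated, np.mean(client_weights, axis=0))
```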