This is an excellent question, and it highlights some of the intricacies of the federated setting.
In short, unfortunately, there is no single answer here except: it depends. Let's take a few examples.
In the paper "Improving Federated Learning Personalization via Model Agnostic Meta Learning", it is argued that for a personalization application, evaluation should be weighted at the per-client level, independent of how much data each client holds. This argument is intuitively reasonable: supposing we are using federated personalization in a mobile application, we may wish to optimize for the average future user's experience, which is better modeled by the per-client weighted average than by the per-example weighted average. That is, we do not wish to make our application work better for those who use it more; rather, we wish to make it work better on average across users. Further, that paper employs a 4-way split: clients are first partitioned into train and test clients, and then the data on each client is partitioned into data used for the personalization task and data on which the personalized model is evaluated.
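As a rough sketch (not the paper's actual code), that split might look something like the following in plain Python, assuming `client_data` is a dict mapping client IDs to lists of examples (both hypothetical names):

```python
import random


def four_way_split(client_data, client_test_fraction=0.2, eval_fraction=0.5, seed=0):
  """Partitions clients into train/test, then splits each client's examples
  into a personalization set and an evaluation set."""
  rng = random.Random(seed)
  client_ids = list(client_data)
  rng.shuffle(client_ids)
  num_test_clients = int(len(client_ids) * client_test_fraction)
  test_clients = client_ids[:num_test_clients]
  train_clients = client_ids[num_test_clients:]

  def split_examples(examples):
    examples = list(examples)
    rng.shuffle(examples)
    cutoff = int(len(examples) * (1 - eval_fraction))
    # (data used to personalize, data used to evaluate the personalized model)
    return examples[:cutoff], examples[cutoff:]

  return {
      'train': {cid: split_examples(client_data[cid]) for cid in train_clients},
      'test': {cid: split_examples(client_data[cid]) for cid in test_clients},
  }
```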
This may be fundamentally different from the concerns present in another problem domain. For example, in the cross-silo FL setting, one might imagine that samples are drawn from identical distributions, yet for some reason one silo holds more data than the others. One could imagine a medical environment here (making the rather unrealistic assumption that there are no latent factors at play), where we assume that e.g. medical images are sampled from the same distribution, but one larger provider simply has more of them. In this setting I think it is reasonable to evaluate the trained model on a per-example basis, since the user-client mapping breaks down: the population over which we wish to deploy our model maps better to "example" than to "client" here (the client, of course, mapping to the silo in this setting).
I think other problem settings would call for other evaluation strategies, including things like median accuracy across clients or minimum accuracy across clients.
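To make the difference between these aggregation choices concrete, here is a small illustration over hypothetical per-client results (the names and numbers are made up):

```python
import numpy as np

# Hypothetical per-client evaluation results: (accuracy, num_examples).
client_results = [(0.90, 10), (0.80, 200), (0.50, 5)]
accuracies = np.array([acc for acc, _ in client_results])
num_examples = np.array([n for _, n in client_results])

# Per-client weighting: every client counts equally.
uniform_accuracy = accuracies.mean()                                      # ~0.733

# Per-example weighting: clients holding more data count proportionally more.
example_weighted_accuracy = np.average(accuracies, weights=num_examples)  # ~0.798

# Robustness/fairness-oriented summaries mentioned above.
median_accuracy = np.median(accuracies)   # 0.80
worst_client_accuracy = accuracies.min()  # 0.50
```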
As in any data-science or ML application, in FL we should think hard about exactly what we are trying to optimize for, and tailor our evaluation to that objective. I think the main difference in FL is that this issue surfaces up front, which in my view is a feature of the framework.
In TensorFlow Federated, the way metrics are computed and aggregated across clients can be adjusted by changing the `federated_output_computation` attribute on your `tff.learning.Model`, then passing this model (or rather, a model-building function) to `tff.learning.build_federated_evaluation`.
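For example, a custom `federated_output_computation` is essentially a `tff.federated_computation` over the CLIENTS-placed local metrics. The sketch below elides the rest of the `tff.learning.Model` and assumes its local outputs expose `accuracy` and `num_examples` fields (hypothetical names); the point is just that the weight passed to `tff.federated_mean` is what switches between per-client and per-example averaging:

```python
import collections

import tensorflow as tf
import tensorflow_federated as tff

# Type of the model's local metrics, placed at the clients. The field names
# are hypothetical; they would need to match what report_local_outputs()
# returns on your tff.learning.Model.
LOCAL_METRICS_AT_CLIENTS = tff.FederatedType(
    collections.OrderedDict(accuracy=tf.float32, num_examples=tf.float32),
    tff.CLIENTS)


@tff.federated_computation(LOCAL_METRICS_AT_CLIENTS)
def aggregate_metrics(metrics):
  return collections.OrderedDict(
      # Uniform weighting: every client contributes equally.
      uniform_accuracy=tff.federated_mean(metrics.accuracy),
      # Example weighting: clients with more data contribute more.
      example_weighted_accuracy=tff.federated_mean(
          metrics.accuracy, metrics.num_examples))
```

In a hand-written `tff.learning.Model` subclass, returning a computation like this from the `federated_output_computation` property should then determine what `tff.learning.build_federated_evaluation` reports, without changing any of the on-client metric logic.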