Suppose we want to minimize the following objective with respect to alpha using gradient descent:
min f(alpha * v + (1 - alpha) * w)
where v and w are the weights of two models, and alpha is the mixing weight, constrained to [0, 1], of the weighted sum that produces the combined model v̄ (here referred to as m).
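By the chain rule, the gradient of this objective with respect to alpha is the inner product of the loss gradient at the combined weights and the difference of the two weight vectors, which as far as I can tell is the same expression the APFL paper uses for its alpha update:
\frac{\partial}{\partial \alpha} f\big(\alpha v + (1 - \alpha) w\big) = \big\langle \nabla f(\bar{v}),\ v - w \big\rangle, \qquad \bar{v} = \alpha v + (1 - \alpha) w
Computing the combination itself in TFF is straightforward: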
import tensorflow as tf
import tensorflow_federated as tff

# Mixing weight, re-clipped to [0, 1] after every optimizer update.
alpha = tf.Variable(0.01, name='Alpha', constraint=lambda t: tf.clip_by_value(t, 0, 1))
w_weights = tff.learning.ModelWeights.from_model(w)
v_weights = tff.learning.ModelWeights.from_model(v)
m_weights = tff.learning.ModelWeights.from_model(m)
# Weighted sum of the trainable variables of v and w, written into m.
m_weights_trainable = tf.nest.map_structure(
    lambda v_t, w_t: alpha * v_t + (1.0 - alpha) * w_t,
    v_weights.trainable, w_weights.trainable)
tf.nest.map_structure(lambda var, t: var.assign(t), m_weights.trainable, m_weights_trainable)
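As an aside, that final assign turns out to matter below: assignment to a tf.Variable is a stateful op that GradientTape does not differentiate through. A minimal standalone repro of that behavior:

x = tf.Variable(1.0)
a = tf.Variable(0.5)
with tf.GradientTape() as tape:
    x.assign(a * 2.0)  # stateful op, not recorded as differentiable
    y = x * 3.0
print(tape.gradient(y, a))  # None

The tape sees y as a function of x only; the dependence of x's value on a is lost at the assign.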
In the Adaptive Personalized Federated Learning paper, the update step for alpha is based on the gradient of the loss of model m on a minibatch. I tried it with and without the explicit watch, but it always leads to No gradients provided for any variable:
with tf.GradientTape(watch_accessed_variables=False) as tape:
    tape.watch([alpha])
    outputs_m = m.forward_pass(batch)
# grad is always None here.
grad = tape.gradient(outputs_m.loss, alpha)
optimizer.apply_gradients(zip([grad], [alpha]))
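The identity above suggests a workaround that avoids differentiating through the assign entirely: take the gradient with respect to m's trainable variables and contract it with (v - w). A sketch, reusing v_weights, w_weights, and m_weights from the first snippet:

with tf.GradientTape() as tape:
    outputs_m = m.forward_pass(batch)
# Gradient of the loss w.r.t. the combined model's own variables.
grads_m = tape.gradient(outputs_m.loss, m_weights.trainable)
# Chain rule: d(loss)/d(alpha) = sum_i <grad_m_i, v_i - w_i>.
grad_alpha = tf.add_n([
    tf.reduce_sum(g * (v_t - w_t))
    for g, v_t, w_t in zip(tf.nest.flatten(grads_m),
                           tf.nest.flatten(v_weights.trainable),
                           tf.nest.flatten(w_weights.trainable))])
optimizer.apply_gradients([(grad_alpha, alpha)])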
Some more information about the initialization of the models: m.forward_pass(batch) is the default implementation from tff.learning.Model (found here), obtained by wrapping a tf.keras.Sequential model with tff.learning.from_keras_model.
def model_fn():
    keras_model = create_keras_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=element_spec,
        loss=tf.keras.losses.MeanSquaredError(),
        metrics=[tf.keras.metrics.MeanSquaredError(),
                 tf.keras.metrics.MeanAbsoluteError()],
    )
w = model_fn()
v = model_fn()
m = model_fn()
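(create_keras_model is not shown here; a minimal stand-in consistent with the MeanSquaredError loss above could look like the following, where the layer sizes and input shape are placeholders rather than the actual setup.)

def create_keras_model():
    # Hypothetical regression model; only Sequential + MSE is implied above.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])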
Some more experimenting, as suggested by Zachary Garrett below:
It seems that whenever this weighted sum is calculated and the new weights are assigned to the model, the tape loses track of the trainable variables of the two summed models (the assign again severs the path back to alpha). As before, it leads to No gradients provided for any variable whenever optimizer.apply_gradients(zip([grad], [alpha])) is called; all gradients are None.
with tf.GradientTape() as tape:
    alpha = tf.Variable(0.01, name='Alpha', constraint=lambda t: tf.clip_by_value(t, 0, 1))
    # Weighted sum computed inside the tape...
    m_weights_trainable = tf.nest.map_structure(
        lambda w_t, v_t: alpha * v_t + (1.0 - alpha) * w_t,
        w.trainable_variables, v.trainable_variables)
    m_weights = tff.learning.ModelWeights.from_model(m)
    # ...but the assign below is stateful, so the tape still cannot
    # connect the forward pass to alpha.
    tf.nest.map_structure(lambda var, t: var.assign(t),
                          m_weights.trainable, m_weights_trainable)
    outputs_m = m.forward_pass(batch)
grad = tape.gradient(outputs_m.loss, alpha)
optimizer.apply_gradients(zip([grad], [alpha]))
Another edit:
I think I have a strategy to get this working, but it feels like bad practice, and manually setting trainable_weights or _trainable_weights does not work. Any tips on improving this?
def do_weighted_combination():
    def _mapper(target_layer, v_layer, w_layer):
        # Overwrite the layer attributes with tensors that depend on alpha.
        target_layer.kernel = v_layer.kernel * alpha + w_layer.kernel * (1 - alpha)
        target_layer.bias = v_layer.bias * alpha + w_layer.bias * (1 - alpha)
    tf.nest.map_structure(_mapper, m.layers, v.layers, w.layers)

with tf.GradientTape(persistent=True) as tape:
    do_weighted_combination()
    predictions = m(x_data)
    loss = m.compiled_loss(y_data, predictions)

g1 = tape.gradient(loss, v.trainable_weights)  # not None
g2 = tape.gradient(loss, alpha)                # not None
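If I understand correctly, this works because _mapper replaces each layer's kernel and bias attributes with plain tensors that depend on alpha, so the tape can trace the forward pass back to alpha; the trade-off is that m's real variables no longer hold its effective weights, so do_weighted_combination() must be re-run inside the tape before every forward pass. With g2 available, the alpha step itself is then just:

# Gradient step on alpha; the constraint re-clips it into [0, 1].
optimizer.apply_gradients([(g2, alpha)])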