GradientTape for variable weighted sum of two Sequential models in TensorFlow

Question

Suppose we want to minimize the following equation using gradient descent:

min f(alpha * v + (1 - alpha) * w), where v and w are the model weights and alpha is a mixing weight between 0 and 1. The weighted sum gives the combined model v_bar or ū (referred to here as m).

alpha = tf.Variable(0.01, name='Alpha', constraint=lambda t: tf.clip_by_value(t, 0, 1))
w_weights = tff.learning.ModelWeights.from_model(w)
v_weights = tff.learning.ModelWeights.from_model(v)
m_weights = tff.learning.ModelWeights.from_model(m)

m_weights_trainable = tf.nest.map_structure(lambda v, w: alpha*v + (tf.constant(1.0) - alpha)*w, v_weights.trainable, w_weights.trainable)
tf.nest.map_structure(lambda v, t: v.assign(t), m_weights.trainable, m_weights_trainable)

In the Adaptive Personalized Federated Learning paper, the update step for alpha uses the gradients of model m evaluated on a minibatch. I tried this both with and without tape.watch, but it always leads to "No gradients provided for any variable".

with tf.GradientTape(watch_accessed_variables=False) as tape:
   tape.watch([alpha])
   outputs_m = m.forward_pass(batch)
grad = tape.gradient(outputs_m.loss, alpha)
optimizer.apply_gradients(zip([grad], [alpha]))
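For reference, the quantity that the tape is expected to return can be written down by hand. By the chain rule, the derivative of f(alpha * v + (1 - alpha) * w) with respect to alpha is the inner product of (v - w) with the gradient of f at the combined weights m. The sketch below checks this against a finite difference in plain Python. The quadratic loss and the two-element weight vectors are hypothetical stand-ins, not the models from the question.

```python
def combine(alpha, v, w):
    """Element-wise weighted sum m = alpha * v + (1 - alpha) * w."""
    return [alpha * vi + (1.0 - alpha) * wi for vi, wi in zip(v, w)]


def loss(m):
    """Toy quadratic loss standing in for f (an assumption for illustration)."""
    return sum((mi - 1.0) ** 2 for mi in m)


def grad_loss(m):
    """Gradient of the toy loss with respect to the combined weights m."""
    return [2.0 * (mi - 1.0) for mi in m]


def grad_alpha(alpha, v, w):
    """Chain rule: d loss / d alpha = <v - w, grad_loss(m)>."""
    g = grad_loss(combine(alpha, v, w))
    return sum((vi - wi) * gi for vi, wi, gi in zip(v, w, g))


v = [1.0, 2.0]
w = [3.0, 4.0]
alpha = 0.25

analytic = grad_alpha(alpha, v, w)

# Central finite difference as an independent check of the chain-rule result.
eps = 1e-6
numeric = (loss(combine(alpha + eps, v, w)) - loss(combine(alpha - eps, v, w))) / (2 * eps)
```

If the tape returns None for alpha, it means no differentiable path from alpha to the loss was recorded, not that this derivative is zero.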

Some more information about the initialization of the models:

m.forward_pass(batch) is the default implementation from tff.learning.Model (found here). The model is created with tff.learning.from_keras_model from a tf.keras.Sequential model.

def model_fn():
   keras_model = create_keras_model()
   return tff.learning.from_keras_model(
     keras_model,
     input_spec = element_spec,
     loss = tf.keras.losses.MeanSquaredError(),
     metrics = [tf.keras.metrics.MeanSquaredError(),
                tf.keras.metrics.MeanAbsoluteError()],
   )
w = model_fn()
v = model_fn()
m = model_fn()

Some more experimenting as suggested below by Zachary Garrett:

It seems that whenever this weighted sum is calculated and the new weights are assigned to the model, it loses track of the trainable variables of both summed models. Again, this leads to "No gradients provided for any variable" whenever optimizer.apply_gradients(zip([grad], [alpha])) is called. All gradients seem to be None.

with tf.GradientTape() as tape:
   alpha = tf.Variable(0.01, name='Alpha', constraint=lambda t: tf.clip_by_value(t, 0, 1))

   m_weights_trainable = tf.nest.map_structure(lambda w, v: tf.math.scalar_mul(alpha, v, name=None) + tf.math.scalar_mul(tf.constant(1.0) - alpha, w, name=None),
                                w.trainable,
                                v.trainable)

   m_weights = tff.learning.ModelWeights.from_model(m)
   tf.nest.map_structure(lambda v, t: v.assign(t), m_weights.trainable,
                  m_weights_trainable)

   outputs_m = m.forward_pass(batch)

grad = tape.gradient(outputs_m.loss, alpha)
optimizer.apply_gradients(zip([grad], [alpha]))
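A minimal sketch of why the assignment step loses the gradient, using plain tf.Variable objects as hypothetical stand-ins for the model weights. Variable.assign copies values into m, but it is a stateful update rather than a differentiable operation, so the tape records no path from alpha to a loss that reads m. A loss computed directly from the combined tensor keeps that path.

```python
import tensorflow as tf

# Hypothetical stand-ins for the weights of v, w and the combined model m.
alpha = tf.Variable(0.25)
v = tf.Variable([1.0, 2.0])
w = tf.Variable([3.0, 4.0])
m = tf.Variable([0.0, 0.0])

with tf.GradientTape(persistent=True) as tape:
    combined = alpha * v + (1.0 - alpha) * w

    # Path 1: copy the values into m, then compute the loss from m.
    m.assign(combined)
    loss_via_assign = tf.reduce_sum(tf.square(m))

    # Path 2: compute the loss directly from the combined tensor.
    loss_direct = tf.reduce_sum(tf.square(combined))

grad_via_assign = tape.gradient(loss_via_assign, alpha)  # no path back to alpha
grad_direct = tape.gradient(loss_direct, alpha)          # path through `combined`
del tape

print("via assign:", grad_via_assign)
print("direct:", grad_direct)
```

The same reasoning would apply to gradients with respect to v and w: once the values pass through assign, the tape can only connect the loss to m itself.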

Another edit: I think I have a strategy to get it working, but it is bad practice as manually setting trainable_weights or _trainable_weights does not work. Any tips on improving this?

  def do_weighted_combination():

    def _mapper(target_layer, v_layer, w_layer):
      target_layer.kernel = v_layer.kernel * alpha + w_layer.kernel * (1-alpha)
      target_layer.bias = v_layer.bias * alpha + w_layer.bias * (1-alpha)

    tf.nest.map_structure(_mapper, m.layers, v.layers, w.layers)


  with tf.GradientTape(persistent=True) as tape: 
    do_weighted_combination()

    predictions = m(x_data)
    loss = m.compiled_loss(y_data, predictions)


  g1 = tape.gradient(loss, v.trainable_weights) # Not None
  g2 = tape.gradient(loss, alpha) # Not None

Could the question be extended to show how `m.forward_pass` is implemented? – Zachary Garrett, Jun 03 '22 at 18:33

score 0 · Answer 1 · answered Jun 03 '22 at 18:37

For TensorFlow auto-differentiation using tf.GradientTape, operations must occur within the tf.GradientTape Python context manager so that TensorFlow can "see" them.

What is possibly happening here is that alpha is used outside (before) the tape context, when the model variables are set. When m.forward_pass is then called, TensorFlow doesn't see any access to alpha and thus can't compute a gradient for it, returning None instead.

Moving the

alpha*v + (tf.constant(1.0) - alpha)*w, v_weights.trainable, w_weights.trainable

logic inside the tf.GradientTape context manager (possibly inside m.forward_pass) may be a solution.
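A minimal sketch of that idea, under several assumptions: it bypasses tff.learning.Model and m.forward_pass entirely, uses a hypothetical single Dense layer as the architecture, and replaces the model call with a hand-written forward pass so that the combined tensors (not assigned copies) feed the loss. The data shapes, learning rate, and the manual gradient step are illustrative choices, not part of the original question.

```python
import tensorflow as tf

tf.random.set_seed(0)


def create_keras_model():
    # Hypothetical architecture: one Dense layer, 3 inputs, 1 output.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(3,)),
        tf.keras.layers.Dense(1),
    ])


v = create_keras_model()
w = create_keras_model()
alpha = tf.Variable(0.01, name="alpha")
learning_rate = 0.1

# Random minibatch standing in for real data.
x = tf.random.normal((8, 3))
y = tf.random.normal((8, 1))

with tf.GradientTape(persistent=True) as tape:
    # Build the combined weights inside the tape so the ops are recorded.
    kernel, bias = [
        alpha * v_var + (1.0 - alpha) * w_var
        for v_var, w_var in zip(v.trainable_variables, w.trainable_variables)
    ]
    # Hand-written forward pass of the Dense layer using the combined tensors.
    predictions = tf.matmul(x, kernel) + bias
    loss = tf.reduce_mean(tf.square(y - predictions))

grad_alpha = tape.gradient(loss, alpha)
grads_v = tape.gradient(loss, v.trainable_variables)
del tape

# Plain gradient-descent step on alpha, then clip to [0, 1].
alpha.assign_sub(learning_rate * grad_alpha)
alpha.assign(tf.clip_by_value(alpha, 0.0, 1.0))
```

For deeper models the forward pass would have to be written out layer by layer in the same way, which is the main cost of this approach.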

Zachary Garrett
In the documentation of `tff.learning.Model` it states one needs to define the variables during the initialization of the model. So I'll give that a try first, it is probably able to see the variables afterwards. – user19087072 Jun 12 '22 at 08:58
It is however not clear how to do this with `tff.learning.from_keras_model`. Looking into the source on GitHub shows that it does not have the option to do this. Does this mean I have to implement a slightly modified version of this `_Keras_Model`, with the other variables included? Or might it be easier to implement a custom `forward_pass`? – user19087072 Jun 12 '22 at 09:15
If `w`, `v` and `m` are all `tff.learning.Model` (presumably due to the use of `ModelWeights.from_model`), any chance they are Keras models before that? Using the [Functional Keras Model API](https://keras.io/guides/functional_api/#all-models-are-callable-just-like-layers), it's possible to create a combined Keras model from two models first, then use it with `tff.learning.from_keras_model`. – Zachary Garrett Jun 12 '22 at 14:39
Your guess is correct. I do indeed use `tff.learning.from_keras_model` for the models before I sum the weights. A first look at the documentation of the Functional Keras API shows mostly averaging of the inputs and outputs of layers. I assume there is no built-in function for this, as [this](https://stackoverflow.com/a/69709168/19087072) seems to provide what I need in terms of Keras logic? But then with a custom trainable parameter alpha like [this post](https://stackoverflow.com/a/64036147/19087072)? – user19087072 Jun 12 '22 at 15:48
I added some more information to my question of what I tried regarding your suggestion of the placement of `alpha` and the weighted sum. Is it possible that it loses track of the variables because it somehow calculates the numerical result of this sum in the `tf.nest.map_structure` ? Or might it be because of the assignment of these variables to the new model? – user19087072 Jun 14 '22 at 11:27
I would have expected that using the `WeightedSum` layer technique in the second post to create a new model that composes both `w` and `v` (replacing `m`) would have worked. Was this not successful? – Zachary Garrett Jun 14 '22 at 12:51
I gave it a try but stopped because my goal is not to have the weighted sum of the outputs of the models, but of the internal weights of both models. If I am not mistaken, this is not the same due to the nonlinearity of neural networks? I was also not able to see how I could achieve this with the internal weights. It might be unclear from my question but in fact, I also need `tape.gradient(outputs_m.loss, v.trainable)` to update `v` based on the sum of `v` and `w`. In that scenario, the gradients are all `None` as well. – user19087072 Jun 14 '22 at 16:28