
I tried to customize the model in the "Image classification" tutorial in TensorFlow Federated (it originally used a sequential model). I used a Keras ResNet50, but when training began, it always failed with an "Incompatible shapes" error.

Here is my code:

import tensorflow as tf
import tensorflow_federated as tff

NUM_CLIENTS = 4
NUM_EPOCHS = 10
BATCH_SIZE = 2
SHUFFLE_BUFFER = 5

def create_compiled_keras_model():
  model = tf.keras.applications.resnet.ResNet50(
      include_top=False, weights='imagenet',
      input_tensor=tf.keras.layers.Input(shape=(100, 300, 3)),
      pooling=None)

  model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      optimizer=tf.keras.optimizers.SGD(learning_rate=0.02),
      metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
  return model


def model_fn():
  keras_model = create_compiled_keras_model()
  # sample_batch is a single batch of (x, y) training data, defined elsewhere.
  return tff.learning.from_compiled_keras_model(keras_model, sample_batch)

iterative_process = tff.learning.build_federated_averaging_process(model_fn)

Error information: the stack trace (originally posted as an image) ends in an "Incompatible shapes" error.

I suspect the shapes are incompatible because the epoch and client information is somehow missing. I would be very thankful if someone could give me a hint.

Update:

The AssertionError happens during tff.learning.build_federated_averaging_process:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-164-dac26193d9d8> in <module>()
----> 1 iterative_process = tff.learning.build_federated_averaging_process(model_fn)
      2 
      3 # iterative_process = build_federated_averaging_process(model_fn)

13 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/learning/federated_averaging.py in build_federated_averaging_process(model_fn, server_optimizer_fn, client_weight_fn, stateful_delta_aggregate_fn, stateful_model_broadcast_fn)
    165   return optimizer_utils.build_model_delta_optimizer_process(
    166       model_fn, client_fed_avg, server_optimizer_fn,
--> 167       stateful_delta_aggregate_fn, stateful_model_broadcast_fn)

/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/learning/framework/optimizer_utils.py in build_model_delta_optimizer_process(model_fn, model_to_client_delta_fn, server_optimizer_fn, stateful_delta_aggregate_fn, stateful_model_broadcast_fn)
    349   # still need this.
    350   with tf.Graph().as_default():
--> 351     dummy_model_for_metadata = model_utils.enhance(model_fn())
    352 
    353   # ===========================================================================

<ipython-input-159-b2763ace8e5b> in model_fn()
      1 def model_fn():
      2   keras_model = model
----> 3   return tff.learning.from_compiled_keras_model(keras_model, sample_batch)

/usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/learning/keras_utils.py in from_compiled_keras_model(keras_model, dummy_batch)
    211   # Model.test_on_batch() once before asking for metrics.
    212   if isinstance(dummy_tensors, collections.Mapping):
--> 213     keras_model.test_on_batch(**dummy_tensors)
    214   else:
    215     keras_model.test_on_batch(*dummy_tensors)

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py in test_on_batch(self, x, y, sample_weight, reset_metrics)
   1007         sample_weight=sample_weight,
   1008         reset_metrics=reset_metrics,
-> 1009         standalone=True)
   1010     outputs = (
   1011         outputs['total_loss'] + outputs['output_losses'] + outputs['metrics'])

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2_utils.py in test_on_batch(model, x, y, sample_weight, reset_metrics, standalone)
    503       y,
    504       sample_weights=sample_weights,
--> 505       output_loss_metrics=model._output_loss_metrics)
    506 
    507   if reset_metrics:

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py in __call__(self, *args, **kwds)
    568         xla_context.Exit()
    569     else:
--> 570       result = self._call(*args, **kwds)
    571 
    572     if tracing_count == self._get_tracing_count():

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py in _call(self, *args, **kwds)
    606       # In this case we have not created variables on the first call. So we can
    607       # run the first trace but we should fail if variables are created.
--> 608       results = self._stateful_fn(*args, **kwds)
    609       if self._created_variables:
    610         raise ValueError("Creating variables on a non-first call to a function"

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py in __call__(self, *args, **kwargs)
   2407     """Calls a graph function specialized to the inputs."""
   2408     with self._lock:
-> 2409       graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
   2410     return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
   2411 

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py in _maybe_define_function(self, args, kwargs)
   2765 
   2766       self._function_cache.missed.add(call_context_key)
-> 2767       graph_function = self._create_graph_function(args, kwargs)
   2768       self._function_cache.primary[cache_key] = graph_function
   2769       return graph_function, args, kwargs

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py in _create_graph_function(self, args, kwargs, override_flat_arg_shapes)
   2655             arg_names=arg_names,
   2656             override_flat_arg_shapes=override_flat_arg_shapes,
-> 2657             capture_by_value=self._capture_by_value),
   2658         self._function_attributes,
   2659         # Tell the ConcreteFunction to clean up its graph once it goes out of

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py in func_graph_from_py_func(name, python_func, args, kwargs, signature, func_graph, autograph, autograph_options, add_control_dependencies, arg_names, op_return_value, collections, capture_by_value, override_flat_arg_shapes)
    979         _, original_func = tf_decorator.unwrap(python_func)
    980 
--> 981       func_outputs = python_func(*func_args, **func_kwargs)
    982 
    983       # invariant: `func_outputs` contains only Tensors, CompositeTensors,

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/def_function.py in wrapped_fn(*args, **kwds)
    437         # __wrapped__ allows AutoGraph to swap in a converted function. We give
    438         # the function a weak reference to itself to avoid a reference cycle.
--> 439         return weak_wrapped_fn().__wrapped__(*args, **kwds)
    440     weak_wrapped_fn = weakref.ref(wrapped_fn)
    441 

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/func_graph.py in wrapper(*args, **kwargs)
    966           except Exception as e:  # pylint:disable=broad-except
    967             if hasattr(e, "ag_error_metadata"):
--> 968               raise e.ag_error_metadata.to_exception(e)
    969             else:
    970               raise

AssertionError: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_eager.py:345 test_on_batch  *
        with backend.eager_learning_phase_scope(0):
    /usr/lib/python3.6/contextlib.py:81 __enter__
        return next(self.gen)
    /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py:425 eager_learning_phase_scope
        assert ops.executing_eagerly_outside_functions()

    AssertionError: 

Miao Zhang
  • Please copy your stack trace as text, instead of posting it as an image. – thushv89 Jan 07 '20 at 05:20
  • Great question! Can we see where `sample_batch` is coming from? – Keith Rush Jan 07 '20 at 17:44
  • Sorry, I tried commenting in code format but it looked so messy. `sample_batch` is one batch of training data. For example, if the batch size is 2, then `sample_batch` is `OrderedDict([('x', array([], [])), ('y', array([], []))])` – Miao Zhang Jan 07 '20 at 22:15
  • Could you please look at this issue: https://stackoverflow.com/questions/60060493/tff-invalid-argument-default-maxpoolingop-only-supports-nhwc-on-device-type-cp – Eliza Feb 05 '20 at 06:59

2 Answers


Ah, I believe this issue comes from mismatched expectations on sample_batch. TFF passes sample_batch to Keras, which runs a forward pass with this sample batch to initialize various attributes of the Keras model. sample_batch should be either a sample of the literal data you are going to feed the model on the server side, or a batch of fake data that matches the shape and type of the data you will be passing in.

An example of the former can be found here (this uses tf.data.Dataset), and there are several examples of the latter in test code, like here.

From what I can see of the model definition, the x element of your sample_batch should likely be an ndarray of shape [2, 100, 300, 3] (where 2 is the batch size, though technically it can be any nonzero dimension), and the y element should likewise match the expected y structure in the data you are using.
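For example, a minimal sketch of a fake sample_batch for this model. The y shape and dtype here are assumptions; they must match your real labels, and note that with include_top=False and pooling=None the ResNet50 output is a 4-D feature map, so plain integer labels will not match until a classification head is added:

import collections
import numpy as np

# Fake batch matching the model input: 2 images of 100x300x3.
# The 'y' entry is an assumption -- it must agree with your real label
# structure and the model's output shape.
sample_batch = collections.OrderedDict(
    x=np.zeros([2, 100, 300, 3], dtype=np.float32),
    y=np.zeros([2], dtype=np.int64))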

I hope this helps, just ping back if there are any problems!

One thing to note that may be helpful in thinking about TFF: TFF builds a syntax tree representing the distributed computation you define via build_federated_averaging_process. This error actually occurs during construction of that object. TFF must trace the computation you pass it in order to know what structure to generate, and that tracing is what raises the error here. Actual training of the model happens when you call next on the returned IterativeProcess.
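Concretely, a sketch of the usual driving loop (federated_train_data below is a stand-in for your list of per-client datasets):

# Building the process only traces and serializes the computation;
# no training happens yet.
iterative_process = tff.learning.build_federated_averaging_process(model_fn)

state = iterative_process.initialize()

# Training actually runs here: each call to `next` executes one round
# of federated averaging over the client datasets.
for round_num in range(1, 11):
  state, metrics = iterative_process.next(state, federated_train_data)
  print('round {}, metrics={}'.format(round_num, metrics))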

Keith Rush
  • Thank you so much! I adjusted the output layer of the model to match my `y` shape, which solved the incompatible-shapes problem. I successfully got a `tff.learning.Model` from `tff.learning.from_compiled_keras_model`, but when I run `iterative_process = tff.learning.build_federated_averaging_process(model_fn)`, there is an AssertionError that is really hard to track. I have spent a lot of time on it but have no idea where it came from. Could you please give me any suggestions? The error is attached in the update to the question. – Miao Zhang Jan 08 '20 at 18:48
  • Hm, this is interesting, we just saw something like this from another source. I'll dig in – Keith Rush Jan 08 '20 at 19:57
  • Sounds good! Please feel free to let me know for any findings. Thanks a lot! – Miao Zhang Jan 08 '20 at 23:10
  • Just a quick update: we've been diving deep on an internal repro, almost ready to file a bug--as far as we can tell, something inside of Keras training is not being set correctly for some reason. I'll link that bug here if/when we file it. – Keith Rush Jan 16 '20 at 17:49
  • Cool! And not sure if this information helps, but I found that the assertion happens when I use the Keras `Model` class, even if I just use it to set up one dense layer. If I use `keras.Sequential()`, there's no such issue, and applications like `keras.applications.resnet.ResNet50` have no issue either. – Miao Zhang Jan 17 '20 at 00:44
  • There is a hypothesis leading to the following suggestion: wherever you import tensorflow in your script, can you try moving the line `tf.compat.v1.enable_v2_behavior()` to *the next line* after this import? – Keith Rush Jan 17 '20 at 01:19
  • I removed this line but still got the same error...:( Should I try disabling v2 behavior? – Miao Zhang Jan 17 '20 at 05:05
  • It was solved! It turned out that the model has to be built and returned inside `create_compiled_keras_model()`, rather than built outside the function, which might be the reason! – Miao Zhang Jan 17 '20 at 07:37
  • Ah, yes, this makes sense. TFF needs to control virtually everything about model instantiation (especially since instantiating a model can create `tf.Variables`, which may not play well with the remainder of the TF/TFF runtime). Glad to hear there is a solution! – Keith Rush Jan 17 '20 at 22:04
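For reference, a minimal sketch of the working pattern described in the comments above (build the model inside the function that TFF calls, instead of capturing a model built elsewhere):

def create_compiled_keras_model():
  # Build a fresh model on every call; TFF needs to control when and
  # where the model's variables are created.
  model = tf.keras.applications.resnet.ResNet50(
      include_top=False, weights='imagenet',
      input_tensor=tf.keras.layers.Input(shape=(100, 300, 3)),
      pooling=None)
  model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      optimizer=tf.keras.optimizers.SGD(learning_rate=0.02),
      metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
  return model

def model_fn():
  # Broken: `keras_model = model`, capturing a model built outside.
  # Working: construct the model inside model_fn.
  keras_model = create_compiled_keras_model()
  return tff.learning.from_compiled_keras_model(keras_model, sample_batch)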

I have the same problem. If I execute these lines:

state, metrics = iterative_process.next(state, federated_train_data)
print('round 1, metrics={}'.format(metrics))

I get this error:

InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Default MaxPoolingOp only supports NHWC on device type CPU
         [[{{node StatefulPartitionedCall/StatefulPartitionedCall/sequential/vgg16/block1_pool/MaxPool}}]]
         [[subcomputation/StatefulPartitionedCall_1/ReduceDataset]]
         [[subcomputation/StatefulPartitionedCall_1/ReduceDataset/_140]]
  (1) Invalid argument: Default MaxPoolingOp only supports NHWC on device type CPU
         [[{{node StatefulPartitionedCall/StatefulPartitionedCall/sequential/vgg16/block1_pool/MaxPool}}]]
         [[subcomputation/StatefulPartitionedCall_1/ReduceDataset]]
0 successful operations. 0 derived errors ignored.

Note that I am using VGG16. Do you have any idea about this type of error?
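One common cause of this particular error (an assumption on my part, not confirmed in this thread) is Keras being configured for channels-first image data, while the CPU max-pooling kernel only supports channels-last (NHWC). A quick check, as a sketch:

import tensorflow as tf

# The CPU MaxPooling kernel only supports NHWC (channels-last) input,
# so the Keras image data format should be 'channels_last'.
print(tf.keras.backend.image_data_format())
tf.keras.backend.set_image_data_format('channels_last')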

Eliza