
I'm hoping to explore how data augmentation works in federated learning, and I'm currently using tff to implement it. I noticed that the datasets provided by tff are composed of tensors, and tensors cannot be modified directly, so a naive idea would be to convert them to numpy arrays and then do the augmentation. I tried

tfds.as_numpy(emnist_train.create_tf_dataset_for_client(n)) and it did provide me with numpy arrays, but I ran into problems when trying to pass the result to my preprocess function. If I do: preprocess(tfds.as_numpy(emnist_train.create_tf_dataset_for_client(n)))

where preprocess is defined as

def preprocess(dataset):

    def batch_format_fn(element):
        """Flatten a batch `pixels` and return the features as an `OrderedDict`."""

        return collections.OrderedDict(
            x=tf.reshape(element['pixels'], [-1, 784]),
            y=tf.reshape(element['label'], [-1, 1]))

    return dataset.repeat(NUM_EPOCHS).shuffle(SHUFFLE_BUFFER).batch(
        BATCH_SIZE).map(batch_format_fn).prefetch(PREFETCH_BUFFER)

I would get the following error:

return dataset.repeat(NUM_EPOCHS).shuffle(SHUFFLE_BUFFER).batch(
AttributeError: '_IterableDataset' object has no attribute 'repeat'

which seems to mean that this _IterableDataset object of numpy arrays does not support these tf.data methods.

I also tried wrapping it with tf.data.Dataset.from_tensor_slices, as in tf.data.Dataset.from_tensor_slices(tfds.as_numpy(emnist_train.create_tf_dataset_for_client(n))), but that ends up with this error:

ValueError: Attempt to convert a value (<tensorflow_datasets.core.dataset_utils._IterableDataset object at 0x00000280AA695DF0>) with an unsupported type (<class 'tensorflow_datasets.core.dataset_utils._IterableDataset'>) to a Tensor.

Is there any way to solve this problem? Or can I do the augmentation directly on the data it provides?

Update 1

It would be enough to just use the map function if I only wanted to convert each sample in the dataset to an augmented one. However, if I want to add new samples to the dataset (e.g. adding samples with different labels), how can I do it? Since we can't modify the client dataset directly, I was thinking of converting it to a numpy array and doing further processing, yet if I do:

state, metrics = iterative_process.next(state, tfds.as_numpy(federated_train_data)) where federated_train_data is a client dataset, I got

TypeError: Expected tensorflow.python.data.ops.dataset_ops.DatasetV2 or tensorflow.python.data.ops.dataset_ops.DatasetV1, found tensorflow_datasets.core.dataset_utils._IterableDataset.

It seems this _IterableDataset can't be passed to the process. Is there a way to convert this dataset back into something acceptable to tff.learning.build_federated_averaging_process()? Or is there a better way to do this kind of augmentation?

Update 2

I was trying to use the generator from a GAN model to produce new images to augment the dataset. I have a pretrained GAN (written with tf.keras), and I wrote a dataGenerator to wrap this model for augmenting the client datasets. However, when I run the fed-avg training, the following error occurs:

  File "D:\Research\GAN_AUG_FL\utils\augment_utils.py", line 53, in generate_once
    generated_images = generator(generator_input)
  File "D:\Research\GAN_AUG_FL\venv\lib\site-packages\tensorflow\python\keras\engine\base_layer_v1.py", line 665, in __call__
    self._assert_built_as_v1()
  File "D:\Research\GAN_AUG_FL\venv\lib\site-packages\tensorflow\python\keras\engine\base_layer_v1.py", line 836, in _assert_built_as_v1
    raise ValueError(
ValueError: Your Layer or Model is in an invalid state. This can happen for the following cases:
 1. You might be interleaving estimator/non-estimator models or interleaving models/layers made in tf.compat.v1.Graph.as_default() with models/layers created outside of it. Converting a model to an estimator (via model_to_estimator) invalidates all models/layers made before the conversion (even if they were not the model converted to an estimator). Similarly, making a layer or a model inside a a tf.compat.v1.Graph invalidates all layers/models you previously made outside of the graph.
2. You might be using a custom keras layer implementation with  custom __init__ which didn't call super().__init__.  Please check the implementation of <class 'tensorflow.python.keras.engine.functional.Functional'> and its bases.

Here generator is just the keras model used for generation. I suspect this is because the computation graph tff uses is different from the one in which I created the generator model instance. The code for training is just like the tutorial here.

emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data(only_digits=True, cache_dir="data/emnist")

example_dataset = emnist_train.create_tf_dataset_for_client(emnist_train.client_ids[0])
example_dataset = preprocess(example_dataset)

def model_fn():
    keras_model = create_keras_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=example_dataset.element_spec,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
    )


iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0),
)

state = iterative_process.initialize()

# state, metrics = iterative_process.next(state, [example_dataset])
# print('round  1, metrics={}'.format(metrics))


for round_num in range(NUM_ROUNDS):
    selected_clients = random.sample(emnist_train.client_ids, 1)
    federated_data = [
        preprocess(emnist_train.create_tf_dataset_for_client(n))
        for n in selected_clients
    ]
    state, metrics = iterative_process.next(state, federated_data)
    print(f"round  {round_num + 1}, metrics={metrics}")

But at this point something strange happens: if I uncomment the two lines before the loop, that first call runs smoothly, yet the same error is still reported once the loop starts. So I guess that after that first call tff is using some different graph? Is there any possible solution to this?


1 Answer


I think the answer is to just do the augmentation directly on the tf.data.Dataset objects, using tf.data.Dataset.map, following for instance this TF tutorial.
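For instance, a minimal sketch of that idea, assuming the EMNIST element structure (`pixels` as a 28x28 float image in [0, 1] plus `label`) and reusing the constants and batch_format_fn from the question's preprocess function; the brightness jitter and the names augment_fn / preprocess_with_augmentation are just placeholders:

    import collections
    import tensorflow as tf

    def augment_fn(element):
        # Per-example augmentation applied lazily via Dataset.map;
        # here only a small brightness jitter, clipped back into [0, 1].
        pixels = tf.image.random_brightness(element['pixels'], max_delta=0.1)
        pixels = tf.clip_by_value(pixels, 0.0, 1.0)
        return collections.OrderedDict(pixels=pixels, label=element['label'])

    def preprocess_with_augmentation(dataset):
        # Same pipeline as the question's preprocess(), with augment_fn mapped in.
        return (dataset
                .map(augment_fn)
                .repeat(NUM_EPOCHS)
                .shuffle(SHUFFLE_BUFFER)
                .batch(BATCH_SIZE)
                .map(batch_format_fn)
                .prefetch(PREFETCH_BUFFER))

Because the augmentation stays inside the tf.data pipeline, the resulting dataset is still a tf.data.Dataset and can be passed to the iterative process unchanged.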

The main reason is that TFF serializes the computations and executes them later, so whatever processing you do in Python / numpy would need to happen beforehand. In principle, you could also load, augment, and save a dataset using py/np, and then provide the augmented dataset to TFF, but that would likely be a complicated workaround.
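If you do go the numpy route, the key point is to rebuild a genuine tf.data.Dataset before handing anything to TFF, e.g. with tf.data.Dataset.from_tensor_slices over stacked arrays. A rough sketch, again assuming the EMNIST element structure; client_dataset_from_numpy is a hypothetical helper, not part of any library:

    import collections
    import numpy as np
    import tensorflow as tf

    def client_dataset_from_numpy(client_dataset):
        # Materialize the client's examples as numpy arrays, modify or extend
        # them in Python (placeholder below), then rebuild a tf.data.Dataset
        # that TFF can serialize.
        examples = list(client_dataset.as_numpy_iterator())
        pixels = np.stack([ex['pixels'] for ex in examples])
        labels = np.stack([ex['label'] for ex in examples])
        # ... any numpy-level augmentation / extra samples would go here ...
        return tf.data.Dataset.from_tensor_slices(
            collections.OrderedDict(pixels=pixels, label=labels))

The result of client_dataset_from_numpy could then be run through the usual preprocess() before being placed in federated_data, since it is an ordinary tf.data.Dataset again.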

Jakub Konecny
  • Thanks for your advice! It seems an augmentation function applied via map is suitable for this. I was stuck on the idea of doing augmentation over numpy arrays before. – Caplimbo Apr 17 '21 at 01:53
  • I was trying to do a different kind of augmentation, and this time I can't do it with map alone. Any possible solutions? – Caplimbo Aug 23 '21 at 02:39
  • From the updated Q, I assume you want to add new elements to a client's data. This, too, should be done outside of TFF, via `tf.data` manipulation. `tf.data.Dataset.concatenate` might be a good starting point (see the sketch after this comment thread). In general, https://www.tensorflow.org/guide/data could be a useful read. – Jakub Konecny Aug 23 '21 at 08:34
  • Dataset.concatenate seems to be a good way. Thanks again! – Caplimbo Aug 24 '21 at 12:38
  • I still ran into further problems when deploying another model to generate samples for augmenting the client dataset. Is it related to the graph setup of tff? – Caplimbo Aug 26 '21 at 12:28
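Following the `tf.data.Dataset.concatenate` suggestion in the comments, a rough sketch of adding extra samples to a client dataset might look like the following. Here augment_client_dataset is a hypothetical helper, and new_pixels / new_labels are assumed to be arrays of generated images and labels produced outside the TFF computation (e.g. by running the GAN beforehand):

    import collections
    import tensorflow as tf

    def augment_client_dataset(client_dataset, new_pixels, new_labels):
        # new_pixels must be float32 with shape [N, 28, 28] and new_labels
        # int32 with shape [N], so the element spec matches the EMNIST
        # client dataset and concatenate accepts the two datasets.
        extra = tf.data.Dataset.from_tensor_slices(
            collections.OrderedDict(pixels=new_pixels, label=new_labels))
        return client_dataset.concatenate(extra)

The concatenated dataset is still a plain tf.data.Dataset, so it can be fed through preprocess() and into iterative_process.next() like any other client dataset.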