
I am learning StyleGAN architecture and I got confused about the purpose of the Mapping Network. In the original paper it says:

Our mapping network consists of 8 fully-connected layers, and the dimensionality of all input and output activations — including z and w — is 512.

And there is no information about this network being trained in any way.

Like, wouldn’t it just generate some nonsense values?

I've tried creating a network like that (but with a smaller shape (16,)):

import tensorflow as tf
import numpy as np

model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(16,)))  # shape must be a tuple

for i in range(8):  # 8 fully-connected layers, as in the paper
  model.add(tf.keras.layers.Dense(16, activation='relu'))

and then evaluated it on some random values:

g = tf.random.Generator.from_seed(34)
model(
    g.normal(shape=(16, 16))
)

And I am getting some random outputs like:

array([[0.        , 0.01045225, 0.        , 0.        , 0.02217731,
        0.00940356, 0.02321716, 0.00556996, 0.        , 0.        ,
        0.        , 0.03117323, 0.        , 0.        , 0.00734158,
        0.        ],
       [0.03159791, 0.05680077, 0.        , 0.        , 0.        ,
        0.        , 0.05907414, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.03110216, 0.04647615, 0.        ,
        0.04566741],
       ...  # more similar vectors omitted
       [0.        , 0.01229661, 0.00056016, 0.        , 0.03534952,
        0.02654905, 0.03212402, 0.        , 0.        , 0.        ,
        0.        , 0.0913604 , 0.        , 0.        , 0.        ,
        0.        ]], dtype=float32)

What am I missing? Is there any information on the Internet about training Mapping Network? Any math explanation? Got really confused :(


2 Answers


As I understand it, the mapping network is not trained separately. It is part of the generator network, and its weights are updated from gradients just like every other part of the network.

In their StyleGAN generator code implementation, the Generator is composed of two sub-networks: a mapping network and a synthesis network. In the StyleGAN3 generator source this is easiest to see: the output of the mapping network is passed to the synthesis network, which generates the image.

class Generator(torch.nn.Module):
    ...
    def forward(self, z, ...):
        ws = self.mapping(z, ...)
        img = self.synthesis(ws, ...)
        return img
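Because the mapping and synthesis networks are composed in one forward pass, the generator's loss gradient reaches the mapping weights through the chain rule. A toy NumPy sketch with purely linear layers (illustrative only, not StyleGAN's actual layers or loss):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "generator": a linear mapping network followed by a linear
# synthesis network (real StyleGAN layers are non-linear; this only
# shows that gradients reach the mapping weights).
W_map = rng.normal(size=(4, 4))    # mapping network weights
W_syn = rng.normal(size=(4, 4))    # synthesis network weights

z = rng.normal(size=(4,))          # latent code
w = W_map @ z                      # intermediate latent
img = W_syn @ w                    # stand-in "image"
loss = 0.5 * np.sum(img ** 2)      # stand-in generator loss

# Chain rule, step by step:
d_img = img                        # dL/d(img) for the squared loss
d_w = W_syn.T @ d_img              # gradient flowing back through synthesis
d_W_map = np.outer(d_w, z)         # gradient of the mapping weights

# The mapping weights receive a non-zero gradient, so a generator
# update trains them together with the synthesis network.
print(np.abs(d_W_map).max() > 0)   # True
```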

The diagram below shows the mapping network from the StyleGAN 2019 paper; Section 2 of the paper describes it.

[Figure: generator diagram with the mapping network]

The mapping network is denoted f in the paper; it takes a noise vector z drawn from a normal distribution and maps it to an intermediate latent representation w. It is implemented as an 8-layer MLP, and the StyleGAN mapping network implementation sets the number of MLP layers to 8.
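A minimal NumPy sketch of such an 8-layer mapping network (the official implementation also normalizes z and uses LeakyReLU activations; the initialization and other details here are simplified assumptions):

```python
import numpy as np

def mapping_network(z, weights, alpha=0.2):
    """Sketch of f: z -> w as an 8-layer MLP (illustrative only)."""
    # Normalize the input latent (pixel norm over the feature axis).
    x = z / np.sqrt(np.mean(z ** 2) + 1e-8)
    for W in weights:                      # 8 fully-connected layers
        x = W @ x
        x = np.where(x > 0, x, alpha * x)  # LeakyReLU(0.2)
    return x

rng = np.random.default_rng(34)
dim = 512  # z and w are both 512-dimensional in the paper
weights = [rng.normal(size=(dim, dim)) * np.sqrt(2.0 / dim)
           for _ in range(8)]
w = mapping_network(rng.normal(size=(dim,)), weights)
print(w.shape)  # (512,)
```

Without training, this of course produces exactly the kind of "nonsense values" the question observes; they only become meaningful once the generator loss trains these weights end-to-end.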

In section 4 they mention,

a common goal is a latent space that consists of linear subspaces, each of which controls one factor of variation. However, the sampling probability of each combination of factors in Z needs to match the corresponding density in the training data.

A major benefit of our generator architecture is that the intermediate latent space W does not have to support sampling according to any fixed distribution.

So z and w have the same dimensionality, but w is more disentangled than z. Finding the w in the intermediate latent space W that corresponds to a given image enables targeted image editing.
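For instance, once w codes are available for two images (via an encoder or optimization), a smooth edit between them can be sketched as linear interpolation in W. A hypothetical sketch with random stand-in codes:

```python
import numpy as np

def lerp(w1, w2, t):
    """Linearly interpolate between two intermediate latents in W."""
    return (1.0 - t) * w1 + t * w2

rng = np.random.default_rng(0)
w1 = rng.normal(size=(512,))  # w code of image A (stand-in values)
w2 = rng.normal(size=(512,))  # w code of image B (stand-in values)

# t=0 reproduces image A's code, t=1 image B's code; intermediate t
# gives gradual edits because W is comparatively disentangled.
mid = lerp(w1, w2, 0.5)
print(np.allclose(lerp(w1, w2, 0.0), w1))  # True
print(np.allclose(lerp(w1, w2, 1.0), w2))  # True
```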

From the Encoder for Editing paper:

[Figure: editing in W space, from the Encoder for Editing paper]

In the StyleGAN2-ADA paper, among other changes, they found a mapping network depth of 2 to work better than 8. Accordingly, in the StyleGAN3 mapping network code the default number of layers is set to 2.


    Thank you very much! Looks like I've made a mistake by reading only StyleGAN2Ada paper and skipping references :\ – Fedoruka Feb 08 '22 at 08:58

I'm going to try a visual explanation of the "disentanglement" concept in context of the mapping network in StyleGAN.

Setting

In the figure below, let's consider the task of generative modeling of human faces. Here we have the prior latent space z and the learned posterior w (the terms prior and posterior are not strictly accurate here). We also consider two "factors" relevant to human faces: hair and eyes.

Explanation

In z we see that the subspaces of the hair and eye factors are mixed, while in w they are "disentangled". Since there is a non-linear mapping (the fully-connected layers) between z and w, sampling a point in z gives us a corresponding point in w. The difference is that the point in w now encodes one specific factor, whereas the point in z (possibly) encodes two or more factors.

This disentanglement gives us smoother control as we traverse the latent space. Hence the images produced by such a traversal vary gradually, in ways that are more interpretable to a human viewer.

[Figure: StyleGAN disentanglement]

Update: On second thought, the feature subspaces in w would be more like orthogonal lines (in contrast to the blobby regions shown in my diagram). An interesting aspect to think about is how the mapping network achieves this without any explicit supervision for disentanglement; surprisingly, the regular GAN gradients are already able to create such a feature space.
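The orthogonal-directions picture can be sketched in NumPy: if each factor corresponds to its own orthogonal direction in W, editing along one direction leaves the coordinate along the other unchanged. The direction names below are purely illustrative, not something StyleGAN exposes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical disentangled setup: two orthonormal directions in W,
# one per factor (the names are illustrative, not from StyleGAN).
d_hair = np.zeros(16); d_hair[0] = 1.0
d_eyes = np.zeros(16); d_eyes[1] = 1.0

w = rng.normal(size=(16,))
w_edited = w + 3.0 * d_hair  # traverse the "hair" direction

# The coordinate along the orthogonal "eyes" direction is unchanged,
# so the edit affects only one factor.
print(np.isclose(w @ d_eyes, w_edited @ d_eyes))  # True
```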
