
I'm trying to understand this implementation of Vision Transformers in Keras.

Here is the full code.

I can't understand why patches = tf.reshape(patches, [batch_size, -1, patch_dims]) returns a tensor (batch_size, num_patches, patch_dims) with shape (None, None, 108) instead of shape (None, 144, 108). Does this mean only a single patch is returned?

The shape of patches before the reshape is (None, 12, 12, 108), where 12 and 12 are the number of patches along the height and width of the image.

import tensorflow as tf
from tensorflow.keras import layers


class Patches(layers.Layer):
    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        # batch_size is a dynamic scalar tensor, not a Python int
        batch_size = tf.shape(images)[0]
        # Extract non-overlapping patch_size x patch_size patches;
        # each patch is flattened to patch_size * patch_size * channels values
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        patch_dims = patches.shape[-1]
        # Flatten the grid of patches into a sequence: (batch, num_patches, patch_dims)
        patches = tf.reshape(patches, [batch_size, -1, patch_dims])
        return patches
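
For reference, here is a minimal shape check on a concrete batch (a sketch, assuming a 72x72 RGB input and patch_size=6, which matches the 108 = 6*6*3 values per patch and the 12x12 grid above):

# Hypothetical sanity check, not part of the original code
images = tf.zeros((2, 72, 72, 3))      # batch of 2 RGB images, 72x72
patches = Patches(patch_size=6)(images)
print(patches.shape)                   # (2, 144, 108): 12*12 patches, 6*6*3 values each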

Later this tensor is passed to PatchEncoder(), which projects each 108-element patch through a 64-unit Dense layer. But shouldn't that be done for each of the 144 patches rather than for just one (the single patch apparently returned by Patches())?

That way I would get an embedding for each of the 144 patches: 144 different 64-dimensional vectors, each depending on the values of the corresponding patch.

class PatchEncoder(layers.Layer):
    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        # Linear projection from patch_dims (108 here) to projection_dim (64)
        self.projection = layers.Dense(units=projection_dim)
        # Learned positional embedding, one vector per patch index
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patch):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        # Project every patch and add the embedding of its position
        encoded = self.projection(patch) + self.position_embedding(positions)
        return encoded
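
And a small sanity check of the encoder on random patch values (a sketch, using the 144 patches and 64-dimensional projection from the model summary below):

# Hypothetical check, not part of the original code
dummy_patches = tf.random.uniform((2, 144, 108))
encoder = PatchEncoder(num_patches=144, projection_dim=64)
encoded = encoder(dummy_patches)
print(encoded.shape)                   # (2, 144, 64): one 64-d vector per patch
# The projection alone already differs from patch to patch, because the inputs differ:
projected = encoder.projection(dummy_patches)
print(bool(tf.reduce_any(projected[0, 0] != projected[0, 1])))  # True (almost surely)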

So I thought the embedding layer should return something like this, in which each patch gets different values based on the values of the actual patch:

**Embedding layer that I think should be returned**
    0.[0 0 0 ... 0]
    1.[1 1 1 ... 1]
    .
    .
    .
    143.[143 143 143 ... 143]

Instead of this, in which all the patch embeddings end up the same because of the shape returned by tf.reshape():

**Embedding layer that I think is returned but I don't understand if it makes sense**
    0.[0 0 0 ... 0]
    1.[0 0 0 ... 0]
    .
    .
    .
    143.[0 0 0 ... 0]

My question is: how does passing a tensor of shape (None, None, 108) make sense with this ViT implementation?
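
Here is a small snippet that contrasts the static shape Keras reports with the shape of an actual batch (a sketch, again assuming a 72x72 input and patch_size=6):

# Hypothetical comparison, not part of the original code
inputs = layers.Input(shape=(72, 72, 3))
symbolic_patches = Patches(patch_size=6)(inputs)
print(symbolic_patches.shape)          # (None, None, 108): static shape, batch size unknown
concrete_patches = Patches(patch_size=6)(tf.zeros((2, 72, 72, 3)))
print(concrete_patches.shape)          # (2, 144, 108): shape of an actual batch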

Here is also the summary of the model:

 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_3 (InputLayer)           [(None, 224, 224, 3  0           []                               
                                )]                                                                
                                                                                                  
 data_augmentation (Sequential)  (None, 72, 72, 3)   7           ['input_3[0][0]']                
                                                                                                  
 patches_2 (Patches)            (None, None, 108)    0           ['data_augmentation[1][0]']      
                                                                                                  
 patch_encoder_2 (PatchEncoder)  (None, 144, 64)     16192       ['patches_2[0][0]']              
1 Answer


In this implementation of the Vision Transformer model, each patch is passed through a PatchEncoder layer, which consists of a projection layer and an embedding layer. The projection layer maps the 108-dimensional patch representation to a 64-dimensional vector, while the embedding layer adds a positional encoding to each patch. The positional encoding is a vector that is added to the projected patch representation to encode the patch's position in the image.

It is important to note that the same projection layer, i.e. the same Dense weights, is applied to every patch, while the positional encoding differs for each patch position. Sharing the weights does not mean every patch is projected to the same 64-dimensional vector: patches with different pixel values produce different projections, and only identical patches would map to the same vector. The positional encoding, on the other hand, depends only on the index of the patch in the image, not on its content.
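
A small illustration of this weight sharing, using a standalone Dense layer and made-up 108-dimensional inputs (a sketch, not code from the original model):

# Hypothetical illustration with random "patches"
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

dense = layers.Dense(64)                # shared projection weights
a = tf.random.uniform((1, 108))         # one fake patch
b = tf.random.uniform((1, 108))         # a different fake patch
print(np.allclose(dense(a), dense(a)))  # True: same weights, same patch -> same vector
print(np.allclose(dense(a), dense(b)))  # False: same weights, different patch -> different vector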
