Distilled CLIP ViT with no CLS token

Implementation available here: https://github.com/TinyVolt/multimodal-patch-embeddings

This project introduces some minor changes to the conventional ViT architecture, especially the one used by OpenAI in CLIP. This new architecture is used to perform model distillation on pretrained CLIP ViTs. The attention tokens learned with this new architecture suffer considerably less from the artifacts described in the paper "Vision Transformers Need Registers".

Attention probs overlaid on image.

Left: 8x8 attention probs from the new model. Middle: 8x8 attention probs resized to 7x7. Right: 7x7 attention probs from CLIP ViT-B/32.

Key Problem/Idea

The idea of adding a CLS token to a ViT never sat well with me. A smaller but more concrete reason for my dissatisfaction with the current ViT structure is that all patch tokens are discarded in the last layer. These tokens may contain valuable information that goes unused. I wanted to try out an architecture without a CLS token.

So the main question here is: how do we get the final embedding for the image? The final embedding is calculated by taking a convex sum of all patch embeddings, that is, a weighted sum whose coefficients are non-negative and add up to 1. The coefficients of this sum are obtained by taking a softmax over certain scalar values in the patch embeddings. This is described in the next section.

Vision Transformer with an Extra Head

Here is the key idea: let's say the output dimension of the image embedding is output_dim. Then you can set the dimension of all intermediate patch embeddings to be output_dim + 1. In the last step, you take this extra value from each patch and use it as the logit for that patch. A softmax over these logits gives the coefficients of a convex sum of the patch embeddings of dimension output_dim.
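A minimal NumPy sketch of this pooling step, not the repository's code. It assumes the extra value is stored in the last position of each patch token; the function names and the sizes (64 patches, output_dim = 512) are illustrative only.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - x.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def pool_patches(tokens):
    """Pool patch tokens of shape (num_patches, output_dim + 1).

    Assumption: the last value of each token is the extra scalar used as a logit.
    """
    embeddings = tokens[:, :-1]    # (num_patches, output_dim)
    logits = tokens[:, -1]         # (num_patches,)
    weights = softmax(logits)      # non-negative, sums to 1
    image_embedding = weights @ embeddings  # convex sum, shape (output_dim,)
    return image_embedding, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 513))  # e.g. an 8x8 grid of patches, output_dim = 512
image_embedding, weights = pool_patches(tokens)
print(image_embedding.shape, weights.sum())
```

Because the weights come from a softmax, every patch contributes to the final embedding, and the weights themselves can be reshaped to the patch grid and viewed as an attention map.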

The problem is that output_dim + 1 is not divisible by the number of heads num_heads. In order to get around this problem, I added an extra dimension to each head. This is equivalent to adding an extra head to the attention layer, which is why I named this ViT VisionTransformerExtraHead. The dimension of each patch embedding thus became output_dim + num_heads. In the last step, I took the mean of the extra values from every head of each patch to get the logit for that patch. I also removed the final projection layer, although now I think keeping it may have improved the performance.
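The shape bookkeeping can be sketched as follows. With one extra value per head, the token width works out to output_dim + num_heads, which is divisible by num_heads again. This is only an illustration: the memory layout (each head's regular values followed by its one extra value), the function name, and the sizes are assumptions, not taken from the repository.

```python
import numpy as np

def pool_with_extra_head_dims(tokens, num_heads):
    """Pool tokens of shape (num_patches, output_dim + num_heads).

    Assumed layout: the token is split head by head, and each head holds
    head_dim regular values followed by one extra value.
    """
    num_patches, width = tokens.shape
    assert width % num_heads == 0, "token width must split evenly across heads"
    per_head = tokens.reshape(num_patches, num_heads, width // num_heads)

    # Regular values from every head, concatenated back to output_dim.
    embeddings = per_head[:, :, :-1].reshape(num_patches, -1)
    # One logit per patch: the mean of the extra values over all heads.
    logits = per_head[:, :, -1].mean(axis=1)

    # Softmax over patches, then a convex sum of the patch embeddings.
    exp = np.exp(logits - logits.max())
    weights = exp / exp.sum()
    return weights @ embeddings

output_dim, num_heads = 512, 8            # illustrative sizes
print((output_dim + 1) % num_heads)       # non-zero: cannot be split across heads
print((output_dim + num_heads) % num_heads)  # zero: one extra value per head fits

rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, output_dim + num_heads))
pooled = pool_with_extra_head_dims(tokens, num_heads)
print(pooled.shape)
```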


I trained a model with approximately 23M parameters. It is about 4.1 times smaller than the pretrained CLIP ViT (ViT-B/32). I used L1 distance on vectors of the same length (as a proxy for cosine distance) as the loss function.
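One way to read this loss, as a hedged sketch: rescale the student and teacher embeddings to a common L2 norm, then take the L1 distance between them. Once both vectors have the same length, they can only differ in direction, which is why the distance can stand in for cosine distance. The function names, the use of the L2 norm for "length", and the sum-then-mean reduction are assumptions; the default of 100 is the norm mentioned in the notes at the end of this README.

```python
import numpy as np

def rescale(vectors, norm=100.0, eps=1e-12):
    # Scale each row so that its L2 norm equals `norm`.
    lengths = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors * (norm / (lengths + eps))

def l1_distillation_loss(student, teacher, norm=100.0):
    """Mean over the batch of the L1 distance between rescaled embeddings."""
    diff = rescale(student, norm) - rescale(teacher, norm)
    return np.abs(diff).sum(axis=-1).mean()

teacher = np.array([[1.0, 2.0, 3.0]])
same_direction = 5.0 * teacher   # different length, same direction
opposite = -teacher              # opposite direction

print(l1_distillation_loss(same_direction, teacher))  # close to zero
print(l1_distillation_loss(opposite, teacher))        # large
```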

Below is a comparison of the new model with the pretrained model. The top row shows the results from the new model. The bottom row shows the results from the pretrained model.

Search term: "dog at beach". The pretrained model gives better results.

Search term: "a man taking a photo of food"

Search term: "yellow shirt". The new model gives better results.

Comparison of patch attention values

For each result, the leftmost image shows the 8x8 attention values from the newly trained model.
The middle image shows the same attention values resized to 7x7 (the same grid size as the pretrained ViT).
The rightmost image shows the 7x7 attention values from the pretrained ViT model.
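The resizing method is not stated here, so the following is only one plausible way to bring an 8x8 map down to 7x7 for a side-by-side comparison: separable linear (bilinear) interpolation, followed by renormalising so the values sum to 1 again. Both the interpolation scheme and the renormalisation are assumptions, and the helper names are hypothetical.

```python
import numpy as np

def interpolation_matrix(src, dst):
    """Return a (dst, src) matrix of linear-interpolation weights (rows sum to 1)."""
    # Map output pixel centers into input coordinates (half-pixel convention).
    centers = (np.arange(dst) + 0.5) * src / dst - 0.5
    centers = np.clip(centers, 0, src - 1)
    lower = np.floor(centers).astype(int)
    upper = np.minimum(lower + 1, src - 1)
    frac = centers - lower
    matrix = np.zeros((dst, src))
    rows = np.arange(dst)
    matrix[rows, lower] += 1.0 - frac
    matrix[rows, upper] += frac
    return matrix

def resize_attention(attn, size=7):
    """Resize a 2D attention map to (size, size) and renormalise to sum to 1."""
    row_weights = interpolation_matrix(attn.shape[0], size)
    col_weights = interpolation_matrix(attn.shape[1], size)
    resized = row_weights @ attn @ col_weights.T
    return resized / resized.sum()

rng = np.random.default_rng(0)
attn_8x8 = rng.random((8, 8))
attn_8x8 /= attn_8x8.sum()          # attention probabilities sum to 1
attn_7x7 = resize_attention(attn_8x8)
print(attn_7x7.shape, attn_7x7.sum())
```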

Things to note:

- The model was trained on ~3.1 million images.

- The model was NOT trained till convergence. I had to stop the training to try out other experiments. This means that the model has potential to perform better. The loss at this point was 1.5781 with a norm of 100 used in the loss function.