Distillation of CLIP ViT

A modified version of ViT trained on 3 million images. This architecture has no CLS embedding. Instead, the image embedding is a convex combination of the patch embeddings, all of which have the same norm. The logits that define the weights of this convex sum are also learned during training.
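The pooling step described above can be illustrated with a minimal NumPy sketch. It makes several assumptions that the description does not confirm: the patch embeddings are scaled to unit norm, the logits are turned into convex weights with a softmax, and random arrays stand in for real model outputs. The name `convex_pool` and the shapes (49 patches, 512 dimensions) are hypothetical.

```python
import numpy as np


def convex_pool(patch_embeddings, logits):
    """Pool patch embeddings into one image embedding with convex weights.

    patch_embeddings: array of shape (num_patches, dim)
    logits: array of shape (num_patches,), one logit per patch
    """
    # Assumption: "same norm" is realised by scaling every patch to unit norm.
    norms = np.linalg.norm(patch_embeddings, axis=-1, keepdims=True)
    patches = patch_embeddings / norms

    # Assumption: a softmax maps the logits to non-negative weights that sum
    # to 1, which is what makes the combination convex.
    shifted = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(shifted) / np.exp(shifted).sum()

    # Weighted sum over the patch axis gives a vector of shape (dim,).
    image_embedding = weights @ patches
    return image_embedding, weights


# Random stand-ins for real model outputs (hypothetical shapes).
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 512))
logits = rng.normal(size=49)

image_embedding, weights = convex_pool(patches, logits)
```

Because the weights are non-negative and sum to 1, the pooled vector lies inside the convex hull of the patch vectors, so its norm cannot exceed the shared patch norm. The weights can also be read directly as a per-patch importance score.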

The resulting model is more compact, and its attention to patches is semantically more meaningful than that of pretrained CLIP ViTs.

Multimodal patch embeddings

One small modification in the distillation of CLIP ViT leads to patch embeddings that are themselves multimodal. This allows one to compare a text vector not just with the image vector but also with the individual patch vectors.

Having multimodal patch embeddings opens up possibilities such as open-set detection without cross-attention between the modalities, interpretability of the model, and more.
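Comparing a text vector with the patch vectors can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity is used as the comparison, the patches form a square 7 x 7 grid, and random arrays stand in for the embeddings a real model and text encoder would produce. The name `patch_text_similarity` is hypothetical.

```python
import numpy as np


def patch_text_similarity(patch_embeddings, text_embedding):
    """Cosine similarity between one text vector and every patch vector.

    patch_embeddings: array of shape (num_patches, dim)
    text_embedding: array of shape (dim,)
    Returns an array of shape (num_patches,).
    """
    # Normalise both sides so the dot product equals the cosine similarity.
    patches = patch_embeddings / np.linalg.norm(
        patch_embeddings, axis=-1, keepdims=True
    )
    text = text_embedding / np.linalg.norm(text_embedding)
    return patches @ text


# Random stand-ins for real embeddings (hypothetical 7 x 7 grid, 512 dims).
rng = np.random.default_rng(0)
grid = 7
patch_embeddings = rng.normal(size=(grid * grid, 512))
text_embedding = rng.normal(size=512)

scores = patch_text_similarity(patch_embeddings, text_embedding)

# Reshape the per-patch scores back to the patch grid to get a coarse heatmap.
heatmap = scores.reshape(grid, grid)

# Grid location of the patch that best matches the text.
best_row, best_col = np.unravel_index(heatmap.argmax(), heatmap.shape)
```

With real embeddings, thresholding such a heatmap would give a rough localisation of the text query with no cross-attention between the modalities, and the same map shows which patches drive a given image-text match.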