Vision Transformers

Vision Transformers apply the Transformer architecture to image patches rather than text tokens. ViT (Vision Transformer) splits an image into fixed-size patches, embeds each patch as a vector, and processes the resulting sequence with standard Transformer blocks. ViTs are competitive with CNNs at scale, especially when trained on large datasets.
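
The patch-splitting and embedding step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `patchify` and `embed_patches`, the 224x224 input, the 16-pixel patch size, and the 64-dimensional embedding are all assumptions chosen for the example, and the weights are random rather than trained. It also assumes a learned position embedding is added to each patch and leaves out the class token and the Transformer blocks themselves.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(patches, weight, bias, pos_embed):
    """Project each flattened patch linearly, then add a position embedding."""
    return patches @ weight + bias + pos_embed


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # stand-in for a real RGB image
patch_size, dim = 16, 64            # illustrative sizes, not prescribed values

patches = patchify(image, patch_size)   # shape (196, 768): 14 x 14 patches
weight = rng.normal(scale=0.02, size=(patches.shape[1], dim))   # untrained
bias = np.zeros(dim)
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0], dim))  # untrained

tokens = embed_patches(patches, weight, bias, pos_embed)  # shape (196, 64)
print(patches.shape, tokens.shape)
```

The `tokens` array plays the same role as a sequence of embedded text tokens: each row is one patch, and the sequence can be passed to ordinary Transformer blocks.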


computer-vision vision-transformers vit