Vision Transformers
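Vision Transformers apply the Transformer architecture to image patches rather than text tokens. ViT (Vision Transformer) splits an image into fixed-size patches, linearly embeds each patch into a vector, adds position embeddings, and processes the resulting sequence with standard Transformer blocks. ViT is competitive with CNNs at scale, especially when trained on large datasets; with less data, CNNs tend to hold an advantage because of their built-in inductive biases (locality and translation equivariance).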
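The patch-splitting and embedding step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `patchify` and `embed_patches`, the 224x224 input, the patch size of 16, the embedding width of 64, and the random projection weights are all assumptions chosen for the example. Position embeddings and the Transformer blocks themselves are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image grid.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(patches, weight, bias):
    """Linearly project each flattened patch to an embedding vector."""
    return patches @ weight + bias

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patch_size = 16
embed_dim = 64

patches = patchify(image, patch_size)
# Random weights stand in for a learned projection.
weight = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
bias = np.zeros(embed_dim)
tokens = embed_patches(patches, weight, bias)

print(patches.shape, tokens.shape)
```

Each row of `tokens` plays the role that a token embedding plays in a text Transformer. In a full model the projection would be learned, and position embeddings would be added before the sequence enters the Transformer blocks.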
Related
- Transformers (underlying architecture)
- CNN Architectures (what ViT competes with)