Transformers
Back: Deep Learning
The dominant architecture in modern deep learning, based on self-attention rather than recurrence or convolution. Transformers process all positions of a sequence in parallel and capture long-range dependencies, powering LLMs, vision models, and multi-modal systems.
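A minimal sketch of single-head scaled dot-product self-attention, the core operation the concepts below build on. The toy dimensions, weight matrices, and function names are illustrative assumptions, not taken from any particular model or library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model). Returns (seq_len, d_v) contextualized vectors."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # One matrix product lets every position attend to every other position,
    # which is how long-range dependencies are captured in parallel.
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) attention logits
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ V

# Tiny example with made-up sizes.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 4)
```

Multi-head attention repeats this operation with several independent projection sets and concatenates the results; positional encoding adds order information that the attention operation itself ignores.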
Concepts
- Self-Attention
- Multi-Head Attention
- Positional Encoding
- Encoder-Decoder Architecture
- Encoder-Only Models
- Decoder-Only Models
- Scaling Laws
- Flash Attention
- Mixture of Experts