Self-Attention

Back to Transformers

The core mechanism of Transformers. Each position attends to all other positions: attention scores are computed from the Query and Key matrices, then used to take a weighted sum of the Values. Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V. Dividing by sqrt(d_k) keeps the dot products from growing large and saturating the softmax. Any two positions are connected by a constant-length path (O(1) sequential operations per layer), which makes long-range dependencies easy to capture.
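
A minimal NumPy sketch of the formula above, assuming Q, K, V have already been produced by the learned projections (single head, no masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V for Q:(n_q,d_k), K:(n_k,d_k), V:(n_k,d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) attention logits
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # (n_q, d_v) weighted Values

# Self-attention on a toy sequence: Q = K = V = X (projections omitted)
X = np.random.default_rng(0).standard_normal((3, 4))
print(scaled_dot_product_attention(X, X, X).shape)  # (3, 4)
```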


deep-learning transformers attention