Multi-Head Attention
Run multiple self-attention operations in parallel, each with its own learned query, key, and value projections. Each head can attend to different aspects of the input (e.g. syntax, semantics, position). The head outputs are concatenated and passed through a final linear projection, as sketched below.
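A minimal sketch in PyTorch, assuming a hypothetical `MultiHeadAttention` module name and that `d_model` is divisible by `num_heads`; it omits masking and dropout and is not tied to any specific library implementation.

```python
# Minimal multi-head attention sketch (assumed names; no masking/dropout).
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One learned projection per role (Q, K, V), plus the output projection.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = x.shape

        # Project, then split the model dimension into (num_heads, d_head)
        # so attention runs independently per head.
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        # Scaled dot-product attention per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = scores.softmax(dim=-1)
        heads = weights @ v  # (batch, num_heads, seq_len, d_head)

        # Concatenate the heads and apply the final linear projection.
        concat = heads.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.w_o(concat)


# Usage: 8 heads over a 512-dimensional model on a toy batch.
x = torch.randn(2, 10, 512)
out = MultiHeadAttention(d_model=512, num_heads=8)(x)
print(out.shape)  # torch.Size([2, 10, 512])
```

Splitting one `d_model`-wide projection into `num_heads` slices keeps the parameter count the same as a single head of width `d_model`, while letting each slice learn its own attention pattern.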
Related
- Self-Attention (single attention head)
- Layer Normalization (applied around attention)