Multi-Head Attention
Run multiple self-attention operations in parallel, each with its own learned query, key, and value projections. Each head can attend to different aspects of the input (e.g. syntax, semantics, position). The head outputs are concatenated and passed through a final linear projection, as sketched below.
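A minimal sketch in PyTorch, assuming a hypothetical `MultiHeadAttention` module name and that `d_model` is divisible by `num_heads`; it omits masking and dropout and is not tied to any specific library implementation.

```python
# Minimal multi-head attention sketch (assumed names; no masking/dropout).
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One learned projection per role (Q, K, V), plus the output projection.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = x.shape

        # Project, then split the model dimension into (num_heads, d_head)
        # so attention runs independently per head.
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        # Scaled dot-product attention per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = scores.softmax(dim=-1)
        heads = weights @ v  # (batch, num_heads, seq_len, d_head)

        # Concatenate the heads and apply the final linear projection.
        concat = heads.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.w_o(concat)


# Usage: 8 heads over a 512-dimensional model on a toy batch.
x = torch.randn(2, 10, 512)
out = MultiHeadAttention(d_model=512, num_heads=8)(x)
print(out.shape)  # torch.Size([2, 10, 512])
```

Splitting one `d_model`-wide projection into `num_heads` slices keeps the parameter count the same as a single head of width `d_model`, while letting each slice learn its own attention pattern.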
Related
- Self-Attention (single attention head)
- Layer Normalization (applied around attention)