Vocabulary

Back to Text Processing Pipeline

The mapping from tokens to integer IDs that the model uses. Includes special tokens for specific model needs.

Key Properties

  • Token-to-ID mapping (lookup table)
  • Special tokens: [CLS] (classification), [SEP] (separator), [PAD] (padding), [MASK] (masking), [BOS]/[EOS] (begin/end of sequence)
  • Vocabulary size tradeoff: larger = better coverage but more parameters

nlp vocabulary