Tokenization

Tokenization breaks text into tokens, the atomic units the model processes. The chosen strategy directly affects vocabulary size, handling of rare words, and downstream model performance.
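
A minimal sketch of the simplest (word-level) approach; the regex and function name are illustrative, not from any particular library:

```python
import re

def word_tokenize(text: str) -> list[str]:
    """Naive word-level tokenization: each word or punctuation mark becomes a token."""
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Tokenizers aren't trivial!"))
# -> ['Tokenizers', 'aren', "'", 't', 'trivial', '!']
```

Under this scheme every distinct surface form needs its own vocabulary entry, which is why word-level vocabularies grow large and unseen words become out-of-vocabulary (OOV).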

Key Methods

  • Word-level — split on whitespace/punctuation (as sketched above); vocabulary grows with every surface form, and unseen words become out-of-vocabulary (OOV)
  • Subword (BPE) — Byte-Pair Encoding; iteratively merges the most frequent adjacent symbol pair, starting from characters (or bytes, in GPT's byte-level variant); used in the GPT family (see the sketch after this list)
  • WordPiece — similar to BPE, but selects merges by how much they improve training-data likelihood rather than by raw frequency; used in BERT
  • SentencePiece — language-agnostic toolkit that operates on raw text without whitespace pre-tokenization (implements BPE and Unigram); used in LLaMA
  • Unigram — probabilistic; fits a unigram language model over candidate subwords and selects the most likely segmentation
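
A toy sketch of the core BPE training loop. This is character-level (not the byte-level variant GPT uses), and the corpus and function names are made up for illustration:

```python
from collections import Counter

def pair_counts(words: dict[tuple, int]) -> Counter:
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def apply_merge(words: dict[tuple, int], pair: tuple) -> dict[tuple, int]:
    """Rewrite every word, replacing each occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word starting as a sequence of characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for step in range(4):
    best = pair_counts(words).most_common(1)[0][0]  # most frequent adjacent pair
    words = apply_merge(words, best)
    print(f"merge {step + 1}: {best[0]!r} + {best[1]!r} -> {best[0] + best[1]!r}")
```

Each merge adds one entry to the vocabulary; running until a target vocabulary size is reached yields the merge table that a BPE tokenizer then applies greedily at inference time.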

nlp tokenization