Sander Land - Terrible tokenizer troubles in large language models | PyData Amsterdam 2024

Learn about critical tokenizer flaws in LLMs, their impact on model performance, and practical solutions for detecting and mitigating these issues in production systems.

Key takeaways
  • Tokenizers are a critical LLM component that converts text into token IDs, but they can have serious flaws that degrade model performance

  • Common tokenizer issues include (the first code sketch after this list illustrates a few of them):

    • Inconsistent handling of spaces and capitalization
    • Poor handling of non-English languages
    • Creation of “glitch tokens” that can break model outputs
    • Inefficient representation of common phrases
    • Uneven token embedding sizes based on frequency
  • Models can be broken or manipulated through:

    • Using rare/glitch tokens that have small embeddings
    • Exploiting tokenizer inconsistencies
    • Inserting invisible characters (see the second sketch after this list)
    • Using unusual Unicode sequences
  • Weight decay during training shrinks the embeddings of rare tokens, making model behavior unreliable when those tokens do appear at inference time

  • Recommendations for improving tokenizer robustness:

    • Inspect training data carefully
    • Implement guardrails for input validation
    • Use automated detection for problematic tokens, e.g. by scanning embedding norms (see the final sketch after this list)
    • Consider frequency caps for training examples
    • Verify model outputs when encountering unusual tokens
  • Current solutions are limited because:

    • Tokenizers are trained separately from models
    • Changes are difficult to implement after training
    • Complete fixes would require model retraining
    • Detection and mitigation are the best current approaches
  • The issues affect all major models, including GPT-4, Llama, and Mistral, though in different ways depending on their training data and tokenizer implementations
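
A minimal sketch of the space/capitalization and non-English issues listed above. It uses `tiktoken` with the `cl100k_base` vocabulary purely as a convenient stand-in for the tokenizers discussed in the talk:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family tokenizer

# The "same" word becomes different token IDs depending on a leading space
# and on capitalization.
for text in ["hello", " hello", "Hello", " Hello", "HELLO"]:
    ids = enc.encode(text)
    print(f"{text!r:>9} -> {ids} -> {[enc.decode([i]) for i in ids]}")

# Non-English text is often split into many more tokens per character,
# making it slower and more expensive to process.
for text in ["hello world", "こんにちは世界", "Привет, мир"]:
    print(f"{text!r}: {len(enc.encode(text))} tokens for {len(text)} characters")
```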
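
A second small illustration, this time of the invisible-character manipulation mentioned under "Models can be broken or manipulated": a zero-width space makes two visually identical strings tokenize completely differently, which can slip past naive string-matching guardrails. Again, `cl100k_base` is just an example tokenizer:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

clean = "please summarise this document"
# U+200B (zero-width space) is invisible when rendered but changes the token stream.
tampered = clean.replace("summarise", "summa\u200brise")

print(clean == tampered)      # False, even though the strings look identical
print(enc.encode(clean))
print(enc.encode(tampered))   # different token IDs for the "same" text
```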
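
Finally, a rough sketch of the kind of automated detection recommended above: scanning input-embedding norms for suspiciously small rows, which (per the weight-decay point) are candidate under-trained or glitch tokens. GPT-2 and the two-standard-deviation cut-off are arbitrary choices for illustration, not the talk's exact method:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small model used only to keep the example cheap
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Weight decay shrinks embeddings of tokens the model rarely (or never) saw,
# so unusually small L2 norms flag candidate glitch tokens.
embeddings = model.get_input_embeddings().weight.detach()
norms = embeddings.norm(dim=1)

threshold = norms.mean() - 2 * norms.std()  # heuristic cut-off
suspect_ids = (norms < threshold).nonzero().flatten().tolist()

print(f"{len(suspect_ids)} tokens with unusually small embeddings")
for token_id in suspect_ids[:20]:
    token = tokenizer.convert_ids_to_tokens(token_id)
    print(f"{token_id:>6}  {token!r:25}  norm={norms[token_id].item():.3f}")
```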