Sander Land - Terrible tokenizer troubles in large language models | PyData Amsterdam 2024
Learn about critical tokenizer flaws in LLMs, their impact on model performance, and practical solutions for detecting and mitigating these issues in production systems.
-
Tokenizers are a critical component of LLMs that converts text into sequences of numbers (token IDs), but they can have serious flaws that affect model performance
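A minimal sketch of what that conversion looks like, using the Hugging Face transformers library and the GPT-2 tokenizer purely as an example (the talk is not tied to this specific model; any BPE tokenizer behaves similarly):

```python
# Minimal sketch: a BPE tokenizer turns text into integer token IDs.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer

text = "Tokenizers convert text into numbers."
ids = tok.encode(text)
print(ids)                             # integer IDs -- the only thing the model ever sees
print(tok.convert_ids_to_tokens(ids))  # the subword pieces behind those IDs
```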
-
Common tokenizer issues include:
- Inconsistent handling of spaces and capitalization (illustrated in the sketch after this list)
- Poor handling of non-English languages
- Creation of “glitch tokens” that can break model outputs
- Inefficient representation of common phrases
- Uneven token embedding norms, depending on how often each token appears in the training data
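For example, the spacing and capitalization issue shows up by tokenizing trivial variants of the same word (a sketch, again using the GPT-2 tokenizer; exact IDs and splits differ per tokenizer):

```python
# Sketch: the "same" word gets different token IDs depending on a leading space or its casing.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

for variant in ["hello", " hello", "Hello", " Hello", " HELLO"]:
    ids = tok.encode(variant)
    print(repr(variant), "->", ids, tok.convert_ids_to_tokens(ids))

# Each variant maps to different IDs, and rarer casings may split into several pieces,
# so the model treats them as largely unrelated inputs.
```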
-
Models can be broken or manipulated through:
- Using rare/glitch tokens that have small embeddings
- Exploiting tokenizer inconsistencies
- Inserting invisible characters (see the guardrail sketch after this list)
- Using unusual Unicode sequences
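The invisible-character attack in particular is easy to screen for before text ever reaches the tokenizer. A minimal guardrail sketch; the character set and Unicode categories below are illustrative, not an exhaustive or recommended list:

```python
# Sketch of an input guardrail: flag invisible or unusual Unicode characters.
import unicodedata

SUSPICIOUS = {
    "\u200b",  # zero-width space
    "\u200c",  # zero-width non-joiner
    "\u200d",  # zero-width joiner
    "\u2060",  # word joiner
    "\ufeff",  # zero-width no-break space / BOM
}

def suspicious_characters(text: str) -> list[tuple[int, str]]:
    """Return (position, character name) pairs worth rejecting or stripping."""
    hits = []
    for i, ch in enumerate(text):
        if ch in SUSPICIOUS or unicodedata.category(ch) in {"Cf", "Co", "Cn"}:
            hits.append((i, unicodedata.name(ch, f"U+{ord(ch):04X}")))
    return hits

print(suspicious_characters("plain text"))           # []
print(suspicious_characters("hidden\u200bpayload"))  # [(6, 'ZERO WIDTH SPACE')]
```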
-
Weight decay during training shrinks the embeddings of rarely seen tokens, which makes models less reliable whenever those tokens appear at inference time
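A cheap way to turn this observation into a detector is to rank vocabulary entries by embedding norm: tokens that were rarely or never updated during training tend to have unusually small norms. A sketch of that scan (the model name and the cutoff of 20 are arbitrary choices, and real detection pipelines use more refined metrics than the raw norm):

```python
# Sketch: rank tokens by embedding norm; unusually small norms are a cheap
# signal for under-trained or "glitch" tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # any causal LM with an accessible input embedding matrix
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

emb = model.get_input_embeddings().weight.detach()  # shape: (vocab_size, hidden_dim)
norms = emb.norm(dim=1)

# The 20 tokens with the smallest embedding norms are candidate problem tokens.
for idx in torch.argsort(norms)[:20]:
    print(f"{norms[idx]:.3f}  {tok.convert_ids_to_tokens(int(idx))!r}")
```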
-
Recommendations for improving tokenizer robustness:
- Inspect training data carefully
- Implement guardrails for input validation (a sketch follows this list)
- Use automated detection for problematic tokens
- Consider frequency caps for training examples
- Verify model outputs when encountering unusual tokens
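Two of these recommendations, input guardrails and automated detection, can be combined into a simple pre-flight check. A sketch, assuming a blocklist of token IDs produced offline; the IDs below are hypothetical placeholders, not real glitch tokens:

```python
# Sketch of two cheap guardrails: (1) a round-trip check that encoding and decoding
# reproduces the input, and (2) a blocklist of token IDs flagged by offline analysis
# (e.g. the embedding-norm scan above).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
KNOWN_BAD_TOKEN_IDS = {30898, 39752}  # hypothetical placeholder IDs

def validate_prompt(text: str) -> list[str]:
    problems = []
    ids = tok.encode(text)
    if tok.decode(ids) != text:
        problems.append("round-trip mismatch: tokenizer does not reproduce the input exactly")
    bad = [i for i in ids if i in KNOWN_BAD_TOKEN_IDS]
    if bad:
        problems.append(f"input contains flagged token IDs: {bad}")
    return problems

print(validate_prompt("a perfectly ordinary prompt"))  # [] when nothing is flagged
```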
-
Current solutions are limited because:
- Tokenizers are trained separately from models
- Changes are difficult to implement after training
- Complete fixes would require model retraining
- Detection and mitigation are the best current approaches
-
The issues affect all major models including GPT-4, Llama, and Mistral, though in different ways based on their training data and tokenizer implementations