Sander Land - Terrible tokenizer troubles in large language models | PyData Amsterdam 2024

Learn about critical tokenizer flaws in LLMs, their impact on model performance, and practical solutions for detecting and mitigating these issues in production systems.

Key takeaways
  • Tokenizers are a critical LLM component that converts text into token IDs, but they can have serious flaws that degrade model performance

  • Common tokenizer issues include (the first code sketch after this list illustrates a few of them):

    • Inconsistent handling of spaces and capitalization
    • Poor handling of non-English languages
    • Creation of “glitch tokens” that can break model outputs
    • Inefficient representation of common phrases
    • Uneven token embedding sizes based on frequency
  • Models can be broken or manipulated through:

    • Using rare/glitch tokens that have small embeddings
    • Exploiting tokenizer inconsistencies
    • Inserting invisible characters (see the second sketch after this list)
    • Using unusual Unicode sequences
  • Weight decay during training shrinks the embeddings of rare tokens, making model behavior unreliable when those tokens do appear at inference time

  • Recommendations for improving tokenizer robustness:

    • Inspect training data carefully
    • Implement guardrails for input validation
    • Use automated detection for problematic tokens, e.g. by scanning embedding norms (see the final sketch after this list)
    • Consider frequency caps for training examples
    • Verify model outputs when encountering unusual tokens
  • Current solutions are limited because:

    • Tokenizers are trained separately from models
    • Changes are difficult to implement after training
    • Complete fixes would require model retraining
    • Detection and mitigation are the best current approaches
  • The issues affect all major models, including GPT-4, Llama, and Mistral, though in different ways depending on their training data and tokenizer implementations
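
A minimal sketch of the space/capitalization and non-English issues listed above. It uses `tiktoken` with the `cl100k_base` vocabulary purely as a convenient stand-in for the tokenizers discussed in the talk:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family tokenizer

# The "same" word becomes different token IDs depending on a leading space
# and on capitalization.
for text in ["hello", " hello", "Hello", " Hello", "HELLO"]:
    ids = enc.encode(text)
    print(f"{text!r:>9} -> {ids} -> {[enc.decode([i]) for i in ids]}")

# Non-English text is often split into many more tokens per character,
# making it slower and more expensive to process.
for text in ["hello world", "こんにちは世界", "Привет, мир"]:
    print(f"{text!r}: {len(enc.encode(text))} tokens for {len(text)} characters")
```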
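
A second small illustration, this time of the invisible-character manipulation mentioned under "Models can be broken or manipulated": a zero-width space makes two visually identical strings tokenize completely differently, which can slip past naive string-matching guardrails. Again, `cl100k_base` is just an example tokenizer:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

clean = "please summarise this document"
# U+200B (zero-width space) is invisible when rendered but changes the token stream.
tampered = clean.replace("summarise", "summa\u200brise")

print(clean == tampered)      # False, even though the strings look identical
print(enc.encode(clean))
print(enc.encode(tampered))   # different token IDs for the "same" text
```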
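
Finally, a rough sketch of the kind of automated detection recommended above: scanning input-embedding norms for suspiciously small rows, which (per the weight-decay point) are candidate under-trained or glitch tokens. GPT-2 and the two-standard-deviation cut-off are arbitrary choices for illustration, not the talk's exact method:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small model used only to keep the example cheap
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Weight decay shrinks embeddings of tokens the model rarely (or never) saw,
# so unusually small L2 norms flag candidate glitch tokens.
embeddings = model.get_input_embeddings().weight.detach()
norms = embeddings.norm(dim=1)

threshold = norms.mean() - 2 * norms.std()  # heuristic cut-off
suspect_ids = (norms < threshold).nonzero().flatten().tolist()

print(f"{len(suspect_ids)} tokens with unusually small embeddings")
for token_id in suspect_ids[:20]:
    token = tokenizer.convert_ids_to_tokens(token_id)
    print(f"{token_id:>6}  {token!r:25}  norm={norms[token_id].item():.3f}")
```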