Inside GPT – Large Language Models Demystified - Alan Smith - NDC Oslo 2024

Explore how large language models like GPT actually work, from tokens and embeddings to attention mechanisms and prompt engineering, with practical examples and insights.

Key takeaways
  • Language models work by predicting the statistical probability distribution of the next token in a sequence; they don’t truly “understand” text

  • Tokens are not words - they are common character sequences, drawn mostly from English text, and GPT’s vocabulary contains about 50,000 of them (see the tokenization sketch after this list)

  • Models use embeddings (768-dimensional vectors in GPT-2) to represent tokens mathematically, allowing the model to perform calculations and find relationships between tokens (see the embedding sketch after this list)

  • Temperature and top-P (nucleus sampling) are the key parameters for controlling randomness in token selection - lower temperature makes outputs more deterministic, while higher temperature allows more varied, creative output (see the sampling sketch after this list)

  • English is generally the most efficient language for prompts because the tokenizer is optimized for English text, so other languages require more tokens for the same content

  • The position and order of tokens in a sequence are critical - models use positional embeddings to understand where each token sits in the sequence

  • Models have a fixed context window and training data cutoff - they can’t remember beyond their context length or know about events after their training cutoff

  • Attention mechanisms allow models to identify relationships between tokens in the input sequence, replacing older RNN/LSTM approaches (see the attention sketch after this list)

  • Prompt engineering in English tends to work best, even when generating outputs in other languages

  • Retrieval augmented generation (RAG) can enhance models by providing them with access to external knowledge sources beyond their training data (sketched below)
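
Tokenization sketch. The points about tokens and English efficiency are easy to check with OpenAI’s open-source tiktoken library; this is a minimal illustration rather than material from the talk, and the example sentences are my own.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # the GPT-2 byte-pair encoding
print(enc.n_vocab)                    # 50257 entries in the vocabulary

english = "The quick brown fox jumps over the lazy dog."
norwegian = "Den raske brune reven hopper over den late hunden."

print(len(enc.encode(english)))       # roughly one token per word plus the period
print(len(enc.encode(norwegian)))     # typically more tokens for the same sentence
```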
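Embedding sketch. A rough picture of what a token embedding table is: one vector per vocabulary entry, and vector similarity as a measure of relatedness. The vectors below are random stand-ins (GPT-2’s are learned during training), so the similarity scores here are only illustrative.

```python
import numpy as np

# Toy stand-in for GPT-2's learned embedding table, which is a 50257 x 768 matrix.
rng = np.random.default_rng(0)
vocab = {"king": 0, "queen": 1, "banana": 2}          # hypothetical 3-token vocabulary
embedding_table = rng.normal(size=(len(vocab), 768))  # random here, learned in a real model

def cosine_similarity(a, b):
    """How close two token vectors are in embedding space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

king = embedding_table[vocab["king"]]
queen = embedding_table[vocab["queen"]]
banana = embedding_table[vocab["banana"]]

# With trained embeddings, related tokens ("king"/"queen") score noticeably higher
# than unrelated ones ("king"/"banana"); random vectors all score near zero.
print(cosine_similarity(king, queen), cosine_similarity(king, banana))
```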
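Sampling sketch. A minimal, hypothetical example of how the next token is picked from a probability distribution and how temperature and top-P shape that choice. The logits are made up and the vocabulary has five tokens instead of ~50,000.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Toy next-token sampler: temperature scaling followed by nucleus (top-p) filtering."""
    if rng is None:
        rng = np.random.default_rng()

    # Temperature scaling: a small temperature sharpens the distribution
    # (near-deterministic); a large temperature flattens it (more variety).
    scaled = logits / max(temperature, 1e-8)

    # Softmax turns the scaled logits into a probability distribution.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Nucleus sampling: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalise and sample from that set.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))

# Hypothetical logits for a 5-token vocabulary.
logits = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
print([sample_next_token(logits, temperature=0.2) for _ in range(5)])  # almost always token 0
print([sample_next_token(logits, temperature=1.5) for _ in range(5)])  # more varied choices
```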
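Attention sketch. A stripped-down view of how positional embeddings and attention fit together: random vectors, a single head, and no learned projection matrices, so this is not the full GPT architecture, just the core calculation.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Single-head attention: every token scores its relationship to every other token."""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # pairwise token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ v, weights

seq_len, d_model = 4, 8                                # tiny sizes; GPT-2 uses d_model = 768
rng = np.random.default_rng(1)

token_embeddings = rng.normal(size=(seq_len, d_model))
position_embeddings = rng.normal(size=(seq_len, d_model))

# The transformer's input is the sum of token and positional embeddings;
# this is how the model knows where each token sits in the sequence.
x = token_embeddings + position_embeddings

output, attention_weights = scaled_dot_product_attention(x, x, x)
print(attention_weights.round(2))   # each row: how strongly one token attends to the others
```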
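RAG sketch. A toy version of the retrieval-augmented generation idea: fetch relevant text, then place it in the prompt so the model can answer from knowledge outside its training data. The word-overlap retriever and example documents are placeholders; real systems use embedding similarity against a vector store.

```python
def retrieve(question, documents, top_k=1):
    """Toy retriever that ranks documents by word overlap with the question."""
    question_words = set(question.lower().split())
    scored = sorted(documents,
                    key=lambda doc: len(question_words & set(doc.lower().split())),
                    reverse=True)
    return scored[:top_k]

documents = [
    "NDC Oslo 2024 took place in Oslo in June 2024.",
    "GPT-2 represents each token as a 768-dimensional embedding vector.",
]

question = "When did NDC Oslo 2024 take place?"
context = "\n".join(retrieve(question, documents))

# The retrieved text is injected into the prompt, so the model can answer
# from knowledge that was never part of its training data.
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
print(prompt)
```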