Rob Romijnders - Differential Privacy Made Practical | PyData Amsterdam 2024

Learn how differential privacy protects individual data while enabling machine learning - a practical guide to implementing privacy-preserving data science with Python

Key takeaways
  • Differential privacy aims to protect individual data while allowing collective learning by adding controlled noise to results

  • Epsilon (ε) is the key privacy parameter:

    • ε=1 is considered the “gold standard” for good privacy protection
    • ε=3 or higher provides weak privacy guarantees
    • Lower epsilon means stronger privacy but more noise/reduced utility
  • The Laplace distribution is commonly used for adding noise because it yields an exact mathematical ε-DP guarantee when its scale is set in proportion to the query's sensitivity
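
For intuition, here is a minimal sketch of the Laplace mechanism (the function and values below are illustrative, not from the talk): the noise scale is sensitivity/ε, so lower ε (stronger privacy) directly means more noise.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release true_value with epsilon-DP via Laplace noise.

    The noise scale is sensitivity / epsilon: a lower epsilon
    (stronger privacy) means a larger scale and a noisier answer.
    """
    return true_value + np.random.laplace(scale=sensitivity / epsilon)

# A counting query has sensitivity 1: adding or removing one person
# changes the count by at most 1.
true_count = 4213  # hypothetical value
print(laplace_mechanism(true_count, sensitivity=1, epsilon=1.0))  # noise scale 1
print(laplace_mechanism(true_count, sensitivity=1, epsilon=0.1))  # noise scale 10
```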

  • Key applications include:

    • Contact tracing apps
    • Deep learning models
    • LLM fine-tuning
    • Census data
    • Medical records
  • Privacy budget concept:

    • Each query uses up some of the privacy budget
    • Multiple queries require dividing budget across operations
    • Pre-training on public data helps preserve budget for private fine-tuning
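
A toy sketch of budget accounting under basic sequential composition (the `PrivacyBudget` class is a hypothetical illustration, not a real library API): each answered query subtracts its ε from the total, and queries beyond the budget are refused.

```python
import numpy as np

class PrivacyBudget:
    """Toy accountant using basic sequential composition: the epsilons
    of all answered queries sum to at most the total budget."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def answer(self, true_value, sensitivity, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        return true_value + np.random.laplace(scale=sensitivity / epsilon)

budget = PrivacyBudget(total_epsilon=1.0)
noisy_count = budget.answer(4213, sensitivity=1, epsilon=0.5)      # spends half
noisy_sum = budget.answer(180_000, sensitivity=100, epsilon=0.5)   # spends the rest
# A third query would now raise RuntimeError: the budget is gone.
```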
  • Trade-offs exist between:

    • Privacy protection vs utility/accuracy
    • Dataset size vs amount of noise needed (see the sketch after this list)
    • Number of queries vs privacy preservation
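
One way to see the dataset-size trade-off: for a differentially private mean over n records, the sensitivity falls as 1/n, so with a fixed ε the noise shrinks as the dataset grows. A toy sketch with assumed bounds and values:

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon):
    """epsilon-DP mean of values clipped to [lower, upper].

    The mean's sensitivity is (upper - lower) / n, so for a fixed
    epsilon the added noise shrinks as the dataset grows.
    """
    values = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(values)
    return values.mean() + np.random.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):
    ages = rng.uniform(18, 90, size=n)
    print(n, round(dp_mean(ages, lower=18, upper=90, epsilon=1.0), 3))
```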
  • Simple anonymization or k-anonymity-style rules (e.g., only answering queries about groups of more than 50 people) are not sufficient for privacy protection
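
To see why such thresholds fail, consider a differencing attack, sketched below with toy data: two queries that each cover well over 50 people can still expose a single individual exactly.

```python
# Both queries pass a "groups must have more than 50 people" check,
# yet their difference reveals one person's salary exactly.
salaries = {f"user_{i}": 50_000 + 100 * i for i in range(100)}  # toy data

q_all = sum(salaries.values())                                      # 100 people
q_without = sum(v for k, v in salaries.items() if k != "user_42")   # 99 people
print(q_all - q_without)  # exactly user_42's salary
```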

  • Practical implementations exist in:

    • TensorFlow Privacy
    • Opacus (PyTorch's differential privacy library)
    • Android telemetry
    • Apple QuickType
    • Government census data
  • Composition theorems track how privacy loss adds up across multiple operations (the budget sketch above uses the simplest, basic composition); splitting the budget across many steps can significantly reduce model utility

  • Empirical privacy protection is often stronger than theoretical bounds, but theoretical guarantees are still important
