Bernice Waweru - Tricking Neural Networks: Explore Adversarial Attacks | PyData Global 2023

Learn how adversarial attacks trick neural networks, explore defense mechanisms, and understand security implications for machine learning models in this PyData Global talk.

Key takeaways
  • Adversarial attacks are carefully designed perturbations added to inputs that trick neural networks into producing incorrect outputs while remaining imperceptible to humans

  • Neural networks are more vulnerable to adversarial attacks than traditional ML models such as logistic regression, largely because they are trained by gradient descent: the same gradients that drive learning let an attacker compute input perturbations that increase the loss

  • Two main types of attacks:

    • White box attacks: The attacker knows the model's architecture, parameters, and training data
    • Black box attacks: The attacker sees only the model's outputs and must design attacks from those responses (see the query-based sketch below)
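
To make the black box setting concrete, here is a minimal sketch of a query-only attack: the attacker repeatedly nudges the input at random and keeps any change that lowers the model's confidence in the original class. The toy model, input, query budget, and step size are assumptions for illustration, not code from the talk:

```python
import torch

torch.manual_seed(0)

# Hypothetical "victim" model: the attacker cannot inspect it,
# only query its output probabilities.
model = torch.nn.Sequential(torch.nn.Linear(20, 10), torch.nn.Softmax(dim=-1))
model.eval()

def query(x):
    # The attacker's only access: model outputs for a given input.
    with torch.no_grad():
        return model(x)

x = torch.randn(1, 20)                    # original, correctly handled input
label = query(x).argmax(dim=-1).item()    # class the attacker wants to break

best, best_conf = x.clone(), query(x)[0, label].item()
for _ in range(500):                      # assumed query budget
    candidate = best + 0.05 * torch.randn_like(best)  # small random nudge
    conf = query(candidate)[0, label].item()
    if conf < best_conf:                  # keep nudges that reduce confidence
        best, best_conf = candidate, conf

print("confidence in original class:", round(best_conf, 3))
print("misclassified:", query(best).argmax(dim=-1).item() != label)
```
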
  • Adversarial attacks are transferable: attacks designed for one model can often fool other models, even those with different architectures (a sketch follows below)
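
A hedged illustration of transferability: two models with different random initializations are trained on the same synthetic task, FGSM examples are crafted against the first, and both models are evaluated on them. The data, architectures, and epsilon are assumptions chosen for the sketch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(2000, 20)
y = (X[:, :10].sum(dim=1) > 0).long()     # synthetic two-class task

def train(seed):
    torch.manual_seed(seed)               # different init per model
    m = torch.nn.Sequential(torch.nn.Linear(20, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 2))
    opt = torch.optim.Adam(m.parameters(), lr=1e-2)
    for _ in range(300):
        opt.zero_grad()
        F.cross_entropy(m(X), y).backward()
        opt.step()
    return m

model_a, model_b = train(1), train(2)     # same task, different weights

# Craft FGSM examples against model_a only.
x = X.clone().requires_grad_(True)
F.cross_entropy(model_a(x), y).backward()
x_adv = (x + 0.5 * x.grad.sign()).detach()

# model_b never saw these examples, yet its accuracy typically drops too.
for name, m in (("model_a", model_a), ("model_b", model_b)):
    acc = (m(x_adv).argmax(dim=1) == y).float().mean().item()
    print(f"{name} accuracy on model_a's adversarial examples: {acc:.2f}")
```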

  • Key defense mechanisms:

    • Input sanitization: Validate and clean user inputs before processing
    • Adversarial training: Include adversarial examples in the training data to build robustness (see the sketch after this list)
    • Defense in depth: Combine multiple defense methods, since no single approach is completely effective
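
A minimal sketch of adversarial training, assuming a toy model and synthetic batches: each batch is augmented on the fly with its own FGSM perturbations, so the model is optimized to classify both clean and attacked inputs:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Linear(20, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
epsilon = 0.1                              # perturbation budget (assumed)

def fgsm(x, y):
    # One gradient step on the input, in the direction that increases loss.
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + epsilon * x.grad.sign()).detach()

for step in range(100):
    x = torch.randn(32, 20)                # stand-in training batch
    y = torch.randint(0, 10, (32,))
    x_adv = fgsm(x, y)                     # adversarial counterpart of the batch

    # Optimize on clean and adversarial examples together.
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    optimizer.zero_grad()                  # also clears grads left by fgsm()
    loss.backward()
    optimizer.step()
```
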
  • Generating adversarial attacks often involves:

    • Leveraging gradient descent
    • Finding minimal input changes that maximize loss function
    • Creating imperceptible perturbations that cause misclassification (see the sketch after this list)
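
One way to sketch that recipe is an iterative gradient-sign attack, often called projected gradient descent (PGD): repeatedly step the input in the direction that increases the loss, then clip the change back into a small epsilon-ball so it stays imperceptible. The model, input, and hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(20, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 10))

x_orig = torch.randn(1, 20)
y = model(x_orig).argmax(dim=-1)           # class we want the model to abandon

epsilon, step_size, steps = 0.1, 0.02, 20  # assumed attack hyperparameters
x = x_orig.clone()
for _ in range(steps):
    x = x.detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)    # loss on the current label
    loss.backward()
    x = x + step_size * x.grad.sign()      # ascend the loss surface
    # Project back into the epsilon-ball so the change stays tiny.
    x = x_orig + (x - x_orig).clamp(-epsilon, epsilon)

print("max perturbation:", (x - x_orig).abs().max().item())
print("prediction flipped:", model(x).argmax(dim=-1).item() != y.item())
```
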
  • Adversarial attacks are particularly concerning for LLMs and production systems in critical domains like finance, where incorrect predictions could have significant consequences

  • Although adversarial examples can be computationally expensive to generate, they pose a serious security risk because motivated attackers can exploit these vulnerabilities

  • Open source models are especially vulnerable because their architectures and parameters are publicly available, which enables white box attacks

  • Successful attacks can occur through subtle changes, such as adding imperceptible characters or changing a single word while preserving semantic meaning (see the sketch below)
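
To make that last point concrete, here is a small sketch of the character-level trick: inserting a zero-width space leaves the text looking identical to a human reader but changes the string the model receives. The example sentence is invented, and the final lines show the input-sanitization countermeasure from the defense list above:

```python
import unicodedata

original = "transfer the funds to account 4821"
# Insert a zero-width space inside a key word: invisible when rendered,
# but the string the model receives is no longer the same.
attacked = original.replace("funds", "fu\u200bnds")

print(attacked)                       # displays identically to the original
print(original == attacked)          # False: the model sees a different input
print(len(original), len(attacked))  # the hidden character changes the length

# One input-sanitization countermeasure: strip invisible "format" characters
# (Unicode category Cf covers zero-width spaces, joiners, BOMs, etc.).
cleaned = "".join(ch for ch in attacked if unicodedata.category(ch) != "Cf")
print(cleaned == original)           # True once sanitized
```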