Benoit Hamelin - Vector space embeddings and data maps for cyber defense | SciPy 2024

Learn how vector space embeddings and data maps support cyber defense through unsupervised analysis of system telemetry, enabling threat detection and anomaly analysis.

Key takeaways
  • The Vectorizers library provides a theoretically well-founded, TF-IDF-style approach for turning data into vectors, which is particularly useful for analyzing cyber defense telemetry

  • Command line telemetry provides valuable insight into system behavior, with ~30,000 command lines expressed over a ~20,000-dimensional vocabulary space

  • Wasserstein vectorization improves upon naive bag-of-words approaches by preserving local similarity structure and handling token co-occurrence relationships

  • UMAP compresses high-dimensional vectors into 2D representations for visualization while preserving the important relationship structure

  • Multiple “lenses” (perspectives) are needed to properly analyze IT infrastructure, including:
    • Command line analysis
    • Shared code library analysis
    • Process relationships
    • Data access patterns

  • Interactive data maps enable exploration and labeling of clusters for anomaly detection and threat analysis

  • The approach uses scikit-learn-style APIs and idioms, which makes it familiar and easy to adopt

  • This vectorization approach does not replace existing NLP models, but it is competitive with BERT for certain applications and is more computationally efficient

  • The key focus is on unsupervised analysis, since labeled data for cyber security is limited

  • The technique helps identify baseline behaviors and potential anomalies in system activity through clustering and similarity analysis
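
The takeaways above (vectorize command lines, compare them by similarity, flag what sits far from the baseline) can be illustrated with a minimal sketch. This is not the Vectorizers library API and not the speaker's code: it swaps in plain TF-IDF written with the Python standard library, it uses brute-force nearest-neighbor similarity in place of UMAP and clustering, and it does not implement Wasserstein vectorization. The class name TinyTfidf, the helper functions, and the six command lines are all hypothetical and exist only for illustration.

```python
import math
from collections import Counter


class TinyTfidf:
    """Toy TF-IDF vectorizer with scikit-learn style fit/transform (illustrative only)."""

    def fit(self, docs):
        tokenized = [d.split() for d in docs]  # whitespace tokenization, an assumption
        tokens = sorted({t for d in tokenized for t in d})
        self.vocabulary_ = {t: i for i, t in enumerate(tokens)}
        n_docs = len(tokenized)
        doc_freq = Counter(t for d in tokenized for t in set(d))
        # Smoothed IDF: tokens seen in fewer command lines get larger weights.
        self.idf_ = {
            t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in tokens
        }
        return self

    def transform(self, docs):
        rows = []
        for d in docs:
            counts = Counter(t for t in d.split() if t in self.vocabulary_)
            row = {self.vocabulary_[t]: c * self.idf_[t] for t, c in counts.items()}
            norm = math.sqrt(sum(v * v for v in row.values())) or 1.0
            # Sparse, L2-normalized row: {column index: weight}
            rows.append({i: v / norm for i, v in row.items()})
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse rows."""
    return sum(v * b.get(i, 0.0) for i, v in a.items())


def novelty_scores(vectors):
    """Score each vector as 1 minus its similarity to the nearest other vector."""
    scores = []
    for i, v in enumerate(vectors):
        nearest = max(
            (cosine(v, w) for j, w in enumerate(vectors) if j != i), default=0.0
        )
        scores.append(1.0 - nearest)
    return scores


# Made-up command lines standing in for real telemetry.
command_lines = [
    "svchost.exe -k netsvcs -p",
    "svchost.exe -k netsvcs -p -s Schedule",
    "svchost.exe -k LocalService -p",
    "chrome.exe --type=renderer --lang=en-US",
    "chrome.exe --type=gpu-process --lang=en-US",
    "powershell.exe -nop -w hidden -enc <payload>",
]

vectors = TinyTfidf().fit_transform(command_lines)
scores = novelty_scores(vectors)
most_unusual = command_lines[scores.index(max(scores))]
print(most_unusual)
```

In this toy data the PowerShell line shares no tokens with any other command line, so it gets the highest novelty score. At the scale mentioned in the talk (tens of thousands of command lines and dimensions), sparse matrices and an approximate nearest-neighbor or UMAP-based workflow would replace the brute-force loop.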