Benoit Hamelin - Vector space embeddings and data maps for cyber defense | SciPy 2024
Learn how vector space embeddings and data maps can revolutionize cyber defense through unsupervised analysis of system telemetry, enabling threat detection and anomaly analysis.
- The Vectorizers library provides a theoretically well-founded TF-IDF approach for turning data into vectors, which is particularly useful for analyzing cyber defense telemetry
- Command line telemetry provides valuable insights into system behavior, with ~30,000 command lines expressed over a ~20,000-dimensional vocabulary space
- Wasserstein vectorization improves upon naive bag-of-words approaches by preserving local similarity structure and handling token co-occurrence relationships
- UMAP visualization helps compress high-dimensional vectors into 2D representations while maintaining important relationship structures
- Multiple “lenses” (perspectives) are needed to properly analyze IT infrastructure, including:
  - Command line analysis
  - Shared code library analysis
  - Process relationships
  - Data access patterns
- Interactive data maps enable exploration and labeling of clusters for anomaly detection and threat analysis
- The approach uses scikit-learn-style APIs and idioms for familiarity and ease of implementation
- Although it does not replace existing NLP models, this vectorization approach is competitive with BERT for certain applications and is more computationally efficient
- The key focus is on unsupervised analysis, since labeled data for cyber security is limited
- The technique helps identify baseline behaviors and potential anomalies in system activity through clustering and similarity analysis
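
To make the idea of vectorizing command lines and comparing them by similarity concrete, here is a minimal sketch in standard-library Python. It is not the Vectorizers library API and it does not implement Wasserstein vectorization or UMAP. It uses plain smoothed TF-IDF with cosine similarity as a stand-in. The command lines are invented for illustration, and every function name (`tokenize`, `tfidf_vectors`, `cosine`, `nearest_similarity`) is hypothetical.

```python
import math
from collections import Counter

def tokenize(command_line):
    """Split a command line into lowercase whitespace-separated tokens."""
    return command_line.lower().split()

def tfidf_vectors(command_lines):
    """Return one L2-normalized sparse TF-IDF vector (a dict) per command line."""
    docs = [Counter(tokenize(c)) for c in command_lines]
    n_docs = len(docs)
    # Document frequency: number of command lines containing each token.
    doc_freq = Counter(tok for doc in docs for tok in doc)
    # Smoothed inverse document frequency, so no token gets zero weight.
    idf = {tok: math.log((1 + n_docs) / (1 + k)) + 1.0
           for tok, k in doc_freq.items()}
    vectors = []
    for doc in docs:
        vec = {tok: count * idf[tok] for tok, count in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({tok: w / norm for tok, w in vec.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())

def nearest_similarity(vectors):
    """For each vector, the highest cosine similarity to any other vector."""
    scores = []
    for i, u in enumerate(vectors):
        best = max((cosine(u, v) for j, v in enumerate(vectors) if j != i),
                   default=0.0)
        scores.append(best)
    return scores

# Made-up command lines; real telemetry would hold tens of thousands.
COMMANDS = [
    "ls -la /home/alice",
    "ls -la /home/bob",
    "git status",
    "git pull origin main",
    "powershell -enc SQBFAFgA -nop -w hidden",
]

vectors = tfidf_vectors(COMMANDS)
scores = nearest_similarity(vectors)
# The command least similar to everything else is a candidate anomaly.
outlier = COMMANDS[min(range(len(COMMANDS)), key=scores.__getitem__)]
print(outlier)
```

In this toy data the `ls` and `git` commands each have a near neighbor, which plays the role of a baseline, while the PowerShell command shares no tokens with the rest and is surfaced as the outlier. Bag-of-words vectors like these ignore relationships between distinct tokens, which is the limitation the talk says Wasserstein vectorization addresses.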