Benoit Hamelin - Vector space embeddings and data maps for cyber defense | SciPy 2024
Learn how vector space embeddings and data maps can revolutionize cyber defense through unsupervised analysis of system telemetry, enabling threat detection and anomaly analysis.
- The Vectorizers library provides a theoretically well-founded TF-IDF approach for turning data into vectors, which is particularly useful for analyzing cyber defense telemetry
- Command-line telemetry provides valuable insight into system behavior, with ~30,000 command lines expressed over a vocabulary space of ~20,000 dimensions
- Wasserstein vectorization improves on naive bag-of-words approaches by preserving local similarity structure and accounting for token co-occurrence relationships
- UMAP compresses the high-dimensional vectors into 2D representations for visualization while maintaining important relationship structures
- Multiple “lenses” (perspectives) are needed to properly analyze IT infrastructure, including:
  - Command line analysis
  - Shared code library analysis
  - Process relationships
  - Data access patterns
- Interactive data maps enable exploration and labeling of clusters for anomaly detection and threat analysis
- The approach uses scikit-learn style APIs and idioms for familiarity and ease of implementation
- Although it does not replace existing NLP models, this vectorization approach is competitive with BERT for certain applications and is more computationally efficient
- The key focus is on unsupervised analysis, since labeled data for cybersecurity is limited
- The technique helps identify baseline behaviors and potential anomalies in system activity through clustering and similarity analysis
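
As a rough illustration of the points above, the sketch below turns a few command lines into TF-IDF vectors behind a scikit-learn style `fit`/`transform` interface, then flags the command line least similar to any other. It is not the Vectorizers library's API or the method shown in the talk. The class `TinyTfidfVectorizer`, the function `anomaly_scores`, the whitespace tokenization, and the sample command lines are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


class TinyTfidfVectorizer:
    """Hypothetical minimal TF-IDF vectorizer with scikit-learn style methods."""

    def fit(self, documents):
        # Assumption: tokens are whitespace-separated pieces of a command line.
        tokenized = [doc.split() for doc in documents]
        n_docs = len(tokenized)
        # Document frequency: number of command lines containing each token.
        doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))
        self.vocabulary_ = {tok: i for i, tok in enumerate(sorted(doc_freq))}
        # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
        self.idf_ = {
            tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()
        }
        return self

    def transform(self, documents):
        vectors = []
        for doc in documents:
            # Tokens unseen during fit are ignored.
            counts = Counter(t for t in doc.split() if t in self.vocabulary_)
            vec = [0.0] * len(self.vocabulary_)
            for tok, count in counts.items():
                vec[self.vocabulary_[tok]] = count * self.idf_[tok]
            # L2-normalize so that a dot product equals cosine similarity.
            norm = math.sqrt(sum(x * x for x in vec))
            vectors.append([x / norm for x in vec] if norm > 0 else vec)
        return vectors

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)


def cosine_similarity(u, v):
    # Valid as a cosine only because the vectors are already L2-normalized.
    return sum(a * b for a, b in zip(u, v))


def anomaly_scores(vectors):
    """Score each vector as 1 minus its similarity to the nearest other vector."""
    scores = []
    for i, u in enumerate(vectors):
        nearest = max(
            cosine_similarity(u, v) for j, v in enumerate(vectors) if j != i
        )
        scores.append(1.0 - nearest)
    return scores


# Made-up command lines, for illustration only.
command_lines = [
    "python train.py --epochs 10",
    "python train.py --epochs 20",
    "python eval.py --epochs 10",
    "powershell -enc SQBFAFgA -nop -w hidden",
]

vectorizer = TinyTfidfVectorizer()
vectors = vectorizer.fit_transform(command_lines)
scores = anomaly_scores(vectors)
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
print(command_lines[most_anomalous])
```

A nearest-neighbor score like this is only one simple stand-in for the clustering and similarity analysis described above. On real telemetry, with tens of thousands of command lines, sparse matrices and an approximate nearest-neighbor index would likely be needed instead of the quadratic loop used here.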