Understanding Probabilistic Data Structures with 112,092 UFO Sightings - Guy Royse - NDC London 2023

Explore probabilistic data structures for accuracy and performance trade-offs in real-time problem-solving with UFO sighting examples.

Key takeaways
  • Probabilistic data structures are not accurate, but can be fast and space-efficient.
  • Bloom filters are cool because they can give you a fast “probable” answer.
  • You can add items to Bloom filters, and then ask if an item is in the filter without checking all the items.
  • If an item is not in the filter, it’s definitely not in the filter.
  • If an item is in the filter, it’s probably in the filter.
  • Bloom filters are O(1) for reads and O(n) for updates.
  • Redis has Bloom filters, as well as other probabilistic data structures like CountMinSketch and HyperLogLog.
  • CountMinSketch is good for counting frequent items.
  • HyperLogLog is good for counting unique items.
  • MinHash is a neat tool for finding similar sets of items.
  • Similarity can be calculated using Jaccard similarity and union and intersection operations.
  • You can use probabilistic data structures to track trending topics, count unique users, and more.
  • Deterministic data structures are accurate but slow.
  • Probabilistic data structures are fast, but may not always be accurate.
  • The author’s favorite photo in the talk is a photo of a UFO sighting.