Unconference - Stream 2 - 13:15-14:15 - PGCon 2022

"Discover the world of data sketches and learn how to estimate selectivity, count distinct values, and balance accuracy and space efficiency for your datasets, exploring options like HyperLogLog, Bitmaps, and bloom filters."

Key takeaways
  • As the sample size increases, the estimates become more accurate, but samples may not be representative for distributed tables.
  • Data sketches can be used to store combined statistics for different columns, allowing for more accurate estimation of selectivity.
  • HyperLogLog (HLL) can be used to estimate the number of distinct values in a set; unions merge losslessly, but it has limitations for more complex operations such as intersections (a minimal HLL sketch follows this list).
  • Bitmaps can be used to combine sketches and store the results, with space that grows linearly with the number of samples (a bitmap-based sketch follows this list).
  • Thomas proposed counting Bloom filters, which can solve specific problems such as set intersection and set subtraction (sketched after this list).
  • The speaker suggests plain Bloom filters for simpler use cases, as they are easier to understand and implement (see the sketch after this list).
  • The aim is to store a single sketch per table that can be used to estimate selectivity for different columns.
  • Other sketches, such as HyperLogLog, can provide more accurate estimates at the cost of increased space.
  • The speaker believes that using simpler sketches, like bitmaps, could be a good starting point.
  • The goal is to find a sketch that provides a good balance between accuracy and space efficiency.
  • Deterministic sampling may not be necessary when the sampling rate is low; random sampling can still produce accurate results.
  • The speaker suggests exploring other sketches, such as counting Bloom filters, to find the best solution for the problem.
  • The space requirements for storing sketches are important, especially for large datasets.
  • The speaker concludes that there are many different sketches to choose from, and the best solution will depend on the specific use case.
  • Other important factors to consider include the structure of the data, the size of the dataset, and the trade-off between accuracy and space efficiency.
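
Illustrative sketches

The Python snippets below are illustrations of the techniques discussed, not code from the session; class names, parameters, and hash choices are assumptions of this write-up.

First, a minimal HyperLogLog: each value hashes to one of m = 2^p registers, each of which records the longest run of leading zero bits it has seen. The standard estimator has a typical relative error of about 1.04/sqrt(m).

    import hashlib
    import math

    class HyperLogLog:
        def __init__(self, p=10):
            self.p = p
            self.m = 1 << p                 # number of registers
            self.registers = [0] * self.m

        def _hash(self, value):
            # 64-bit hash derived from SHA-1; any well-mixed hash works.
            return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")

        def add(self, value):
            h = self._hash(value)
            idx = h >> (64 - self.p)                      # top p bits pick a register
            rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
            rank = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
            self.registers[idx] = max(self.registers[idx], rank)

        def estimate(self):
            alpha = 0.7213 / (1 + 1.079 / self.m)         # bias correction for m >= 128
            raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
            zeros = self.registers.count(0)
            if raw <= 2.5 * self.m and zeros:             # small-range correction
                return self.m * math.log(self.m / zeros)
            return raw

    hll = HyperLogLog()
    for i in range(100_000):
        hll.add(f"user-{i}")
    print(round(hll.estimate()))    # roughly 100000, within a few percent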
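
One plausible reading of the bitmap option is linear counting: set bit hash(x) mod m in an m-bit bitmap and estimate the distinct count from the fraction of zero bits. Bitmaps built on different nodes or partitions combine with a bitwise OR, which fits the "combine sketches and store the results" point above.

    import hashlib
    import math

    class BitmapSketch:
        def __init__(self, m=1 << 16):
            self.m = m                       # bitmap size in bits
            self.bits = bytearray(m // 8)

        def _pos(self, value):
            h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
            return h % self.m

        def add(self, value):
            pos = self._pos(value)
            self.bits[pos // 8] |= 1 << (pos % 8)

        def estimate(self):
            ones = sum(bin(b).count("1") for b in self.bits)
            zeros = self.m - ones
            if zeros == 0:
                return float("inf")          # bitmap saturated; m was too small
            return -self.m * math.log(zeros / self.m)   # linear-counting estimator

        def union(self, other):
            out = BitmapSketch(self.m)
            out.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))
            return out

    a, b = BitmapSketch(), BitmapSketch()
    for i in range(30_000):
        a.add(i)
    for i in range(20_000, 50_000):
        b.add(i)
    print(round(a.union(b).estimate()))      # roughly 50000 distinct values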
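
Next, a plain Bloom filter, the "simpler use case" option: k hashed bit positions per element give fast membership tests with false positives but no false negatives. Deriving the k positions from two halves of one digest (Kirsch-Mitzenmacher double hashing) is an implementation choice here, not something from the talk.

    import hashlib

    class BloomFilter:
        def __init__(self, m=1 << 16, k=4):
            self.m, self.k = m, k
            self.bits = bytearray(m // 8)

        def _positions(self, value):
            d = hashlib.sha1(str(value).encode()).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:16], "big")
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, value):
            for pos in self._positions(value):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, value):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(value))

    bf = BloomFilter()
    bf.add("alice")
    print(bf.might_contain("alice"))   # True
    print(bf.might_contain("bob"))     # False (with high probability)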
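
Finally, the counting Bloom filter idea attributed to Thomas: replacing each bit with a small counter allows elements to be removed again, which is what makes set subtraction workable. The structure below is my sketch of that idea, not code from the session.

    import hashlib

    class CountingBloomFilter:
        def __init__(self, m=1 << 16, k=4):
            self.m, self.k = m, k
            self.counters = [0] * m          # small counters instead of single bits

        def _positions(self, value):
            d = hashlib.sha1(str(value).encode()).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:16], "big")
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, value):
            for pos in self._positions(value):
                self.counters[pos] += 1

        def remove(self, value):
            # Only safe for values previously added; otherwise counts skew low.
            for pos in self._positions(value):
                if self.counters[pos] > 0:
                    self.counters[pos] -= 1

        def might_contain(self, value):
            return all(self.counters[pos] > 0 for pos in self._positions(value))

    cbf = CountingBloomFilter()
    cbf.add("alice"); cbf.add("bob")
    cbf.remove("bob")                  # deletion enables set subtraction
    print(cbf.might_contain("alice"), cbf.might_contain("bob"))  # True False (w.h.p.)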