Entity Resolution at Scale • Huon Wilson • YOW! 2019

Learn how P-SIG, a probabilistic signature-based entity resolution algorithm, matches records at scale with Apache Spark, and discover how to optimize it for precision or recall in this YOW! 2019 conference talk.

Key takeaways
  • Entity resolution is a technique used to match records that refer to the same real-world entity, such as people or businesses.
  • Naive approaches to entity resolution compare every pair of records, so the work grows quadratically with the number of records and becomes impractical for large datasets.
  • P-SIG is a probabilistic signature-based entity resolution algorithm that uses micro-blocking and candidate signatures to improve performance.
  • Micro-blocking allows for the creation of smaller sets of records that can be processed independently, reducing the number of comparisons needed.
  • Candidate signatures are generated from the data and used to narrow down the search space for potential matches.
  • P-SIG can be optimized for precision or recall, depending on the specific use case.
  • The algorithm can be implemented using Apache Spark, which provides a scalable and efficient way to process large datasets.
  • Spark DataFrames represent data in a partitioned, parallelizable form, which can improve performance.
  • Profiling and optimization techniques can be used to improve the performance of the algorithm.
  • The algorithm can be applied to a variety of datasets, including those with missing or corrupt data.
  • Choosing the match threshold carefully is important, as it shifts the balance between precision and recall.
  • The algorithm can be used to resolve duplicates at scale, across datasets of varying size and complexity.
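
To make the takeaways on signatures, micro-blocking and thresholds more concrete, here is a minimal sketch in plain Python. It is not the P-SIG implementation from the talk: the field names (`name`, `postcode`), the two signature rules, and the helpers `signatures`, `candidate_pairs` and `resolve` are all hypothetical, and the score is simply the number of shared signatures rather than P-SIG's probabilistic weighting. The idea it illustrates is that records are only compared when they share at least one signature, so most of the naive all-pairs comparisons never happen.

```python
from collections import defaultdict
from itertools import combinations


def signatures(record):
    """Return hypothetical blocking signatures for one record.

    Missing or empty fields simply contribute no signature, so
    incomplete records can still match on whatever they do have.
    """
    sigs = set()
    name = (record.get("name") or "").lower().split()
    postcode = record.get("postcode") or ""
    if name:
        # Signature 1 (assumed rule): the last token of the name.
        sigs.add(("surname", name[-1]))
    if name and postcode:
        # Signature 2 (assumed rule): first initial plus postcode.
        sigs.add(("initial+postcode", name[0][0] + postcode))
    return sigs


def candidate_pairs(records):
    """Group record ids by signature and count shared signatures per pair."""
    blocks = defaultdict(list)
    for rid, record in records.items():
        for sig in signatures(record):
            blocks[sig].append(rid)

    shared = defaultdict(int)
    for ids in blocks.values():
        # Pairs are only formed inside a block, never across blocks.
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return dict(shared)


def resolve(records, threshold):
    """Keep pairs sharing at least `threshold` signatures (a stand-in score)."""
    pairs = candidate_pairs(records)
    return sorted(pair for pair, count in pairs.items() if count >= threshold)


# Made-up sample data; record 4 has a missing postcode.
records = {
    1: {"name": "Jane Smith", "postcode": "2000"},
    2: {"name": "J. Smith", "postcode": "2000"},
    3: {"name": "John Smith", "postcode": "3000"},
    4: {"name": "Alice Jones", "postcode": None},
}

print(resolve(records, threshold=1))  # [(1, 2), (1, 3), (2, 3)]
print(resolve(records, threshold=2))  # [(1, 2)]
```

With four records a naive approach would score six pairs, while the blocks above produce only three candidates. Raising the threshold from 1 to 2 keeps only the pair that agrees on both signatures, which favours precision; lowering it keeps more candidates, which favours recall.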