Entity Resolution at Scale • Huon Wilson • YOW! 2019

Learn how P-SIG, a probabilistic signature-based entity resolution algorithm, matches records at scale with Apache Spark, and discover how to optimize it for precision or recall in this YOW! 2019 conference talk.

Key takeaways

Entity resolution is a technique used to match records that refer to the same real-world entity, such as people or businesses.
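Naive approaches to entity resolution compare every record with every other record, so the number of comparisons grows quadratically with the size of the dataset and quickly becomes impractical for large datasets.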
P-SIG is a probabilistic signature-based entity resolution algorithm that uses micro-blocking and candidate signatures to improve performance.
Micro-blocking allows for the creation of smaller sets of records that can be processed independently, reducing the number of comparisons needed.
Candidate signatures are generated from the data and used to narrow down the search space for potential matches.
P-SIG can be optimized for precision or recall, depending on the specific use case.
The algorithm can be implemented using Apache Spark, which provides a scalable and efficient way to process large datasets.
Spark DataFrames represent data in a partitioned form that can be processed in parallel, which can improve performance.
Profiling and optimization techniques can be used to improve the performance of the algorithm.
The algorithm can be applied to a variety of datasets, including those with missing or corrupt data.
It is important to choose the right matching threshold: a stricter threshold generally favors precision, while a looser one generally favors recall.
The algorithm can be used to resolve duplicates at scale, and can be applied to datasets of various sizes and complexity.
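
The takeaways about signatures, reduced comparisons, and thresholds can be illustrated with a small sketch. This is not the P-SIG implementation from the talk and it does not use Spark; it is a simplified, single-machine illustration of signature-based candidate generation in plain Python. The record data, the signature scheme (first letter of one name token joined with another full token), and the function names signatures, candidate_pairs, and resolve are all hypothetical, and the shared-signature count stands in for whatever probabilistic scoring the real algorithm uses.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records (id -> name); not data from the talk.
RECORDS = {
    1: "John Smith",
    2: "Jon Smith",
    3: "Mary Jones",
    4: "Smith John",
    5: "Mary Jonas",
}


def signatures(name):
    """Return illustrative signatures for a name.

    Assumed scheme: the first letter of one token joined with
    another full token, e.g. "John Smith" -> {"j|smith", "s|john"}.
    """
    tokens = name.lower().split()
    sigs = set()
    for i, tok in enumerate(tokens):
        for j, other in enumerate(tokens):
            if i != j:
                sigs.add(f"{tok[0]}|{other}")
    return sigs


def candidate_pairs(records):
    """Group records by signature and count shared signatures per pair.

    Only records that share at least one signature are ever paired,
    which avoids comparing every record with every other record.
    """
    blocks = defaultdict(set)
    for rid, name in records.items():
        for sig in signatures(name):
            blocks[sig].add(rid)

    shared = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return dict(shared)


def resolve(records, threshold):
    """Keep pairs that share at least `threshold` signatures.

    A simple count stands in for a real match score here.
    """
    return {pair for pair, n in candidate_pairs(records).items() if n >= threshold}


if __name__ == "__main__":
    n = len(RECORDS)
    print("naive comparisons:", n * (n - 1) // 2)
    print("candidate pairs:", len(candidate_pairs(RECORDS)))
    print("threshold 1:", sorted(resolve(RECORDS, 1)))
    print("threshold 2:", sorted(resolve(RECORDS, 2)))
```

In this toy example, a threshold of 1 keeps every pair that shares any signature (leaning toward recall), while a threshold of 2 keeps only pairs with stronger evidence (leaning toward precision). In a Spark setting, the grouping by signature would presumably be expressed as a group-by or join over a DataFrame rather than an in-memory dictionary, but that mapping is an assumption rather than something the summary spells out.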
