Entity Resolution at Scale • Huon Wilson • YOW! 2019
Learn how P-SIG, a probabilistic signature-based entity resolution algorithm, matches records at scale with Apache Spark, and discover how to optimize it for precision or recall in this YOW! 2019 conference talk.
- Entity resolution is a technique used to match records that refer to the same real-world entity, such as people or businesses.
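- Naive approaches to entity resolution, such as comparing every record against every other record, require a number of comparisons that grows quadratically with the number of records, which becomes impractical for large datasets.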
- P-SIG is a probabilistic signature-based entity resolution algorithm that uses micro-blocking and candidate signatures to improve performance.
- Micro-blocking allows for the creation of smaller sets of records that can be processed independently, reducing the number of comparisons needed.
- Candidate signatures are generated from the data and used to narrow down the search space for potential matches.
- P-SIG can be optimized for precision or recall, depending on the specific use case.
- The algorithm can be implemented using Apache Spark, which provides a scalable and efficient way to process large datasets.
- Spark DataFrames represent data in a partitioned form that can be processed in parallel, which can improve performance.
- Profiling and optimization techniques can be used to improve the performance of the algorithm.
- The algorithm can be applied to a variety of datasets, including those with missing or corrupt data.
- Choosing the right threshold for the algorithm is important, as it affects the quality of the results, such as the balance between precision and recall.
- The algorithm can be used to resolve duplicates at scale and can be applied to datasets of various sizes and complexity.
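
To make the ideas of candidate signatures, micro-blocking, and thresholds more concrete, here is a minimal Python sketch. It is not the P-SIG implementation from the talk: the toy records, the two signature definitions, the shared-signature count used as a score, and the `max_block_size` cutoff are all assumptions chosen for illustration, and it runs in plain Python rather than on Apache Spark.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (id, name, postcode). Not data from the talk.
RECORDS = [
    (1, "Jane Smith", "2000"),
    (2, "jane smith", "2000"),
    (3, "J. Smith", "2000"),
    (4, "Robert Jones", "3000"),
    (5, "Bob Jones", "3000"),
    (6, "Alice Wong", "4000"),
]


def signatures(name, postcode):
    """Return the candidate signatures for one record (illustrative choices)."""
    tokens = name.lower().replace(".", "").split()
    sigs = set()
    if tokens:
        surname = tokens[-1]
        # Signature 1: surname plus postcode.
        sigs.add(("surname_postcode", surname, postcode))
        # Signature 2: first initial plus surname.
        sigs.add(("initial_surname", tokens[0][0], surname))
    return sigs


def build_blocks(records):
    """Group record ids by signature; each group is one small block."""
    blocks = defaultdict(set)
    for rec_id, name, postcode in records:
        for sig in signatures(name, postcode):
            blocks[sig].add(rec_id)
    return blocks


def score_pairs(blocks, max_block_size=100):
    """Count how many signatures each candidate pair shares.

    Only pairs inside the same block are compared. Blocks larger than
    max_block_size are skipped (an assumed heuristic) because a signature
    shared by many records says little about identity.
    """
    scores = defaultdict(int)
    for ids in blocks.values():
        if len(ids) < 2 or len(ids) > max_block_size:
            continue
        for a, b in combinations(sorted(ids), 2):
            scores[(a, b)] += 1
    return scores


def resolve(records, threshold):
    """Return the pairs whose shared-signature count meets the threshold."""
    scores = score_pairs(build_blocks(records))
    return {pair for pair, count in scores.items() if count >= threshold}


if __name__ == "__main__":
    n = len(RECORDS)
    print("all-pairs comparisons:", n * (n - 1) // 2)
    print("candidate pairs:", len(score_pairs(build_blocks(RECORDS))))
    print("threshold=1:", sorted(resolve(RECORDS, 1)))
    print("threshold=2:", sorted(resolve(RECORDS, 2)))
```

In this toy data, a threshold of 1 also links records 4 and 5 ("Robert Jones" and "Bob Jones"), which share only the surname-and-postcode signature, while a threshold of 2 keeps only the three "Smith" records that agree on both signatures. That is the kind of trade-off the threshold controls: a lower value favours recall, a higher value favours precision.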