Jovan Stojanovic Machine learning with dirty tables: encoding, joining and deduplicating

Learn how to apply machine learning to dirty tables by encoding, joining, and deduplicating their contents. Discover the advantages of DirtyCat, a Python package designed to simplify these steps.

Key takeaways

Dirty data is a common problem in machine learning, but it’s often misunderstood.
There are many ways to encode dirty variables, and different methods may be better suited to different problems.
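One-hot encoding is a common method, but it creates one column per distinct value, so it can be slow and works poorly on high-cardinality columns, where every typo or spelling variant becomes its own category.
The similarity encoder generalizes one-hot encoding: it represents each value by its n-gram similarity to the known categories, so strings that look alike receive similar encodings.
The min hash encoder applies min-hash functions to the n-grams of each value, so strings that share many n-grams receive close encodings; it is stateless and scales to very large tables.
The gap encoder (short for Gamma-Poisson) factorizes n-gram counts to estimate latent topics and encodes each value by its activation on those topics, which keeps the resulting features interpretable.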
Dirty data can be anything in a table that a machine learning model cannot use directly, including typos, missing values, and non-normalized categorical data.
Fuzzy joins can be used to join tables with imprecise correspondences, such as keys that differ by typos or formatting.
The match_score parameter sets a similarity threshold for fuzzy joins: pairs that score below it are treated as unmatched, which helps keep only the most reliable joins.
The deduplicate function groups misspelled variants of the same category together and maps them to a single spelling.
The TableVectorizer class automatically encodes a table by choosing an encoder for each column, turning a heterogeneous dataframe into a feature matrix.
DirtyCat is an open-source Python package (published as dirty_cat) for encoding, joining, and deduplicating dirty tables.
The package includes a variety of encoding methods that complement one-hot encoding, including similarity encoding, min hash encoding, and gap encoding.
The package also includes a fuzzy join method, which can be used to join tables with imprecise correspondences.
DirtyCat is designed to be easy to use, and can be integrated into your machine learning pipeline with just a few lines of code.
The package is actively maintained, and new features and methods are being added regularly.
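
The n-gram similarity and match score ideas from the takeaways above can be sketched in plain Python. This is a minimal illustration using only the standard library, not DirtyCat's implementation: the function names (char_ngrams, ngram_similarity, similarity_encode, fuzzy_match), the trigram size, the space padding, and the Jaccard-style score are all assumptions chosen for the example.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(a, b, n=3):
    """Jaccard-style similarity between the n-gram multisets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    total = sum((grams_a | grams_b).values())   # multiset union
    return shared / total if total else 0.0


def similarity_encode(values, prototypes, n=3):
    """Encode each value as its similarity to every prototype category.

    With exact matches only, this reduces to one-hot encoding.
    """
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


def fuzzy_match(left_keys, right_keys, match_score=0.5, n=3):
    """Pair each left key with its most similar right key.

    Assumes right_keys is non-empty. Pairs scoring below match_score
    (a hypothetical threshold mirroring the idea in the takeaways)
    are reported as unmatched (None).
    """
    matches = []
    for key in left_keys:
        best = max(right_keys, key=lambda r: ngram_similarity(key, r, n))
        score = ngram_similarity(key, best, n)
        matches.append((best if score >= match_score else None, score))
    return matches


if __name__ == "__main__":
    jobs = ["police officer", "polise oficer"]
    print(similarity_encode(jobs, ["police officer", "firefighter"]))

    countries = ["United States", "France", "Germany"]
    print(fuzzy_match(["Untied States", "zzzz"], countries, match_score=0.3))
```

In this sketch the misspelled "polise oficer" still lands closer to "police officer" than to "firefighter", and "Untied States" is matched to "United States", while a key with no plausible counterpart is left unmatched instead of being joined to an arbitrary row.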
