PyData Yerevan April 2022 Meetup | Data Standardization: The What, The Why and The How

Data standardization is crucial for e-commerce: it enables relevant search results and a better user experience. In this PyData Yerevan talk, the speaker shares an approach to standardizing product data and practical insights on how to achieve it.

Key takeaways
  • Data standardization is important in e-commerce to enable relevant search results and improve user experience.
  • Currently, there is no standard way to normalize product names, leading to issues like irrelevant search results.
  • A hierarchical approach to data standardization is proposed, focusing on product names, synonyms, and antonyms.
  • NLP techniques such as tokenization and named entity recognition (NER) can be used to identify and standardize product names.
  • Embeddings (e.g., word2vec) can be used to capture semantic relationships between words and improve matching.
  • The speaker's team developed a model that generates negative training pairs ("antonyms") by swapping tokens between product names.
  • Data preprocessing involves filtering out irrelevant data and separating synonyms and antonyms.
  • A neural network architecture such as a siamese network can be used to compare similar products and identify matches.
  • Training a model on a large dataset can be challenging, especially when dealing with ambiguous data and limited labeled data.
  • Collecting labeled data through manual verification or using crowdsourcing platforms can improve the accuracy of the model.
  • Standardizing data can also involve using ontologies and taxonomies to categorize and describe products.
  • The speaker encourages data collectors to invest in proper data collection, labeling, and preprocessing to ensure high-quality datasets.
  • Data standardization is important for building scalable and efficient AI models that can provide accurate search results.
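The embedding-based matching mentioned in the takeaways can be illustrated with cosine similarity over word vectors. This is a minimal sketch: the vectors below are hand-crafted toy values, not real word2vec output, and serve only to show how semantic closeness translates into a similarity score.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings (illustrative values only): semantically related
# words are given nearby vectors, as a trained word2vec model would.
embeddings = {
    "phone":      [0.90, 0.10, 0.00],
    "smartphone": [0.85, 0.15, 0.05],
    "sofa":       [0.00, 0.20, 0.95],
}

print(cosine_similarity(embeddings["phone"], embeddings["smartphone"]))  # close to 1
print(cosine_similarity(embeddings["phone"], embeddings["sofa"]))        # close to 0
```

In practice the vectors would come from a model trained on product text (e.g. gensim's Word2Vec), and a similarity threshold would decide whether two product-name tokens should be treated as synonyms.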
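The talk mentions generating negative pairs by swapping tokens. The exact method is not described, so the following is one plausible minimal sketch: swap a random token between two unrelated product names, producing corrupted names that no longer match the originals and can serve as hard negatives during training.

```python
import random

def make_negative_pair(name_a, name_b, rng=random):
    """Create a hard negative pair by swapping one randomly chosen token
    between two product names. The corrupted names keep realistic structure
    but should not match the originals."""
    tokens_a, tokens_b = name_a.split(), name_b.split()
    i = rng.randrange(len(tokens_a))
    j = rng.randrange(len(tokens_b))
    tokens_a[i], tokens_b[j] = tokens_b[j], tokens_a[i]
    return " ".join(tokens_a), " ".join(tokens_b)

# Hypothetical product names, for illustration only.
neg_a, neg_b = make_negative_pair(
    "iphone 13 pro 128gb", "galaxy s22 ultra 256gb", random.Random(1)
)
print(neg_a)
print(neg_b)
```

Note that only the overall idea (token swapping to build negatives) comes from the talk; the function name and sampling strategy here are assumptions.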
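The siamese comparison in the takeaways can be sketched without any deep-learning framework. The essential idea is that both product names pass through the *same* encoder and a distance between the resulting embeddings decides whether they match. Here the "encoder" is a trivial fixed bag-of-tokens vectorizer over a hypothetical mini-vocabulary; a real siamese network would use a learned neural encoder with shared weights, but the comparison logic is the same.

```python
import math

# Hypothetical mini-vocabulary for illustration; a real system would
# learn an encoder rather than enumerate tokens.
VOCAB = ["iphone", "13", "pro", "galaxy", "s22", "ultra", "128gb", "256gb"]

def encode(name):
    """Shared encoder: map a product name to a bag-of-tokens vector.
    Both sides of the siamese comparison must use this same function."""
    tokens = name.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]

def distance(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def is_match(name_a, name_b, threshold=0.5):
    """Siamese-style decision: encode both names with the shared encoder,
    then threshold the embedding distance."""
    return distance(encode(name_a), encode(name_b)) < threshold

print(is_match("iPhone 13 Pro", "iphone 13 pro"))   # same product, matches
print(is_match("iphone 13 pro", "galaxy s22 ultra"))  # different products
```

In a trained siamese network, the threshold and the encoder weights would be learned from the positive and negative pairs described above (e.g. with a contrastive or triplet loss).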