PyData Yerevan April 2022 Meetup | Data Standardization: The What, The Why and The How
Data standardization is crucial for e-commerce to deliver relevant search results and a smooth user experience, and in this meetup talk the PyData Yerevan team shares its approach and practical insights on how to achieve it.
- Data standardization is important in e-commerce to enable relevant search results and improve user experience.
- Currently, there is no standard way to normalize product names, leading to issues like irrelevant search results.
- A hierarchical approach to data standardization is proposed, focusing on product names, synonyms, and antonyms.
- Natural language processing techniques such as tokenization and named entity recognition can be used to identify and standardize product names (a minimal tokenization sketch follows this list).
- Embeddings (e.g., word2vec) can be used to capture semantic relationships between words and improve matching (see the word2vec example after the list).
- The PyData Yerevan team has developed a model that generates negative pairs by swapping tokens to create antonyms (a sketch of this idea appears below).
- Data preprocessing involves filtering out irrelevant data and separating synonyms and antonyms.
- A neural network architecture such as a Siamese network can be used to compare similar products and identify matches (a PyTorch sketch appears after the list).
- Training a model on a large dataset can be challenging, especially when dealing with ambiguous data and limited labeled data.
- Collecting labeled data through manual verification or using crowdsourcing platforms can improve the accuracy of the model.
- Standardizing data can also involve using ontologies and taxonomies to categorize and describe products (a toy taxonomy example closes this summary).
- The speaker encourages data collectors to invest in proper data collection, labeling, and preprocessing to ensure high-quality datasets.
- Data standardization is important for building scalable and efficient AI models that can provide accurate search results.
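The talk does not share its code, so the tokenization and entity-tagging step is shown here only as a minimal sketch. The brand list, unit list, and label names below are invented for illustration; a production pipeline would load curated vocabularies or use a trained NER model instead.

```python
import re

# Hypothetical vocabularies; a real pipeline would rely on curated brand and
# unit lists or a trained NER model rather than these hard-coded examples.
KNOWN_BRANDS = {"barilla", "nescafe", "pepsi"}
UNITS = {"g", "kg", "ml", "l", "pcs"}

def tokenize(name: str) -> list[str]:
    """Lowercase a raw product name and split it into word and number tokens."""
    return re.findall(r"[a-z]+|\d+(?:\.\d+)?", name.lower())

def tag_tokens(tokens: list[str]) -> list[tuple[str, str]]:
    """Attach a coarse entity label (BRAND / QUANTITY / UNIT / WORD) to each token."""
    tagged = []
    for tok in tokens:
        if tok in KNOWN_BRANDS:
            tagged.append((tok, "BRAND"))
        elif tok.replace(".", "", 1).isdigit():
            tagged.append((tok, "QUANTITY"))
        elif tok in UNITS:
            tagged.append((tok, "UNIT"))
        else:
            tagged.append((tok, "WORD"))
    return tagged

print(tag_tokens(tokenize("Barilla Spaghetti 500g")))
# [('barilla', 'BRAND'), ('spaghetti', 'WORD'), ('500', 'QUANTITY'), ('g', 'UNIT')]
```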
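Word-level embeddings can then be learned over the tokenized catalogue. The snippet below is a minimal gensim word2vec example on a toy corpus; the corpus, vector size, and training parameters are placeholders, not the settings used in the talk.

```python
from gensim.models import Word2Vec

# Toy corpus of tokenized product names; a real model would be trained on the
# full catalogue, so the similarities printed here are only illustrative.
corpus = [
    ["coca", "cola", "zero", "330", "ml"],
    ["coca", "cola", "classic", "330", "ml"],
    ["pepsi", "zero", "330", "ml"],
    ["barilla", "spaghetti", "500", "g"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=100)

# Cosine similarity between token vectors: tokens that appear in similar
# contexts (e.g., competing drink brands) should land closer together.
print(model.wv.similarity("cola", "pepsi"))
print(model.wv.similarity("cola", "spaghetti"))
```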
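The exact negative-pair generation used by the team is not shown in the summary; the sketch below is one plausible reading of "swapping tokens": take a positive pair (two names of the same product) and replace a token on one side with a token from a different product, producing a near-miss mismatch. The function and variable names are invented for illustration.

```python
import random

def make_negative(pair, token_pool, rng=random.Random(0)):
    """
    Turn a positive pair (two tokenized names of the same product) into a
    negative pair by swapping one token in the second name for a token taken
    from another product. The result is a hard negative: nearly identical
    text that must not be treated as a match.
    """
    left, right = list(pair[0]), list(pair[1])
    idx = rng.randrange(len(right))
    right[idx] = rng.choice([t for t in token_pool if t not in right])
    return left, right

positive = (["coca", "cola", "zero", "330", "ml"],
            ["coca", "cola", "zero", "0.33", "l"])
other_product_tokens = ["pepsi", "fanta", "500", "1.5"]
print(make_negative(positive, other_product_tokens))
```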
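For the matching model itself, a Siamese setup encodes both names with the same network and compares the resulting vectors. Below is a minimal PyTorch sketch using a GRU encoder and a cosine embedding loss; the architecture, dimensions, and loss choice are assumptions for illustration rather than the network described in the talk.

```python
import torch
import torch.nn as nn

class SiameseEncoder(nn.Module):
    """Shared encoder applied to both product names; sizes are illustrative."""
    def __init__(self, vocab_size: int = 10_000, embed_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); the final GRU state becomes the name vector
        _, last_hidden = self.rnn(self.embed(token_ids))
        return last_hidden.squeeze(0)

encoder = SiameseEncoder()
loss_fn = nn.CosineEmbeddingLoss(margin=0.2)   # pulls matches together, pushes mismatches apart
cosine = nn.CosineSimilarity(dim=1)            # similarity score used at inference time

# Dummy batch of 8 pairs: label +1 for a match, -1 for a non-match.
name_a = torch.randint(1, 10_000, (8, 12))
name_b = torch.randint(1, 10_000, (8, 12))
labels = torch.tensor([1, 1, -1, 1, -1, -1, 1, -1], dtype=torch.float)

emb_a, emb_b = encoder(name_a), encoder(name_b)
print(loss_fn(emb_a, emb_b, labels).item(), cosine(emb_a, emb_b).shape)
```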
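Finally, standardized names are easier to produce when every product hangs off a shared category tree. The toy taxonomy below only illustrates the idea; real ontologies such as the Google product taxonomy or GS1 GPC are far larger and maintained outside the code, and the categories here are invented.

```python
# Toy category tree; the labels are made up for illustration only.
TAXONOMY = {
    "Food & Beverages": {
        "Beverages": ["Soft Drinks", "Juice", "Coffee"],
        "Pasta & Grains": ["Spaghetti", "Rice"],
    },
    "Electronics": {
        "Phones": ["Smartphones", "Accessories"],
    },
}

def path_to(leaf, tree=TAXONOMY, trail=()):
    """Return the category path leading to a leaf label, or None if it is absent."""
    for key, value in tree.items():
        if isinstance(value, dict):
            found = path_to(leaf, value, trail + (key,))
            if found:
                return found
        elif leaf in value:
            return trail + (key, leaf)
    return None

print(path_to("Spaghetti"))
# ('Food & Beverages', 'Pasta & Grains', 'Spaghetti')
```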