PyData Yerevan April 2022 Meetup | Data Standardization: The What, The Why and The How

Data standardization is crucial for e-commerce: it enables relevant search results and a better user experience. In this PyData Yerevan talk, the speaker shares an approach to standardizing product data and practical insights on how to achieve it.

Key takeaways
  • Data standardization is important in e-commerce to enable relevant search results and improve user experience.
  • Currently, there is no standard way to normalize product names, leading to issues like irrelevant search results.
  • A hierarchical approach to data standardization is proposed, focusing on product names, synonyms, and antonyms.
  • NLP techniques such as tokenization and named entity recognition (NER) can be used to identify and standardize product names.
  • Embeddings (e.g., word2vec) can be used to capture semantic relationships between words and improve matching.
  • The speaker's team developed a model that generates negative training pairs ("antonyms") by swapping tokens between product names.
  • Data preprocessing involves filtering out irrelevant data and separating synonyms and antonyms.
  • A neural network architecture such as a siamese network can be used to compare similar products and identify matches.
  • Training a model on a large dataset can be challenging, especially when dealing with ambiguous data and limited labeled data.
  • Collecting labeled data through manual verification or using crowdsourcing platforms can improve the accuracy of the model.
  • Standardizing data can also involve using ontologies and taxonomies to categorize and describe products.
  • The speaker encourages data collectors to invest in proper data collection, labeling, and preprocessing to ensure high-quality datasets.
  • Data standardization is important for building scalable and efficient AI models that can provide accurate search results.
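The embedding-based matching mentioned in the takeaways can be illustrated with cosine similarity over word vectors. This is a minimal sketch: the vectors below are hand-crafted toy values, not real word2vec output, and serve only to show how semantic closeness translates into a similarity score.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings (illustrative values only): semantically related
# words are given nearby vectors, as a trained word2vec model would.
embeddings = {
    "phone":      [0.90, 0.10, 0.00],
    "smartphone": [0.85, 0.15, 0.05],
    "sofa":       [0.00, 0.20, 0.95],
}

print(cosine_similarity(embeddings["phone"], embeddings["smartphone"]))  # close to 1
print(cosine_similarity(embeddings["phone"], embeddings["sofa"]))        # close to 0
```

In practice the vectors would come from a model trained on product text (e.g. gensim's Word2Vec), and a similarity threshold would decide whether two product-name tokens should be treated as synonyms.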
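The talk mentions generating negative pairs by swapping tokens. The exact method is not described, so the following is one plausible minimal sketch: swap a random token between two unrelated product names, producing corrupted names that no longer match the originals and can serve as hard negatives during training.

```python
import random

def make_negative_pair(name_a, name_b, rng=random):
    """Create a hard negative pair by swapping one randomly chosen token
    between two product names. The corrupted names keep realistic structure
    but should not match the originals."""
    tokens_a, tokens_b = name_a.split(), name_b.split()
    i = rng.randrange(len(tokens_a))
    j = rng.randrange(len(tokens_b))
    tokens_a[i], tokens_b[j] = tokens_b[j], tokens_a[i]
    return " ".join(tokens_a), " ".join(tokens_b)

# Hypothetical product names, for illustration only.
neg_a, neg_b = make_negative_pair(
    "iphone 13 pro 128gb", "galaxy s22 ultra 256gb", random.Random(1)
)
print(neg_a)
print(neg_b)
```

Note that only the overall idea (token swapping to build negatives) comes from the talk; the function name and sampling strategy here are assumptions.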
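The siamese comparison in the takeaways can be sketched without any deep-learning framework. The essential idea is that both product names pass through the *same* encoder and a distance between the resulting embeddings decides whether they match. Here the "encoder" is a trivial fixed bag-of-tokens vectorizer over a hypothetical mini-vocabulary; a real siamese network would use a learned neural encoder with shared weights, but the comparison logic is the same.

```python
import math

# Hypothetical mini-vocabulary for illustration; a real system would
# learn an encoder rather than enumerate tokens.
VOCAB = ["iphone", "13", "pro", "galaxy", "s22", "ultra", "128gb", "256gb"]

def encode(name):
    """Shared encoder: map a product name to a bag-of-tokens vector.
    Both sides of the siamese comparison must use this same function."""
    tokens = name.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]

def distance(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def is_match(name_a, name_b, threshold=0.5):
    """Siamese-style decision: encode both names with the shared encoder,
    then threshold the embedding distance."""
    return distance(encode(name_a), encode(name_b)) < threshold

print(is_match("iPhone 13 Pro", "iphone 13 pro"))   # same product, matches
print(is_match("iphone 13 pro", "galaxy s22 ultra"))  # different products
```

In a trained siamese network, the threshold and the encoder weights would be learned from the positive and negative pairs described above (e.g. with a contrastive or triplet loss).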