Ariel Gamiño - Leveraging LLM for text augmentation in Topic Modeling | PyData Global 2023

Learn how to leverage large language models (LLMs) for text augmentation in topic modeling, improving data quality and generating synthetic data for better insights and predictions.

Key takeaways
  • Data augmentation with an LLM can improve the quality of topic modeling by generating new data that is similar to the existing data.
  • An LLM can generate synthetic data that is more diverse and balanced than the original data.
  • Topic modeling can be used to identify patterns and topics in the text data, including clusters, keywords, and sentiment.
  • BERT and other transformer-based models can be used for topic modeling; the presentation used Yelp and Twitter data.
  • Both the quality and the randomness of the data can affect the topic modeling results.
  • The speaker used the web scraping library BeautifulSoup to collect data from the web.
  • The data was then tokenized and processed using the NLTK library.
  • The LLM was used to generate synthetic data, and sentiment analysis was used to analyze the generated data.
  • The coherence score and diversity score were used to evaluate the quality of the generated data.
  • The speaker used the OpenFoodFacts dataset to train the LLM and generate synthetic data.
  • The data was then used to train a topic model using the word topic algorithm.
  • The topic model was evaluated using metrics such as perplexity and coherence score.
  • The speaker also used the LLM to generate synthetic data for product reviews and descriptive text.
  • The generated data was used to train a classifier to predict the sentiment of the reviews.
  • The speaker used the LLM to generate synthetic data for product names, descriptions, and categories.
  • The speaker also used the LLM to generate synthetic data for product characteristics, such as ingredients, but the results were not presented in the talk.
  • The speaker emphasized the importance of quality control and filtering in the data generation process to ensure the generated data is useful and accurate.
  • The speaker also emphasized the importance of evaluating the quality of the generated data using metrics such as coherence score and diversity score.
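
The takeaways name a diversity score without defining it. As a minimal sketch, assuming one common definition (the fraction of unique words among the top words of all topics), it can be computed in a few lines of Python. The function name `topic_diversity`, the `top_k` parameter, and the example topics are hypothetical and are not taken from the talk, which may have used a different formula.

```python
def topic_diversity(topics, top_k=10):
    """Return the fraction of unique words among the top_k words of every topic.

    Assumed definition (not confirmed by the talk): 1.0 means no topic shares
    a top word with another topic; values near 0 mean the topics overlap heavily.
    """
    # Flatten the top_k words of each topic into a single list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("topics must contain at least one word")
    return len(set(top_words)) / len(top_words)


# Hypothetical top words for three topics (illustrative only).
example_topics = [
    ["pizza", "cheese", "crust", "sauce"],
    ["service", "waiter", "friendly", "slow"],
    ["pizza", "delivery", "late", "cold"],
]

example_score = topic_diversity(example_topics, top_k=4)
print(f"topic diversity: {example_score:.2f}")
```

Comparing this score for a topic model fitted on the original data against one fitted on the augmented data gives a rough signal of whether the synthetic text adds new vocabulary to the topics or only repeats what was already there.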