Ariel Gamiño - Leveraging LLM for text augmentation in Topic Modeling | PyData Global 2023
Learn how to leverage Large Language Models (LLMs) for text augmentation in topic modeling, improving data quality and generating synthetic data for better insights and predictions.
- Data augmentation with an LLM can improve the quality of topic modeling by generating new text that is similar to the existing data.
- An LLM can also generate synthetic data that is more diverse and balanced than the original data.
- Topic modeling can be used to identify patterns and topics in the text data, including clusters, keywords, and sentiment.
- BERT and other transformer-based models can be used for topic modeling. Yelp and Twitter data were used in the presentation.
- The quality of the data and the randomness of the data can affect the results of the topic modeling.
- The speaker used the web scraping library BeautifulSoup to collect data from the web.
- The data was then tokenized and processed using the NLTK library.
- The LLM was used to generate synthetic data, and sentiment analysis was then used to analyze the generated data.
- The coherence score and diversity score were used to evaluate the quality of the generated data.
- The speaker used the OpenFoodFacts dataset to train the LLM and generate synthetic data.
- The data was then used to train a topic model using the word topic algorithm.
- The topic model was evaluated using metrics such as perplexity and coherence score.
- The speaker also used the LLM to generate synthetic data for product reviews and descriptive text.
- The generated data was used to train a classifier to predict the sentiment of the reviews.
- The speaker used the LLM to generate synthetic data for product names, descriptions, and categories.
- The speaker also used the LLM to generate synthetic data for product characteristics, such as ingredients, but the results were not presented in the talk.
- The speaker emphasized the importance of quality control and filtering in the data generation process to ensure the generated data is useful and accurate.
- The speaker also emphasized the importance of evaluating the quality of the generated data using metrics such as coherence score and diversity score.
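The coherence and diversity scores mentioned above can be illustrated with a minimal sketch. This is not the speaker's code: the function names, the toy topics, and the toy documents are hypothetical, diversity is taken to be the share of unique words among each topic's top words, and coherence is a UMass-style score averaged over word pairs. The talk may have used different definitions or a library implementation.

```python
import math
from itertools import combinations


def topic_diversity(topics, top_k=10):
    """Share of unique words among the top_k words of all topics (0 to 1)."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words) if top_words else 0.0


def umass_coherence(topic, documents, eps=1.0):
    """UMass-style coherence of one topic, averaged over its word pairs.

    `topic` is a list of words ordered from most to least important.
    `documents` is a list of tokenized documents (lists of words).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every given word.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for i, j in combinations(range(len(topic)), 2):
        higher, lower = topic[i], topic[j]  # i < j, so `higher` ranks first
        freq_higher = doc_freq(higher)
        if freq_higher == 0:
            continue  # word never appears in the corpus; skip the pair
        scores.append(math.log((doc_freq(higher, lower) + eps) / freq_higher))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data, for illustration only.
toy_topics = [
    ["pizza", "pasta", "cheese"],
    ["service", "staff", "cheese"],
]
toy_documents = [
    ["pizza", "pasta", "cheese"],
    ["pizza", "cheese"],
    ["service", "staff"],
    ["staff", "service", "slow"],
]

print("diversity:", round(topic_diversity(toy_topics, top_k=3), 3))
for topic in toy_topics:
    print(topic, "coherence:", round(umass_coherence(topic, toy_documents), 3))
```

Under these assumptions, higher diversity means the topics share fewer top words, and a coherence closer to zero or above means a topic's words tend to appear in the same documents. Comparing both numbers before and after adding LLM-generated text is one way to check whether the augmentation helped.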