Sean P. Rogers - Introduction to Machine Learning for Text Analysis and Classification with Python

Learn how to build text classification models in Python using machine learning. Covers preprocessing, feature engineering, model training & evaluation with NLTK and scikit-learn.

Key takeaways

Machine learning pipeline focuses on text preprocessing, feature engineering, and model training/evaluation using Python libraries like NLTK, scikit-learn, and pandas
Dataset consisted of ~1000 labeled tweets about wildlife selfies, categorized into classes like abusive, benign, and educational interactions
Key preprocessing steps include:
- Removing stop words, punctuation, usernames
- Lemmatization for word normalization
- Emoji handling
- Text vectorization using TF-IDF
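
As a rough illustration of the cleaning steps listed above, here is a minimal sketch using only the standard library. The stop word set, lemma table, and emoji map are tiny hypothetical stand-ins and are not taken from the talk. A real pipeline would more likely use NLTK's stop word list and `WordNetLemmatizer`, then hand the cleaned text to a TF-IDF vectorizer such as scikit-learn's `TfidfVectorizer`.

```python
import re
import string

# Hypothetical stand-ins for illustration only. NLTK's stop word corpus and
# WordNetLemmatizer would normally fill these roles.
STOP_WORDS = {"a", "an", "the", "with", "my", "is", "at", "and", "to", "of"}
LEMMAS = {"monkeys": "monkey", "selfies": "selfie", "hugging": "hug"}
# One possible way to handle emoji: turn them into word-like tokens.
EMOJI_WORDS = {"🐒": " monkey_emoji ", "😍": " heart_eyes_emoji "}


def preprocess(tweet):
    """Return a list of cleaned, normalized tokens for one tweet."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"@\w+", " ", text)          # drop usernames
    for emoji, word in EMOJI_WORDS.items():    # keep emoji as tokens
        text = text.replace(emoji, word)
    # Strip punctuation, but keep the underscores used in emoji tokens.
    punctuation = string.punctuation.replace("_", "")
    text = text.translate(str.maketrans(punctuation, " " * len(punctuation)))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Dictionary lookup as a placeholder for real lemmatization.
    return [LEMMAS.get(t, t) for t in tokens]


example = "@zoo_fan Hugging the monkeys at the park!! 🐒😍 https://example.com/pic"
tokens = preprocess(example)
print(tokens)
```

Usernames are removed before punctuation is stripped so that the `@` marker is still there to identify them.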
Random Forest classifier performed well for this use case, with an average F1 score of ~90%; preferred over SVM for its better explainability
Cross-validation and confusion matrices used to evaluate model performance and check for overfitting
Feature engineering through one-hot encoding of key terms/signals helped distinguish between classes
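
One plausible reading of this step is a set of binary indicator columns, one per signal term, set to 1 when the term appears in a tweet. The sketch below assumes that reading; the signal list is hypothetical and not taken from the talk.

```python
# Hypothetical signal terms; the talk's actual list is not given here.
SIGNALS = ["hug", "hold", "ride", "distance", "learn"]


def encode_signals(tokens):
    """Return one 0/1 flag per signal term, in the order of SIGNALS."""
    present = set(tokens)
    return [1 if term in present else 0 for term in SIGNALS]


# Already-preprocessed token lists (invented examples).
rows = [
    ["hug", "monkey", "selfie"],
    ["deer", "distance", "selfie"],
    ["learn", "hold", "harm"],
]
features = [encode_signals(row) for row in rows]
for row, flags in zip(rows, features):
    print(row, flags)
```

Flags like these can be appended to the TF-IDF columns, and because each column has a plain meaning they are easy to explain when a model leans on them.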
Temporal analysis revealed spikes in wildlife selfie activity during vacation periods (June/July, March break)
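
A monthly count of this kind can be produced with a short pandas grouping. The sketch below uses invented timestamps and a hypothetical `created_at` column name purely to show the mechanics; it does not reproduce the talk's data or findings.

```python
import pandas as pd

# Invented timestamps; "created_at" is an assumed column name.
tweets = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2019-03-12", "2019-03-14", "2019-06-21", "2019-07-04",
        "2019-07-19", "2019-07-30", "2019-10-02",
    ]),
})

# Count tweets per calendar month (1 = January, ..., 12 = December).
per_month = tweets.groupby(tweets["created_at"].dt.month).size()
busiest_month = per_month.idxmax()

print(per_month)
print("busiest month:", busiest_month)
```

Plotting `per_month` as a bar chart is a quick way to spot seasonal spikes before any modeling.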
Focus on making models explainable and accessible to non-technical stakeholders rather than pursuing maximum accuracy
Important to explore data through visualization and manual review before building models
Classical ML approaches can be preferable to deep learning/LLMs when explainability and reproducibility are priorities
