About This Project
An end-to-end machine learning pipeline for Twitter sentiment analysis. The project began with exploratory data analysis of the Sentiment140 dataset (1.6 million tweets) in a Jupyter notebook, covering data patterns, text characteristics, and preprocessing techniques. That analysis informed a production-ready sentiment analyzer that classifies text as positive or negative using a Logistic Regression model with TF-IDF vectorization.
Key Features
- End-to-end ML pipeline: From raw tweet data exploration to trained model deployment
- Real-time sentiment prediction: Interactive web interface for instant text classification
- Binary sentiment classification: Positive/Negative detection with confidence scores
- Comprehensive text preprocessing: @mention removal, URL cleaning, stopword filtering, and Porter stemming
- Model transparency: Shows how input text is processed before prediction
- Sample text testing: Pre-loaded examples for quick demonstrations
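The preprocessing steps listed above (@mention removal, URL cleaning, stopword filtering, Porter stemming) could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `preprocess`, the regular expressions, the lowercasing and punctuation stripping, and the tiny inline stopword set are all assumptions (the project uses NLTK's stopword list). Stemming uses NLTK's `PorterStemmer` when NLTK is installed and is skipped otherwise.

```python
import re

try:
    from nltk.stem import PorterStemmer
    stem = PorterStemmer().stem
except ImportError:
    # NLTK not available: fall back to leaving tokens unstemmed
    def stem(word):
        return word

# Tiny illustrative stopword set (assumption); NLTK's English list is much larger
STOPWORDS = {"a", "an", "the", "is", "are", "was", "i", "it", "this",
             "to", "of", "and", "in", "on", "for", "my", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # hypothetical URL pattern
MENTION_RE = re.compile(r"@\w+")                # hypothetical @mention pattern
NON_LETTER_RE = re.compile(r"[^a-z\s]")         # assumption: keep letters only


def preprocess(text):
    """Clean a tweet and return a space-joined string of stemmed tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # URL cleaning
    text = MENTION_RE.sub(" ", text)     # @mention removal
    text = NON_LETTER_RE.sub(" ", text)  # drop digits and punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]  # stopword filtering
    return " ".join(stem(t) for t in tokens)                  # Porter stemming


cleaned = preprocess("@bob this phone is good http://example.com/x")
print(cleaned)
```

Removing URLs before mentions avoids leaving fragments behind when a URL itself contains an `@` character.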
Technology Stack
- ML/Data: Pandas, NumPy, Scikit-learn
- NLP: NLTK (Tokenization, Stemming, Stopwords)
- Vectorization: TF-IDF (5,000 features, unigrams + bigrams)
- Model: Logistic Regression with L2 regularization
- Web App: Streamlit
- Visualization: Matplotlib, Seaborn
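As a rough sketch of how the vectorizer and model settings above fit together, the following example wires a TF-IDF vectorizer (5,000 features, unigrams plus bigrams) into a Logistic Regression classifier with scikit-learn. The toy training texts, the label encoding (1 = positive, 0 = negative), `max_iter=1000`, and the use of `make_pipeline` are assumptions for illustration; the project itself trains on Sentiment140 and may structure its code differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, already-preprocessed examples (illustrative only, not Sentiment140)
texts = ["love great phone", "happi good day", "hate bad servic", "terribl sad day"]
labels = [1, 1, 0, 0]  # assumed encoding: 1 = positive, 0 = negative

model = make_pipeline(
    # Settings from the Technology Stack list: 5,000 features, unigrams + bigrams
    TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
    # L2 regularization is scikit-learn's default penalty for LogisticRegression
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Predicted class plus a confidence score taken from the class probabilities
proba = model.predict_proba(["good phone"])[0]
label = model.predict(["good phone"])[0]
confidence = proba.max()
print(label, round(float(confidence), 3))
```

Taking the largest value from `predict_proba` is one simple way to produce the confidence score mentioned under Key Features.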
