Project 01 Classification · NLP
📩 Spam Email Classifier
Train a model to detect and filter spam emails by analysing message content with Natural Language Processing.
- Extract features from raw email text using TF-IDF vectorisation
- Train Naive Bayes and Logistic Regression classifiers and compare accuracy
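- Evaluate with Precision, Recall, and F1-Score, and explain the trade-off for spam: low Recall lets spam through, while low Precision sends legitimate mail to the spam folder, which is usually the costlier error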
- Build a predict function: input any email string, output Spam or Not Spam
NLP · TF-IDF · Naive Bayes · F1-Score
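The evaluation step above can be sketched in plain Python. This is a minimal illustration, assuming labels are encoded as 1 for spam and 0 for not spam; the function name `precision_recall_f1` and the toy labels are hypothetical, and a real project would more likely rely on scikit-learn's metrics module.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, made up for illustration: 1 = spam, 0 = not spam
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the four spam emails are caught (Recall 0.5) and one legitimate email is flagged (Precision 2/3), which makes the two kinds of error easy to discuss separately.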
Project 02 Regression
🏠 House Price Predictor
Predict residential property prices from location, size, age, and neighbourhood features using regression models.
- Perform full EDA: distributions, outliers, and correlation heatmap with seaborn
- Engineer features: price per sqft, property age, distance bucket to city centre
- Compare Linear Regression vs Random Forest Regressor using cross-validation
- Report MSE, RMSE, and R², and identify the top 3 features that drive house price
Regression · EDA · Feature Eng. · Random Forest
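The three reporting metrics can be computed by hand to see how they relate. The sketch below assumes prices in arbitrary units; `regression_report` and the sample values are hypothetical and stand in for whatever the trained model predicts.

```python
import math


def regression_report(y_true, y_pred):
    """Return MSE, RMSE and R² for paired actual and predicted values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mse = ss_res / n
    rmse = math.sqrt(mse)  # same units as the target
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot  # share of variance explained
    return {"mse": mse, "rmse": rmse, "r2": r2}


# Made-up prices (in thousands) for illustration
actual = [200.0, 300.0, 400.0, 500.0]
predicted = [210.0, 290.0, 410.0, 490.0]

report = regression_report(actual, predicted)
print(report)
```

RMSE is usually the easiest of the three to explain to a non-technical reader because it is expressed in the same units as the price.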
Project 03 Classification · Imbalanced
💳 Credit Card Fraud Detector
Detect fraudulent transactions in a heavily imbalanced dataset where only 0.17% of transactions are fraud.
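- Demonstrate the accuracy paradox: a model that labels every transaction legitimate scores 99.83% accuracy yet catches no fraud
- Apply SMOTE oversampling to the training split only, so the model trains on balanced classes while the test set keeps the real fraud rate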
- Train Random Forest and evaluate with ROC-AUC and F1 (not accuracy)
- Lower the decision threshold to reduce false negatives, since missed fraud is the costliest error, and track how many false alarms that adds
Imbalanced · SMOTE · ROC-AUC · Threshold
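The accuracy paradox and the effect of the decision threshold can both be shown with a few lines of arithmetic. The sketch below assumes a dataset of 10,000 transactions at the 0.17% fraud rate quoted above; the `classify` helper and the three fraud scores are hypothetical.

```python
n_total = 10_000
n_fraud = 17  # 0.17% of 10,000 transactions

y_true = [1] * n_fraud + [0] * (n_total - n_fraud)
y_pred = [0] * n_total  # a "model" that always predicts legitimate

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
fraud_recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / n_fraud
print(f"accuracy={accuracy:.4f} fraud_recall={fraud_recall:.2f}")


def classify(scores, threshold):
    """Label a transaction as fraud (1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical model scores for three transactions that are truly fraud
fraud_scores = [0.90, 0.45, 0.35]
caught_default = sum(classify(fraud_scores, 0.5))
caught_lowered = sum(classify(fraud_scores, 0.3))
print(caught_default, caught_lowered)
```

Lowering the threshold catches more of the fraud cases, but it will also flag more legitimate transactions, so the chosen value is a trade-off rather than a free improvement.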
Project 04 Clustering · Unsupervised
👥 Customer Segmentation
Group e-commerce customers by purchasing behaviour using RFM analysis and KMeans clustering.
- Compute RFM features: Recency, Frequency, and Monetary value per customer
- Use Elbow Method to find optimal K, then apply KMeans clustering
- Reduce to 2D with PCA and visualise clusters as a colour-coded scatter plot
- Name and describe each segment: Champions, At-Risk, Hibernating, New Customers
RFM · KMeans · PCA · Segmentation
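The RFM step can be sketched without any libraries. This assumes each transaction is a `(customer_id, order_date, amount)` tuple and that Recency is measured in days before a chosen snapshot date; the `compute_rfm` name, the tuple layout, and the sample rows are all hypothetical.

```python
from datetime import date

# Hypothetical transactions: (customer_id, order_date, amount)
transactions = [
    ("c1", date(2024, 1, 5), 40.0),
    ("c1", date(2024, 3, 1), 60.0),
    ("c2", date(2023, 11, 20), 15.0),
]


def compute_rfm(rows, snapshot):
    """Return {customer: {recency, frequency, monetary}} as of the snapshot date."""
    rfm = {}
    for customer, order_date, amount in rows:
        days_ago = (snapshot - order_date).days
        entry = rfm.setdefault(
            customer, {"recency": days_ago, "frequency": 0, "monetary": 0.0}
        )
        entry["recency"] = min(entry["recency"], days_ago)  # most recent order
        entry["frequency"] += 1  # number of orders
        entry["monetary"] += amount  # total spend
    return rfm


rfm = compute_rfm(transactions, snapshot=date(2024, 3, 31))
print(rfm)
```

Because the three features sit on very different scales (days, counts, currency), they would normally be standardised before being passed to KMeans.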
Project 05 NLP · Multi-class
🥳 Product Sentiment Analyser
Classify product reviews as Positive, Neutral, or Negative to help businesses monitor customer satisfaction at scale.
- Clean and preprocess text: lowercase, remove punctuation and stopwords, and lemmatise
- Vectorise reviews with TF-IDF and train multi-class Logistic Regression
- Report per-class Precision, Recall, F1 — identify the hardest sentiment to predict
- Build a live predict function that takes any review string and returns sentiment
NLP · Text Cleaning · TF-IDF · Multi-class
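The cleaning steps can be sketched with the standard library alone. The stopword set below is a tiny illustrative assumption, `clean_review` is a hypothetical name, and lemmatisation is left out because it needs an external library such as NLTK or spaCy.

```python
import string

# Tiny illustrative stopword list (an assumption); real projects use a fuller one
STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "to", "of", "i"}

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text):
    """Lowercase, strip punctuation, and drop stopwords; returns a token list."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    return [tok for tok in no_punct.split() if tok not in STOPWORDS]


tokens = clean_review("This phone is GREAT, and the battery lasts!")
print(tokens)
```

Note that aggressive stopword removal can hurt sentiment models when words such as "not" are on the list, so the stopword set is worth reviewing before training.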
Project 06 Healthcare · Classification
💊 Diabetes Risk Predictor
Predict whether a patient is at risk of diabetes using clinical measurements: glucose, BMI, blood pressure, and age.
- Handle missing values coded as zeros — physiologically impossible readings need imputation
- Train Logistic Regression, Decision Tree, and Random Forest, then compare all three
- Plot feature importances and identify the top 3 clinical predictors of diabetes risk
- Justify metric choice: explain why Recall matters more than Precision in medical AI
Healthcare · Clinical Data · Feature Importance
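The zero-as-missing step can be sketched as follows. Median imputation is one reasonable choice, not the only one; `impute_zeros_with_median` and the glucose readings are hypothetical, and the function assumes the column has at least one non-zero reading.

```python
from statistics import median


def impute_zeros_with_median(values):
    """Replace zeros (treated as missing) with the median of the non-zero values."""
    observed = [v for v in values if v != 0]
    fill = median(observed)  # raises StatisticsError if every value is zero
    return [fill if v == 0 else v for v in values]


# Made-up glucose readings; a glucose of 0 is physiologically impossible
glucose = [148, 85, 0, 89, 137, 0, 116]
glucose_clean = impute_zeros_with_median(glucose)
print(glucose_clean)
```

To avoid leaking information, the fill value would normally be computed on the training split and then reused unchanged on the test split.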
Project 07 Time Series · Regression
📈 Retail Sales Forecaster
Predict future weekly sales for a retail chain using historical transaction data and engineered time features.
- Engineer time features: day of week, month, quarter, is_holiday, rolling 4-week average
- Train a Random Forest Regressor on all engineered features using time-ordered cross-validation, so each fold is validated on weeks later than its training data
- Evaluate with MAE and MAPE (Mean Absolute Percentage Error)
- Plot predicted vs actual sales on a time-series line chart and identify seasonal patterns
Time Series · Feature Eng. · MAE/MAPE · Forecasting
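The rolling-average feature and the two error metrics can be sketched in plain Python. The sketch assumes the rolling mean for a week uses only the four weeks before it, so the feature never contains the value being predicted; the function names and sales figures are hypothetical.

```python
def rolling_mean_previous(values, window=4):
    """Mean of the previous `window` values; None until enough history exists."""
    out = []
    for i in range(len(values)):
        if i < window:
            out.append(None)
        else:
            out.append(sum(values[i - window:i]) / window)
    return out


def mae(actual, predicted):
    """Mean Absolute Error, in the same units as the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent; assumes no actual value is zero."""
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100 * total / len(actual)


# Made-up weekly sales for illustration
weekly_sales = [100.0, 120.0, 80.0, 100.0, 110.0, 90.0]
rolling = rolling_mean_previous(weekly_sales)

# Use the rolling mean as a naive baseline forecast for weeks 5 and 6
mae_value = mae(weekly_sales[4:], rolling[4:])
mape_value = mape(weekly_sales[4:], rolling[4:])
print(rolling, mae_value, round(mape_value, 2))
```

A naive baseline like this gives the Random Forest a number to beat, which makes the reported MAE and MAPE easier to interpret.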
Project 08 Recommender · NLP
🎬 Movie Recommendation Engine
Build a content-based recommendation system that suggests similar movies based on genre, cast, director, and keywords.
- Combine genre, cast, director, and keyword metadata into a single feature string per movie
- Compute TF-IDF vectors and cosine similarity matrix across all movie pairs
- Build a function: input any movie title, return the top 10 most similar movies
- Explain the difference between content-based and collaborative filtering approaches
Recommender · Cosine Similarity · Content-Based
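The pipeline above can be sketched end to end in plain Python. This uses a simplified TF-IDF weighting (term count times `log(N / df) + 1`), which is an assumption and differs in detail from library implementations; the catalogue, the metadata tokens, and the `recommend` function are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical catalogue: title -> combined metadata string
movies = {
    "Space Quest": "scifi adventure space director_a",
    "Star Voyage": "scifi adventure space aliens",
    "Love in Paris": "romance drama paris director_b",
}


def tfidf_vectors(docs):
    """Return {title: {term: weight}} using count * (log(N / df) + 1)."""
    tokenised = {title: text.split() for title, text in docs.items()}
    n_docs = len(tokenised)
    doc_freq = Counter(term for toks in tokenised.values() for term in set(toks))
    return {
        title: {
            term: count * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in Counter(toks).items()
        }
        for title, toks in tokenised.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(title, docs, top_n=10):
    """Return up to top_n other titles, most similar first."""
    vectors = tfidf_vectors(docs)
    query = vectors[title]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != title]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:top_n]]


suggestions = recommend("Space Quest", movies, top_n=2)
print(suggestions)
```

Everything here depends only on movie metadata, which is what makes it content-based; a collaborative filter would instead need user ratings or viewing histories.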
Project 09 Computer Vision · Classification
📷 Handwritten Digit Classifier
Classify handwritten digits 0-9 from 28x28 pixel images using the MNIST dataset — the Hello World of deep learning.
- Flatten each 28x28 image to a 784-element feature vector and scale pixel values to the 0-1 range
- Train Random Forest and a simple MLP Neural Network, then compare test accuracy
- Plot a confusion matrix heatmap to visualise which digits the model confuses most
- Display 9 misclassified images with true vs predicted labels and explain the errors
Computer Vision · MNIST · Neural Net · Confusion Matrix
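The preprocessing and confusion-matrix steps can be sketched without any libraries. The sketch assumes pixel intensities are integers from 0 to 255 and that rows of the matrix are true labels and columns are predictions; the helper names, the synthetic image, and the label lists are hypothetical.

```python
def flatten_and_scale(image):
    """Flatten a 2D grid of 0-255 pixel values into one list scaled to 0-1."""
    return [pixel / 255.0 for row in image for pixel in row]


def confusion_matrix(y_true, y_pred, n_classes=10):
    """matrix[t][p] counts samples with true label t that were predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Synthetic 28x28 image: all black except one fully white pixel
image = [[0] * 28 for _ in range(28)]
image[0][0] = 255
features = flatten_and_scale(image)
print(len(features), features[0])

# Made-up labels: the 4s and 9s are confused with each other once each
y_true = [4, 9, 9, 4, 7]
y_pred = [4, 4, 9, 9, 7]
cm = confusion_matrix(y_true, y_pred)
print(cm[9][4], cm[4][9])
```

Off-diagonal cells such as `cm[9][4]` are the ones to highlight in the heatmap, since they show which pairs of digits the model mixes up.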
Project 10 Regression · Energy
⚡ Energy Consumption Predictor
Predict hourly electricity demand for a city using weather conditions, time-of-day, and economic activity indicators.
- Encode cyclical time features using sin/cos pairs for hour and month — preserving circular nature
- Handle weather sensor outliers and missing readings using median imputation
- Compare Gradient Boosting Regressor vs Linear Regression using 5-fold cross-validation
- Report feature importances and identify which factors cause electricity demand peaks
Energy AI · Cyclical Features · Gradient Boosting
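The sin/cos encoding can be sketched in a few lines. The sketch assumes hours run from 0 to 23 and months from 1 to 12 (so months are shifted to start at 0 before encoding); `encode_cyclical` is a hypothetical helper name.

```python
import math


def encode_cyclical(value, period):
    """Map a value on a cycle of length `period` to a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


hour_23 = encode_cyclical(23, 24)
hour_0 = encode_cyclical(0, 24)
hour_12 = encode_cyclical(12, 24)

# On the circle, 23:00 sits next to 00:00, while 12:00 is on the opposite side
near = math.dist(hour_23, hour_0)
far = math.dist(hour_12, hour_0)
print(round(near, 3), round(far, 3))

# Months are 1-12, so shift to 0-11 before encoding
december = encode_cyclical(12 - 1, 12)
january = encode_cyclical(1 - 1, 12)
```

With a raw hour column, 23 and 0 look 23 units apart; with the encoded pair they are close neighbours, which is the circular structure the bullet refers to.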