Capstone Lab | Week 4 | Synkoc AI/ML Internship

Project 01 Classification · NLP

📩 Spam Email Classifier

Train a model to detect and filter spam emails by analysing message content with Natural Language Processing.

Extract features from raw email text using TF-IDF vectorisation
Train Naive Bayes and Logistic Regression classifiers and compare accuracy
Evaluate with Precision, Recall, and F1-Score, and explain which of Precision or Recall matters more for a spam filter (blocking a legitimate email is usually the costlier error)
Build a predict function: input any email string, output Spam or Not Spam

NLP · TF-IDF · Naive Bayes · F1-Score
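
As a rough illustration of the TF-IDF step in Project 01, the sketch below weights each word by how often it appears in one email and how rare it is across all emails. It is a simplified plain-Python variant (whitespace tokens, unsmoothed log IDF), not the exact formula used by scikit-learn's `TfidfVectorizer`; the function name and the sample emails are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (simplified TF-IDF)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Weight = term frequency in this document * log inverse document frequency.
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical emails, for illustration only.
emails = [
    "win a free prize now",
    "meeting agenda for monday",
    "free prize inside claim now",
]
vectors = tfidf_vectors(emails)
```

In the first email, "win" receives a higher weight than "free" because "free" also occurs in another email. A word that appears in every email would get weight 0 under this unsmoothed variant.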

Project 02 Regression

🏠 House Price Predictor

Predict residential property prices from location, size, age, and neighbourhood features using regression models.

Perform full EDA: distributions, outliers, and correlation heatmap with seaborn
Engineer features: price per sqft, property age, distance bucket to city centre
Compare Linear Regression vs Random Forest Regressor using cross-validation
Report MSE, RMSE, and R², and identify the top 3 features that drive house price

Regression · EDA · Feature Eng. · Random Forest
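
The three metrics named in Project 02 are related, and a small sketch may make that concrete: RMSE is the square root of MSE, and R² compares the model's squared error with that of always predicting the mean. The function name and the prices below are hypothetical.

```python
import math


def regression_report(y_true, y_pred):
    """Return MSE, RMSE, and R² for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares (model error) and total sum of squares (baseline error).
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


# Hypothetical prices in thousands, for illustration only.
actual = [200.0, 300.0, 400.0, 500.0]
predicted = [210.0, 290.0, 410.0, 490.0]
report = regression_report(actual, predicted)
```

RMSE is in the same units as the price, which usually makes it easier to interpret than MSE.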

Project 03 Classification · Imbalanced

💳 Credit Card Fraud Detector

Detect fraudulent transactions in a heavily imbalanced dataset where only 0.17% of transactions are fraud.

Demonstrate the accuracy paradox: a model that labels every transaction as legitimate scores 99.83% accuracy yet catches no fraud
Apply SMOTE oversampling to balance the classes in the training split only, leaving the test set untouched
Train Random Forest and evaluate with ROC-AUC and F1 (not accuracy)
Tune the decision threshold to reduce false negatives, since missed fraud is the costliest error

Imbalanced · SMOTE · ROC-AUC · Threshold
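
The accuracy paradox in Project 03 can be checked with a few lines of arithmetic. The sketch below builds a hypothetical set of 10,000 labels with 17 frauds (the 0.17% ratio stated above) and scores a "model" that never flags fraud; the helper name is an assumption.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall on the positive class, labelled 1)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(y_true)
    return correct / len(y_true), true_pos / actual_pos


# 10,000 hypothetical transactions with 17 frauds (0.17%).
labels = [1] * 17 + [0] * 9983
always_legit = [0] * len(labels)  # a "model" that never flags fraud

accuracy, recall = accuracy_and_recall(labels, always_legit)
print(f"accuracy={accuracy:.2%} recall={recall:.0%}")
```

Accuracy comes out at 99.83% while recall on the fraud class is 0%, which is why the project asks for ROC-AUC and F1 instead.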

Project 04 Clustering · Unsupervised

👥 Customer Segmentation

Group e-commerce customers by purchasing behaviour using RFM analysis and KMeans clustering.

Compute RFM features: Recency, Frequency, and Monetary value per customer
Use Elbow Method to find optimal K, then apply KMeans clustering
Reduce to 2D with PCA and visualise clusters as a colour-coded scatter plot
Name and describe each segment: Champions, At-Risk, Hibernating, New Customers

RFM · KMeans · PCA · Segmentation
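
One possible way to compute the RFM features for Project 04 is sketched below in plain Python. The row layout `(customer_id, order_date, order_value)`, the function name, and the snapshot date are assumptions; in practice a pandas group-by would do the same job.

```python
from datetime import date


def rfm_table(transactions, snapshot):
    """Map customer_id -> (recency_days, frequency, monetary)."""
    last_seen, frequency, monetary = {}, {}, {}
    for customer, day, amount in transactions:
        # Track the most recent order date, the order count, and the total spend.
        last_seen[customer] = max(last_seen.get(customer, day), day)
        frequency[customer] = frequency.get(customer, 0) + 1
        monetary[customer] = monetary.get(customer, 0.0) + amount
    return {
        c: ((snapshot - last_seen[c]).days, frequency[c], monetary[c])
        for c in last_seen
    }


# Hypothetical (customer_id, order_date, order_value) rows.
orders = [
    ("c1", date(2024, 1, 5), 40.0),
    ("c1", date(2024, 3, 20), 60.0),
    ("c2", date(2023, 11, 2), 15.0),
]
rfm = rfm_table(orders, snapshot=date(2024, 4, 1))
```

The three features sit on very different scales, so they would normally be standardised before KMeans, which relies on Euclidean distance.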

Project 05 NLP · Multi-class

🥳 Product Sentiment Analyser

Classify product reviews as Positive, Neutral, or Negative to help businesses monitor customer satisfaction at scale.

Clean and preprocess text: lowercase, remove punctuation and stopwords, and lemmatise
Vectorise reviews with TF-IDF and train multi-class Logistic Regression
Report per-class Precision, Recall, F1 — identify the hardest sentiment to predict
Build a live predict function that takes any review string and returns sentiment

NLP · Text Cleaning · TF-IDF · Multi-class
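
For the per-class report in Project 05, the sketch below computes Precision, Recall, and F1 for each sentiment label and picks the label with the lowest F1 as the hardest one. The label names and predictions are hypothetical; scikit-learn's `classification_report` produces the same kind of table.

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        # Guard against division by zero when a label is never predicted or never occurs.
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1)
    return report


# Hypothetical labels, for illustration only.
truth = ["pos", "pos", "neu", "neu", "neg", "neg"]
preds = ["pos", "pos", "pos", "neu", "neg", "neu"]
report = per_class_report(truth, preds)
hardest = min(report, key=lambda label: report[label][2])
```

In this toy data the neutral class has the lowest F1, so it would be reported as the hardest sentiment to predict.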

Project 06 Healthcare · Classification

💊 Diabetes Risk Predictor

Predict whether a patient is at risk of diabetes using clinical measurements: glucose, BMI, blood pressure, and age.

Handle missing values coded as zeros — physiologically impossible readings need imputation
Train Logistic Regression, Decision Tree, and Random Forest, then compare all three
Plot feature importances and identify the top 3 clinical predictors of diabetes risk
Justify metric choice: explain why Recall matters more than Precision in medical AI

Healthcare · Clinical Data · Feature Importance
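
The zero-handling step in Project 06 might look like the sketch below, which treats zeros as missing and fills them with the median of the observed readings. The function name and the glucose values are hypothetical, and the approach only suits columns where zero is physiologically impossible.

```python
from statistics import median


def impute_zeros(values):
    """Replace zeros with the median of the non-zero readings."""
    observed = [v for v in values if v != 0]
    fill = median(observed)
    return [fill if v == 0 else v for v in values]


# Hypothetical glucose readings; 0 marks a missing measurement.
glucose = [148, 0, 85, 183, 0, 116]
cleaned = impute_zeros(glucose)
```

To avoid leaking information, the median would usually be computed on the training split and then reused for the test split.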

Project 07 Time Series · Regression

📈 Retail Sales Forecaster

Predict future weekly sales for a retail chain using historical transaction data and engineered time features.

Engineer time features: day of week, month, quarter, is_holiday, rolling 4-week average
Train a Random Forest Regressor on all engineered features using time-ordered cross-validation (train on earlier weeks, validate on later ones)
Evaluate with MAE and MAPE (Mean Absolute Percentage Error)
Plot predicted vs actual sales on a time-series line chart and identify seasonal patterns

Time Series · Feature Eng. · MAE/MAPE · Forecasting
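
The two error metrics in Project 07 differ in units: MAE is in sales units, while MAPE is a percentage of the actual value. A minimal sketch with hypothetical weekly figures:

```python
def mae(y_true, y_pred):
    """Mean Absolute Error, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error; undefined if any actual value is zero."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical weekly sales, for illustration only.
actual_sales = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

mae_value = mae(actual_sales, forecast)
mape_value = mape(actual_sales, forecast)
```

Because MAPE divides by the actual value, weeks with zero or near-zero sales need separate handling.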

Project 08 Recommender · NLP

🎬 Movie Recommendation Engine

Build a content-based recommendation system that suggests similar movies based on genre, cast, director, and keywords.

Combine genre, cast, director, and keyword metadata into a single feature string per movie
Compute TF-IDF vectors and cosine similarity matrix across all movie pairs
Build a function: input any movie title, return the top 10 most similar movies
Explain the difference between content-based and collaborative filtering approaches

Recommender · Cosine Similarity · Content-Based
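
The lookup function in Project 08 could be sketched as follows. For brevity this version uses raw word counts rather than TF-IDF weights, which is a simplification of the step described above; the titles, metadata strings, and function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(title, features, top_n=10):
    """Return up to top_n other titles, most similar first."""
    vectors = {t: Counter(text.lower().split()) for t, text in features.items()}
    target = vectors[title]
    scores = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != title]
    return [other for _, other in sorted(scores, reverse=True)[:top_n]]


# Hypothetical titles and combined metadata strings.
catalogue = {
    "Movie A": "scifi space director_x actor_y",
    "Movie B": "scifi space director_x actor_z",
    "Movie C": "romance comedy director_q actor_y",
}
similar = recommend("Movie A", catalogue, top_n=2)
```

Joining multi-word names with underscores (as in `director_x`) keeps each person as a single token, so a shared first name does not count as a match.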

Project 09 Computer Vision · Classification

📷 Handwritten Digit Classifier

Classify handwritten digits 0-9 from 28x28 pixel images using the MNIST dataset — the Hello World of deep learning.

Flatten each 28x28 image to a 784-element feature vector and scale pixel values to the range 0-1
Train Random Forest and a simple MLP Neural Network, then compare test accuracy
Plot a confusion matrix heatmap to visualise which digits the model confuses most
Display 9 misclassified images with true vs predicted labels and explain the errors

Computer Vision · MNIST · Neural Net · Confusion Matrix
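
Before plotting the heatmap in Project 09, it can help to see what the confusion matrix contains. The sketch below counts true vs predicted digits and finds the largest off-diagonal cell; the labels are hypothetical and the helper names are assumptions.

```python
def confusion_matrix(y_true, y_pred, n_classes=10):
    """matrix[i][j] counts samples of true digit i predicted as digit j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def most_confused(matrix):
    """Return (true_digit, predicted_digit, count) for the largest off-diagonal cell."""
    cells = [(count, t, p)
             for t, row in enumerate(matrix)
             for p, count in enumerate(row) if t != p]
    count, t, p = max(cells)
    return t, p, count


# Hypothetical labels, for illustration only.
true_digits = [4, 4, 9, 9, 9, 3, 5, 7]
pred_digits = [4, 9, 9, 4, 4, 3, 5, 7]
matrix = confusion_matrix(true_digits, pred_digits)
worst = most_confused(matrix)
```

Correct predictions sit on the diagonal, so the off-diagonal cells are the ones worth explaining in the write-up.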

Project 10 Regression · Energy

⚡ Energy Consumption Predictor

Predict hourly electricity demand for a city using weather conditions, time-of-day, and economic activity indicators.

Encode cyclical time features using sin/cos pairs for hour and month — preserving circular nature
Handle weather sensor outliers and missing readings using median imputation
Compare Gradient Boosting Regressor vs Linear Regression using 5-fold cross-validation
Report feature importances and identify which factors cause electricity demand peaks

Energy AI · Cyclical Features · Gradient Boosting
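
The cyclical encoding in Project 10 can be sketched as below: each hour is mapped to a point on the unit circle, so 23:00 ends up next to 00:00 instead of 23 units away. The function name is an assumption; for months, a period of 12 with months numbered from 0 would work the same way.

```python
import math


def encode_cyclical(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


hour_0 = encode_cyclical(0, 24)
hour_12 = encode_cyclical(12, 24)
hour_23 = encode_cyclical(23, 24)

# Distances in the encoded space reflect closeness on the clock.
near = math.dist(hour_23, hour_0)
far = math.dist(hour_12, hour_0)
```

Both the sin and cos columns are needed: with sin alone, 00:00 and 12:00 would receive the same value.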

Choose Your 2 Projects