Synkoc AI/ML Internship · Week 3 · Lesson 9 of 13
Unsupervised
Learning
Find hidden structure in unlabelled data. Master KMeans clustering, the Elbow Method, and PCA dimensionality reduction — the algorithms behind customer segmentation, anomaly detection, and data compression.
🎨 KMeans Clustering
📈 Elbow Method
🔆 PCA
👥 Segmentation
🧑‍💻
Synkoc Instructor
AI/ML Professional · Bangalore
⏳ ~50 minutes
🟢 Intermediate
What is Unsupervised Learning?
No labels. No right answers. The algorithm finds hidden patterns and structure in raw data entirely on its own.
🏗
Supervised vs Unsupervised
Supervised: X and y provided. Learn X→y mapping.
Unsupervised: Only X provided. Find structure within X itself.
supervised: model.fit(X, y)
unsupervised: model.fit(X)   # no y!
No labels means no grading — the data grades itself
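The contrast above can be sketched end-to-end. This is a minimal illustration, assuming a toy dataset from `make_blobs` and an arbitrary choice of `LogisticRegression` as the supervised model:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Toy data: 100 points in 3 groups, with ground-truth labels y
X, y = make_blobs(n_samples=100, centers=3, random_state=42)

# Supervised: both X and y go into fit()
clf = LogisticRegression().fit(X, y)

# Unsupervised: only X — the model invents its own group ids
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)
print(km.labels_[:10])  # cluster ids, not ground-truth labels
```

Note that `km.labels_` are arbitrary ids (0, 1, 2): cluster 0 need not correspond to class 0 in `y`.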
🌍
Real-World Use Cases
Customer segmentation, document topic modelling, anomaly detection, image compression, recommendation systems — all unsupervised at their core.
Netflix: clusters users by taste
Spotify: groups songs by sound
Banks: flags unusual transactions
Billion-dollar products built on these algorithms
📈
Clustering
Group similar items together automatically. KMeans, DBSCAN, Hierarchical Clustering. Items in the same cluster are more similar to each other than to items in other clusters.
km = KMeans(n_clusters=3)
labels = km.fit_predict(X)
KMeans is the most-used clustering algorithm
🔆
Dimensionality Reduction
Compress many features into fewer while preserving structure. PCA, t-SNE, UMAP. Used for visualisation, denoising, and speeding up downstream ML.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
PCA turns 100 features into 2 visualisable ones
Chapter 1 of 4
01
KMeans Clustering
The foundational clustering algorithm. Partitions data into K groups by minimising the distance between each point and its assigned centroid.
How KMeans Works
Four repeating steps converge to stable cluster assignments. The algorithm minimises within-cluster variance — called inertia.
🎨
The KMeans Algorithm
Step 1: Place K centroids randomly. Step 2: Assign every point to nearest centroid. Step 3: Move each centroid to the mean of its assigned points. Step 4: Repeat steps 2-3 until assignments stop changing — convergence.
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, random_state=42)
km.fit(X)

labels = km.labels_         # cluster for each sample
centers = km.cluster_centers_ # centroid coordinates
inertia = km.inertia_        # total within-cluster variance
💡Three key attributes: labels_ gives cluster assignment for each point · cluster_centers_ gives centroid positions · inertia_ measures how tight the clusters are (lower = tighter)
kmeans_clustering.py
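The four steps can also be written out by hand. The sketch below is illustrative only: sklearn's KMeans adds k-means++ initialisation, multiple restarts, and empty-cluster handling, none of which are modelled here.

```python
import numpy as np

def kmeans_sketch(X, k=3, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: place K centroids at randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Step 4: repeat until centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans_sketch(X, k=3)` on well-separated data returns the same kind of `labels` array that `km.labels_` gives in sklearn.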
Chapter 2 of 4
02
The Elbow Method
How do you choose K? The Elbow Method plots inertia vs K and finds the point where adding more clusters gives diminishing returns.
Choosing K with the Elbow Method
Run KMeans for K=1 to K=10. Plot inertia for each K. The optimal K is where the curve bends sharply — like an elbow.
📈
Why Inertia Decreases
More clusters always reduces inertia — in the extreme, K=N means every point is its own cluster with inertia=0. The Elbow finds the K where the improvement stops being worth the complexity.
inertias = []
K_range = range(1, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

import matplotlib.pyplot as plt
plt.plot(K_range, inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()   # look for the bend/elbow in the curve
📚Industry standard: The Elbow Method is a standard first check across data science teams for choosing K before deploying a clustering model to production.
elbow_method.py
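The extreme case mentioned above (K=N gives inertia of 0) is easy to verify directly. A quick check on a small hand-made dataset, chosen here purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.],
              [6., 5.], [5., 6.], [10., 0.], [11., 0.]])

for k in (1, 4, len(X)):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    print(k, round(km.inertia_, 3))
# inertia shrinks as K grows; at K = N each point is its own
# centroid, so within-cluster variance hits exactly zero
```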
Chapter 3 of 4
03
PCA
Principal Component Analysis. Reduce 100 features to 2 or 3 while preserving the maximum possible variance. Essential for visualisation and preprocessing.
PCA: Dimensionality Reduction
PCA finds new axes — principal components — that capture maximum variance. Each component is a linear combination of the original features.
🔆
explained_variance_ratio_
The most important attribute. Tells you what fraction of total data variance each component captures. PC1 captures the most, PC2 the second most, and so on. Select enough components to explain 90-95% of variance.
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)  # X: (n,100) → X_2d: (n,2)

print(pca.explained_variance_ratio_)
# [0.62, 0.18] → PC1 explains 62%, PC2 explains 18%
# Together: 80% of all variance retained in 2D
💡Rule of thumb: Choose n_components that captures 90-95% of explained variance. Use np.cumsum(pca.explained_variance_ratio_) to find the exact threshold.
pca_reduction.py
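The `np.cumsum` trick from the tip above can be turned into a concrete component count. A sketch using sklearn's bundled digits dataset (64 features) as a stand-in for high-dimensional data:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data              # shape (1797, 64)

pca = PCA().fit(X)                  # keep all components to inspect ratios
cum = np.cumsum(pca.explained_variance_ratio_)
n_95 = int(np.argmax(cum >= 0.95)) + 1   # first component count crossing 95%
print(f"{n_95} of {X.shape[1]} components retain 95% of variance")
```

In practice `PCA(n_components=0.95)` does this selection for you; the explicit cumsum is useful when you want to plot the curve and justify the threshold.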
Chapter 4 of 4
04
Full Unsupervised Pipeline
Combine PCA + KMeans + Elbow into one complete unsupervised ML pipeline — a workflow used widely across data science teams.
Complete Unsupervised Pipeline
unsupervised_pipeline.py · Production Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import numpy as np

# Step 1: Scale (mandatory before PCA and KMeans)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: PCA to retain 95% variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Features: {X.shape[1]} → {X_pca.shape[1]}")

# Step 3: Elbow to find optimal K
inertias = [KMeans(n_clusters=k, random_state=42).fit(X_pca).inertia_
            for k in range(1, 11)]
# Inspect the plot, pick K=4 at the elbow
km = KMeans(n_clusters=4, random_state=42)
labels = km.fit_predict(X_pca)
print(f"Segments: {dict(zip(*np.unique(labels, return_counts=True)))}")
This 3-step pipeline — Scale → PCA → KMeans — is a standard unsupervised workflow, used at companies such as Netflix, Spotify, and Amazon for customer segmentation and product clustering. Memorise this pattern.
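The pipeline runs end-to-end on synthetic data. The `make_blobs` dataset and the parameter choices below are assumptions for illustration, standing in for real customer data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for customer data: 500 samples, 20 features, 4 groups
X, _ = make_blobs(n_samples=500, n_features=20, centers=4, random_state=42)

# Scale → PCA (95% variance) → KMeans
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=0.95).fit_transform(X_scaled)

km = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = km.fit_predict(X_pca)
print(dict(zip(*np.unique(labels, return_counts=True))))  # segment sizes
```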
Lesson Summary
You can now build complete unsupervised ML pipelines. Here is what you have mastered:
🎨
KMeans Clustering
Partition data into K groups. Access labels_, cluster_centers_, inertia_. Use fit_predict() for one-step fit and assign.
📈
Elbow Method
Loop K=1 to 10, plot inertia, find the bend. Always use the Elbow before choosing final K for any clustering project.
🔆
PCA
Reduce dimensionality while preserving variance. Use explained_variance_ratio_ to verify how much information is retained. n_components=0.95 auto-selects enough.
🌟
Full Pipeline
StandardScaler → PCA → KMeans. Always scale first. PCA before clustering improves speed and quality. This is production ML workflow.
🌟
Unsupervised Learning Complete!
You can now cluster data, choose K intelligently, and compress features with PCA. Open the Practical Lab to apply all three algorithms to real datasets.
✓ Video — Done
✎ Practical Lab — Next
❓ Quiz — After Lab
Synkoc IT Services · Bangalore · support@synkoc.com · +91-9019532023