Chapters:
Synkoc AI/ML Internship · Week 3 · Lesson 7 of 13
Intro to
Machine Learning
What ML actually is. How models learn. The complete sklearn workflow. Overfitting vs underfitting, train-test split, and the bias-variance tradeoff — the most important concepts in all of ML.
🧠 What is ML
🎏 How Models Learn
⚖️ sklearn Workflow
📈 Bias-Variance
🧑‍💻
Synkoc Instructor
AI/ML Professional · Bangalore
⏳ ~55 minutes
What is Machine Learning?
Traditional programming: you write the rules. Machine learning: the algorithm finds the rules from data. Same goal, opposite approach. ML is not magic — it is optimisation.
💻
Traditional Programming
You write explicit rules. Data + Rules = Output. Works when you can specify every rule. Fails when rules are too complex to write.
if salary > 50000 and age < 30:
    approved = True
else:
    approved = False
Brittle: every new case needs a new rule
🧠
Machine Learning
You provide labelled examples. The algorithm finds the rules automatically. Data + Labels = Model that knows the rules.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
# Model found the rules itself
model.predict(X_new)
Adapts: learns from new data automatically
🎓
How it learns
The algorithm makes predictions, compares to true labels, measures the error, and adjusts parameters to reduce error. Repeat thousands of times.
prediction = model.predict(X)
error = y_true - prediction
# Adjust model parameters
# Repeat until error is minimal
This is gradient descent at a high level
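The predict-measure-adjust loop can be made concrete. Below is a minimal, hand-rolled sketch of gradient descent fitting y = w·x on made-up toy data (this is illustrative, not sklearn's internal implementation):

```python
# Toy data whose true rule is y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0          # start with a guess for the parameter
lr = 0.01        # learning rate: how big each adjustment is

for _ in range(1000):
    # 1. Predict with the current parameter
    preds = [w * x for x in xs]
    # 2. Measure the error: gradient of mean squared error w.r.t. w
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 3. Adjust the parameter to reduce the error
    w -= lr * grad

print(round(w, 3))  # converges close to the true value 2.0
```

Real models repeat exactly this loop, just with millions of parameters instead of one.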
📊
Why now?
Three things converged: massive datasets (Internet), powerful hardware (GPUs), and better algorithms. All three at once made modern ML possible.
2023: GPT-4 was reportedly trained on
~13 trillion tokens of text
~25,000 A100 GPUs
~$100 million compute budget
Scale makes the difference
Chapter 1 of 4
01
Types of ML
Three families of ML. Supervised, Unsupervised, and Reinforcement Learning. Each solves a different type of problem. You will implement all three this week.
Three Types of Machine Learning
Almost every ML algorithm belongs to one of three main families (plus hybrids like semi-supervised learning). The type determines what data you need and what kind of problem you can solve.
📋
Supervised Learning
Training data has labels. The model learns the mapping from features to labels. Most common type in industry.
X = [[hours, age], ...]
y = [1, 0, 1, 1, 0, ...] # labels
model.fit(X, y) # learns mapping
⚡ Week 3 Lesson 2: Regression + Classification
📊
Unsupervised Learning
Training data has NO labels. The model finds hidden structure — clusters, patterns, compressions — on its own.
X = [[hours, age], ...]
# No y! No labels provided
kmeans.fit(X) # finds groups
kmeans.labels_ # discovered clusters
⚡ Week 3 Lesson 3: KMeans Clustering
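A quick runnable sketch of the unsupervised case: KMeans receives points with no labels and discovers the groups itself (the data here is made up, two obvious blobs):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two clear blobs, but we never tell the algorithm which point belongs where
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X)            # no y anywhere

print(kmeans.labels_)    # two groups discovered (the 0/1 ids may be swapped)
```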
🎮
Reinforcement Learning
Agent takes actions in an environment, receives rewards or penalties, and learns a policy to maximise long-term reward.
# No dataset needed
# Agent plays the game
# Reward: +1 win, -1 lose
# Learns optimal strategy
⚡ Used in: robotics, game AI, AlphaGo
🔍
Semi-supervised
Small labelled dataset + large unlabelled dataset. Learn from both. Very practical for real projects where labelling is expensive.
# Few expensive labels
# Many cheap unlabelled examples
# Use both for training
# Common in NLP and vision
⚡ Used when labelling is expensive
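sklearn ships a simple semi-supervised wrapper, `SelfTrainingClassifier`: unlabelled rows are marked with `-1` and the model learns from both kinds. A minimal sketch on synthetic data (the 50-label cutoff is an arbitrary illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=200, random_state=42)
y_partial = y.copy()
y_partial[50:] = -1              # pretend only 50 labels were affordable

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)          # trains on labelled + unlabelled rows

print(model.score(X, y))         # scored against the full ground truth
```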
Chapter 2 of 4
02
The sklearn Workflow
Five steps. Every ML model in scikit-learn follows the exact same pattern. Learn it once, apply it to any algorithm. This is the most valuable pattern in all of applied ML.
The 5-Step sklearn Workflow
Every sklearn model — LinearRegression, RandomForest, KMeans, SVM — follows the identical 5-step pattern. Master this pattern and you know how to use any sklearn algorithm.
📚
1. Import
from sklearn.X import Model
📈
2. Split
train_test_split(X, y, test_size=0.2)
🎓
3. Train
model.fit(X_train, y_train)
🔮
4. Predict
model.predict(X_test)
📉
5. Evaluate
accuracy_score(y_test, y_pred)
🤖
The Universal sklearn Pattern
This exact 5-step pattern works for LinearRegression, LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, KNeighborsClassifier, SVC, and every other sklearn estimator. Change the import on Step 1 and everything else stays identical. This is why sklearn is the most widely used ML library in the world.
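The five steps run end-to-end in a few lines. A minimal sketch on sklearn's built-in iris dataset (the choice of DecisionTreeClassifier is arbitrary; swap the import in Step 1 and nothing else changes):

```python
from sklearn.tree import DecisionTreeClassifier          # 1. Import
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(     # 2. Split
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                              # 3. Train

y_pred = model.predict(X_test)                           # 4. Predict

print(accuracy_score(y_test, y_pred))                    # 5. Evaluate
```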
Chapter 3 of 4
03
Train-Test Split
Why we split data. The critical difference between training performance and real-world performance. Get this wrong and your ML model is worthless.
Train-Test Split
Never evaluate your model on the data it trained on. That is like letting students grade their own exam papers. The model must prove it generalises to data it has never seen.
⚖️
80% train, 20% test — always separate
The training set is what the model learns from. The test set is held back completely and only used for final evaluation. A model that scores 99% on training data and 55% on test data has memorised the training data but learned nothing general. This is called overfitting.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing
    random_state=42   # reproducibility
)

print(X_train.shape) # 80% of data
print(X_test.shape)   # 20% of data
Rule: Train on X_train, y_train. Evaluate ONLY on X_test, y_test. Never look at test data before training is complete. Your test score is your real-world performance estimate.
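You can see why the split matters with a toy demo: an unconstrained decision tree scores perfectly on its own training data but noticeably lower on held-out data (synthetic noisy data; the `flip_y` noise level is an arbitrary illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 randomly corrupts 20% of labels, so perfect rules don't exist
X, y = make_classification(n_samples=300, flip_y=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print(model.score(X_train, y_train))  # 1.0: memorised the training set
print(model.score(X_test, y_test))    # noticeably lower on unseen data
```

Evaluating on the training set alone would have reported a perfect model.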
Chapter 4 of 4
04
Bias-Variance Tradeoff
The single most important concept in all of machine learning. Understanding this tells you why your model fails and exactly how to fix it.
Overfitting vs Underfitting
Every ML failure is one of two things: the model is too simple (underfitting) or too complex (overfitting: it memorised the training data). The goal is the middle ground — just right.
📉
Underfitting (High Bias)
Model is too simple. Cannot capture the true pattern in data. Train accuracy AND test accuracy are both low. The model has not learned enough.
Train accuracy: 62%
Test accuracy: 60%
# Both low = underfitting
# Fix: more complex model
# Fix: more features
⚠ Increase model complexity
😀
Just Right (Sweet Spot)
Model generalises well. High training accuracy AND high test accuracy. Close gap between train and test performance. This is the goal.
Train accuracy: 92%
Test accuracy: 89%
# Small gap = good generalisation
# Keep this model!
✓ Ship this model
📈
Overfitting (High Variance)
Model memorised training data. High train accuracy but low test accuracy. Large gap between the two. Model fails on new data.
Train accuracy: 99%
Test accuracy: 58%
# Huge gap = overfitting
# Fix: regularisation
# Fix: more training data
⚠ Reduce complexity or add data
🔧
How to fix each
Underfitting: use a more complex model, add polynomial features, train longer. Overfitting: regularisation (L1/L2), dropout, a simpler model, more data. Cross-validation does not fix either — it diagnoses them reliably.
from sklearn.model_selection import cross_val_score

# model = any sklearn estimator
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
⚡ cross_val_score gives reliable estimate
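Putting the diagnosis into practice: a hedged sketch comparing a too-simple, a moderate, and an unconstrained decision tree on noisy synthetic data (the depths and noise level are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 injects label noise that an unconstrained tree will memorise
X, y = make_classification(n_samples=500, flip_y=0.2, random_state=42)

means = []
for depth in (1, 4, None):        # too simple, moderate, unconstrained
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(model, X, y, cv=5)
    means.append(scores.mean())
    print(depth, round(scores.mean(), 3))
```

Typically the moderate depth scores best in cross-validation, even though the unconstrained tree fits the training data perfectly.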
Lesson Summary
You have completed the ML foundations. Here is what you now understand deeply:
🧠
What ML Is
Data + Labels = Model that finds rules automatically. Three types: supervised (labels), unsupervised (no labels), reinforcement (rewards).
⚖️
sklearn Workflow
Import → Split → fit() → predict() → score(). Same 5 steps for every algorithm in sklearn. Change the import, keep everything else.
⚖️
Train-Test Split
Always split. 80% train, 20% test. random_state=42 for reproducibility. Never evaluate on training data. Test score = real-world estimate.
📈
Bias-Variance Tradeoff
Underfitting: both scores low, model too simple. Overfitting: huge train-test gap, model memorised data. Goal: small gap, both scores high.
🧠
ML Foundations Complete!
You understand what ML is and the universal sklearn workflow. Complete the Intro Lab. Next: Supervised Learning — training real regression and classification models.
✓ Video — Done
✏ Practical Lab — Next
❓ Quiz — After Lab
Synkoc IT Services · Bangalore · support@synkoc.com · +91-9019532023