Synkoc AI/ML Internship · Week 1 · Lesson 2 of 13
Statistics for Data Science
The mathematics that powers every ML algorithm — Mean, Variance, Probability & Correlation explained clearly with real AI applications.
📊 Mean & Median
📏 Variance
🎲 Probability
🔗 Correlation
🧑‍💻
Synkoc Instructor
AI/ML Professional · Bangalore
⏱ ~55 minutes
🟢 Beginner Friendly
Why Statistics Powers ML
Before a model can learn, we must understand the data it will train on. Statistics gives us the tools to describe, summarise, and find patterns. Every ML algorithm is built on statistical foundations.
📊
Mean
Centre of data. Used in normalisation, loss functions, and model evaluation.
ML: feature scaling, loss
📏
Variance & Std Dev
How spread out data is. The bias-variance tradeoff in ML is built on this concept.
ML: overfitting detection
🎲
Probability
Likelihood of events. Every classification model outputs a probability between 0 and 1.
ML: classification output
🔗
Correlation
How two variables move together. Feature selection relies entirely on correlation analysis.
ML: feature selection
Chapter 1 of 4
01
Mean, Median & Mode
Three ways to measure the centre of data. The foundation of every summary statistic you will compute on a real dataset.
Mean, Median & Mode
Three measures of central tendency — each answers "what is the typical value?" in a different way. Know all three and when to use each one.
Mean (Average)
Add all values, divide by count. Most common measure. Sensitive to outliers — one extreme value can make it misleading.
scores = [70, 80, 90, 60, 85]
mean = 385 / 5 = 77.0
⚡ ML: np.mean(), loss calculation, normalisation
📍
Median (Middle Value)
Sort, pick the middle value. Not affected by outliers — reliable for skewed data like salaries and house prices.
sorted: [60, 70, 80, 85, 90]
median = 80 (middle value)
⚡ ML: robust imputation, skewed features
🏆
Mode (Most Frequent)
The value that appears most often. The only measure that works on categorical data — labels, colours, categories.
labels = ["A","B","A","C","A"]
mode = "A" (appears 3 times)
⚡ ML: class imbalance, majority baseline
💡
When to use which?
Use mean for symmetric numeric data. Use median when outliers exist. Use mode for categories and class labels.
House prices → MEDIAN
Exam scores → MEAN
Eye colour → MODE
⚡ Always check for outliers first
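All three measures are available in Python's built-in statistics module. A minimal sketch using the lesson's example data:

```python
import statistics

scores = [70, 80, 90, 60, 85]
print(statistics.mean(scores))    # 77 — sum 385 divided by 5
print(statistics.median(scores))  # 80 — middle of the sorted list

labels = ["A", "B", "A", "C", "A"]
print(statistics.mode(labels))    # 'A' — mode works on categories too
```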
Chapter 2 of 4
02
Variance & Standard Deviation
Measure how spread out data is. The foundation of the bias-variance tradeoff — one of the most important concepts in ML engineering.
Variance & Standard Deviation
Mean tells you the centre. Variance tells you the spread. Two datasets can have the same mean but completely different variance — changing everything in ML.
📏
Variance = Average Squared Distance from Mean
For each value: subtract the mean and square the result. Average all those squared differences. Square root of variance = standard deviation — in the same units as your original data.
scores = [70, 80, 90, 60, 85]
mean = 77.0
diffs = [-7, 3, 13, -17, 8]
sq_diffs = [49, 9, 169, 289, 64]
variance = (49+9+169+289+64) / 5 = 116.0
std_dev = sqrt(116) ≈ 10.77
ML: High model variance is a symptom of overfitting — the model memorises noise in the training data. Underfitting comes from high bias, not low variance: an overly simple model misses the pattern entirely. Managing this bias-variance tradeoff is a central engineering challenge in machine learning.
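The worked example above translates line for line into Python:

```python
import math

scores = [70, 80, 90, 60, 85]
m = sum(scores) / len(scores)                  # mean = 77.0
sq_diffs = [(x - m) ** 2 for x in scores]      # [49, 9, 169, 289, 64]
variance = sum(sq_diffs) / len(scores)         # 580 / 5 = 116.0
std_dev = math.sqrt(variance)                  # ≈ 10.77
print(variance, round(std_dev, 2))
```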
What Spread Looks Like
Two classes with roughly the same mean score of 75 — but their spreads are completely different, and this changes how a model treats them.
Class A — Low Variance
σ² ≈ 4.6 · Consistent results
Scores: 72, 74, 76, 77, 78
✅ Easy for ML — predictable
Class B — High Variance
σ² = 580 · Highly inconsistent
Scores: 40, 60, 75, 90, 110
⚠️ Harder for ML — noisy data
🤖
Why this matters in your ML projects
In Pandas (Week 2), a column with std dev near zero is useless — every row has almost the same value. A column with very high std dev may need normalisation before training. Always check variance before modelling.
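A quick pre-modelling check along these lines can be sketched in plain Python with statistics.pstdev (the column names and values below are purely illustrative):

```python
import statistics

# hypothetical feature columns — names and values are illustrative only
columns = {
    "id_flag": [1, 1, 1, 1, 1],                              # std ≈ 0 → no signal
    "age":     [22, 25, 31, 28, 24],
    "income":  [20_000, 85_000, 40_000, 120_000, 30_000],    # large spread → scale it
}

for name, values in columns.items():
    sd = statistics.pstdev(values)  # population standard deviation
    if sd < 1e-9:
        print(f"{name}: near-zero std dev — drop this column")
    else:
        print(f"{name}: std dev = {sd:.2f}")
```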
Chapter 3 of 4
03
Probability
The language of uncertainty. Every classification model output is a probability. Understanding this means understanding what your model is actually saying.
Understanding Probability
Probability measures how likely an event is — a number between 0 and 1. Zero means impossible. One means certain. Everything in between is uncertainty.
🎲
P(event) = favourable / total
Count how many ways an event can happen, divide by all possible outcomes. Result is always between 0 and 1. Multiply by 100 for a percentage.
P(pass) = students_who_passed / total
P(pass) = 80 / 100 = 0.80 = 80%

P(email is spam) = 0.95 → 95% likely spam
P(rain tomorrow) = 0.30 → 30% chance
ML Connection: When Logistic Regression outputs 0.87 for "spam", it means 87% probability this is spam. Every classifier outputs probabilities — not just yes or no. You choose a threshold to convert to a decision.
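Converting probabilities into decisions with a threshold takes one line. A sketch with illustrative probability values:

```python
# model-style probability outputs (illustrative values)
probs = [0.95, 0.30, 0.87, 0.05]
threshold = 0.5

decisions = ["spam" if p >= threshold else "not spam" for p in probs]
print(decisions)  # ['spam', 'not spam', 'spam', 'not spam']

# raising the threshold flags only high-confidence predictions
strict = ["spam" if p >= 0.9 else "not spam" for p in probs]
print(strict)     # ['spam', 'not spam', 'not spam', 'not spam']
```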
📋
Conditional Probability
P(A|B) = probability of A given B has happened. Foundation of Naive Bayes classifiers used for spam detection and text classification.
P(spam | contains "prize") = 0.92
P(pass | attendance > 80%) = 0.88
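Conditional probability is just counting within a subset. A tiny hypothetical dataset (the emails and counts are made up for illustration):

```python
# hypothetical email dataset: (contains_prize, is_spam)
emails = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False),
]

# restrict attention to emails containing "prize", then count spam among them
with_prize = [is_spam for has_prize, is_spam in emails if has_prize]
p_spam_given_prize = sum(with_prize) / len(with_prize)
print(p_spam_given_prize)  # 2 of 3 "prize" emails are spam → ≈ 0.67
```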
🔔
Normal Distribution
Much natural data approximately follows a bell curve — values cluster near the mean. Many ML algorithms assume (at least approximately) normally distributed features as input.
68% within 1 std dev
95% within 2 std dev
99.7% within 3 std dev
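The 68-95-99.7 rule can be checked empirically by sampling from a normal distribution with random.gauss (fixed seed so the experiment is repeatable):

```python
import random

random.seed(42)  # fixed seed for a repeatable experiment
samples = [random.gauss(0, 1) for _ in range(10_000)]

for k in (1, 2, 3):
    within = sum(abs(x) < k for x in samples) / len(samples)
    print(f"within {k} std dev: {within:.1%}")  # ≈ 68%, 95%, 99.7%
```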
Probability Analogy
Synkoc Instructor Analogy
"Every morning you check the weather forecast. It says 70% chance of rain. That 70% is a probability — it does not guarantee rain, it tells you how confident the model is based on historical patterns. Machine learning classification works identically. Your spam detector does not say 'this is definitely spam' — it says 'I am 94% confident this is spam based on patterns from thousands of examples'. Probability is how your model expresses confidence. You set a threshold to convert that confidence into a decision."
🤖
In Every Model You Build at Synkoc
In Week 3, model.predict_proba(X_test) returns a probability for every prediction. You choose a threshold — typically 0.5 — above which you classify as positive. Raise it to 0.9 and you only flag high-confidence predictions. Lower it to 0.3 and you cast a wider net. Understanding probability means you can tune this intelligently.
Chapter 4 of 4
04
Correlation
How do two variables move together? Correlation reveals feature relationships — the foundation of feature selection in machine learning.
Understanding Correlation
Correlation measures the strength and direction of the linear relationship between two variables. Pearson r ranges from -1 to +1.
🔗
r = -1 to 0 to +1
r = +1: perfect positive — as X increases, Y increases. r = -1: perfect negative — as X increases, Y decreases. r = 0: no linear relationship. Use |r| for strength regardless of direction.
r = +0.92 → Strong positive (study hours vs score)
r = -0.78 → Strong negative (absences vs grade)
r = +0.12 → Weak — likely noise, consider removing
r = 0.00 → No relationship at all
Feature Selection: Before training, compute r between every feature and the target. As a rule of thumb, keep features with |r| > 0.5 and treat features near 0 as candidates for removal — they likely add noise, not signal. Also remove redundant features that are highly correlated with each other. Remember that r only captures linear relationships: a feature with a strong nonlinear relationship can still score near 0.
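The filter described above can be sketched from scratch — Pearson r is the covariance divided by the product of the standard deviations. The feature columns here are hypothetical example data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of std devs."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

target = [35, 60, 72, 88, 97]           # exam scores
features = {                            # hypothetical feature columns
    "study_hours": [2, 4, 6, 8, 10],
    "shoe_size":   [40, 37, 42, 39, 41],
}
for name, values in features.items():
    r = pearson_r(values, target)
    verdict = "keep" if abs(r) > 0.5 else "drop"
    print(f"{name}: r = {r:+.2f} → {verdict}")
```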
All Concepts Together — Mini EDA Report
mini_eda_synkoc.py — Complete EDA
import math
hours = [2,4,6,8,10,3,7,5,9,1]
scores = [35,60,72,88,97,45,82,68,93,30]
def mean(d): return sum(d)/len(d)
def std(d): m=mean(d); return math.sqrt(mean([(x-m)**2 for x in d]))
print("=== EDA Report ===")
print(f"Hours — Mean:{mean(hours):.1f} | Std:{std(hours):.2f}")
print(f"Scores — Mean:{mean(scores):.1f} | Std:{std(scores):.2f}")
passing = [s for s in scores if s >= 60]
print(f"Pass rate: {len(passing)/len(scores)*100:.1f}%")
This mini EDA computes mean and std dev for both variables and calculates the pass rate as a probability. In Week 2, df.describe() in Pandas produces all these statistics in one line — but now you understand exactly what each number means.
Lesson Summary
You have completed Statistics for Data Science. Here is what you can now do:
📊
Mean, Median, Mode
Calculate all three and know when to use each. Mean for symmetric data, median when outliers exist, mode for categories and class labels.
📏
Variance & Std Dev
Calculate spread from scratch. Understand that the bias-variance tradeoff in ML — overfitting vs underfitting — is built on this exact concept.
🎲
Probability
Understand P(event) = favourable/total. Know that every classifier outputs a probability, and that you choose a threshold to convert it to a decision.
🔗
Correlation
Interpret r from -1 to +1. Use correlation for feature selection. Features with |r| near 0 are noise — remove them before training your ML model.
📊
Week 1 Theory Complete!
Both modules done. Open the Statistics Practical Lab to practise. Complete the lab and quiz, then move on to Week 2: NumPy, Pandas, Data Visualisation & EDA.
✅ Video — Done
✏️ Practical Lab — Next
❓ Quiz — After Lab
Synkoc IT Services · Bangalore · support@synkoc.com · +91-9019532023