Synkoc AI/ML Internship · Week 1 · Lesson 2 of 13
Statistics for Data Science
The mathematics that powers every ML algorithm — Mean, Variance, Probability & Correlation explained clearly with real AI applications.
📊 Mean & Median
📏 Variance
🎲 Probability
🔗 Correlation
🧑‍💻
Synkoc Instructor
AI/ML Professional · Bangalore
⏱ ~55 minutes
🟢 Beginner Friendly
Why Statistics Powers ML
Before a model can learn, we must understand the data it will train on. Statistics gives us the tools to describe, summarise, and find patterns. Every ML algorithm is built on statistical foundations.
📊
Mean
Centre of data. Used in normalisation, loss functions, and model evaluation.
ML: feature scaling, loss
📏
Variance & Std Dev
How spread out data is. The bias-variance tradeoff in ML is built on this concept.
ML: overfitting detection
🎲
Probability
Likelihood of events. Every classification model outputs a probability between 0 and 1.
ML: classification output
🔗
Correlation
How two variables move together. Feature selection relies entirely on correlation analysis.
ML: feature selection
Chapter 1 of 4
01
Mean, Median & Mode
Three ways to measure the centre of data. The foundation of every summary statistic you will compute on a real dataset.
Mean, Median & Mode
Three measures of central tendency — each answers "what is the typical value?" in a different way. Know all three and when to use each one.
Mean (Average)
Add all values, divide by count. Most common measure. Sensitive to outliers — one extreme value can make it misleading.
scores = [70, 80, 90, 60, 85]
mean = 385 / 5 = 77.0
⚡ ML: np.mean(), loss calculation, normalisation
📍
Median (Middle Value)
Sort, pick the middle value. Not affected by outliers — reliable for skewed data like salaries and house prices.
sorted: [60, 70, 80, 85, 90]
median = 80 (middle value)
⚡ ML: robust imputation, skewed features
🏆
Mode (Most Frequent)
The value that appears most often. The only measure that works on categorical data — labels, colours, categories.
labels = ["A","B","A","C","A"]
mode = "A" (appears 3 times)
⚡ ML: class imbalance, majority baseline
💡
When to use which?
Use mean for symmetric numeric data. Use median when outliers exist. Use mode for categories and class labels.
House prices → MEDIAN
Exam scores → MEAN
Eye colour → MODE
⚡ Always check for outliers first
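All three measures are available in Python's built-in statistics module. A minimal sketch using the lesson's example data:

```python
import statistics

scores = [70, 80, 90, 60, 85]
print(statistics.mean(scores))    # 77 — sum 385 divided by 5
print(statistics.median(scores))  # 80 — middle of the sorted list

labels = ["A", "B", "A", "C", "A"]
print(statistics.mode(labels))    # 'A' — mode works on categories too
```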
Chapter 2 of 4
02
Variance & Standard Deviation
Measure how spread out data is. The foundation of the bias-variance tradeoff — one of the most important concepts in ML engineering.
Variance & Standard Deviation
Mean tells you the centre. Variance tells you the spread. Two datasets can have the same mean but completely different variance — changing everything in ML.
📏
Variance = Average Squared Distance from Mean
For each value: subtract the mean and square the result. Average all those squared differences. Square root of variance = standard deviation — in the same units as your original data.
scores = [70, 80, 90, 60, 85]
mean = 77.0
diffs = [-7, 3, 13, -17, 8]
sq_diffs = [49, 9, 169, 289, 64]
variance = (49+9+169+289+64) / 5 = 116.0
std_dev = sqrt(116) ≈ 10.77
ML: High model variance is a symptom of overfitting — the model memorises noise in the training data. Underfitting comes from high bias, not low variance: an overly simple model misses the pattern entirely. Managing this bias-variance tradeoff is a central engineering challenge in machine learning.
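The worked example above translates line for line into Python:

```python
import math

scores = [70, 80, 90, 60, 85]
m = sum(scores) / len(scores)                  # mean = 77.0
sq_diffs = [(x - m) ** 2 for x in scores]      # [49, 9, 169, 289, 64]
variance = sum(sq_diffs) / len(scores)         # 580 / 5 = 116.0
std_dev = math.sqrt(variance)                  # ≈ 10.77
print(variance, round(std_dev, 2))
```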
What Spread Looks Like
Two classes with roughly the same mean score of 75 — but their spreads are completely different, and this changes how a model treats them.
Class A — Low Variance
σ² ≈ 4.6 · Consistent results
Scores: 72, 74, 76, 77, 78
✅ Easy for ML — predictable
Class B — High Variance
σ² = 580 · Highly inconsistent
Scores: 40, 60, 75, 90, 110
⚠️ Harder for ML — noisy data
🤖
Why this matters in your ML projects
In Pandas (Week 2), a column with std dev near zero is useless — every row has almost the same value. A column with very high std dev may need normalisation before training. Always check variance before modelling.
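A quick pre-modelling check along these lines can be sketched in plain Python with statistics.pstdev (the column names and values below are purely illustrative):

```python
import statistics

# hypothetical feature columns — names and values are illustrative only
columns = {
    "id_flag": [1, 1, 1, 1, 1],                              # std ≈ 0 → no signal
    "age":     [22, 25, 31, 28, 24],
    "income":  [20_000, 85_000, 40_000, 120_000, 30_000],    # large spread → scale it
}

for name, values in columns.items():
    sd = statistics.pstdev(values)  # population standard deviation
    if sd < 1e-9:
        print(f"{name}: near-zero std dev — drop this column")
    else:
        print(f"{name}: std dev = {sd:.2f}")
```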
Chapter 3 of 4
03
Probability
The language of uncertainty. Every classification model output is a probability. Understanding this means understanding what your model is actually saying.
Understanding Probability
Probability measures how likely an event is — a number between 0 and 1. Zero means impossible. One means certain. Everything in between is uncertainty.
🎲
P(event) = favourable / total
Count how many ways an event can happen, divide by all possible outcomes. Result is always between 0 and 1. Multiply by 100 for a percentage.
P(pass) = students_who_passed / total
P(pass) = 80 / 100 = 0.80 = 80%

P(email is spam) = 0.95 → 95% likely spam
P(rain tomorrow) = 0.30 → 30% chance
ML Connection: When Logistic Regression outputs 0.87 for "spam", it means 87% probability this is spam. Every classifier outputs probabilities — not just yes or no. You choose a threshold to convert to a decision.
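Converting probabilities into decisions with a threshold takes one line. A sketch with illustrative probability values:

```python
# model-style probability outputs (illustrative values)
probs = [0.95, 0.30, 0.87, 0.05]
threshold = 0.5

decisions = ["spam" if p >= threshold else "not spam" for p in probs]
print(decisions)  # ['spam', 'not spam', 'spam', 'not spam']

# raising the threshold flags only high-confidence predictions
strict = ["spam" if p >= 0.9 else "not spam" for p in probs]
print(strict)     # ['spam', 'not spam', 'not spam', 'not spam']
```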
📋
Conditional Probability
P(A|B) = probability of A given B has happened. Foundation of Naive Bayes classifiers used for spam detection and text classification.
P(spam | contains "prize") = 0.92
P(pass | attendance > 80%) = 0.88
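Conditional probability is just counting within a subset. A tiny hypothetical dataset (the emails and counts are made up for illustration):

```python
# hypothetical email dataset: (contains_prize, is_spam)
emails = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False),
]

# restrict attention to emails containing "prize", then count spam among them
with_prize = [is_spam for has_prize, is_spam in emails if has_prize]
p_spam_given_prize = sum(with_prize) / len(with_prize)
print(p_spam_given_prize)  # 2 of 3 "prize" emails are spam → ≈ 0.67
```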
🔔
Normal Distribution
Much natural data approximately follows a bell curve — values cluster near the mean. Many ML algorithms assume (at least approximately) normally distributed features as input.
68% within 1 std dev
95% within 2 std dev
99.7% within 3 std dev
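The 68-95-99.7 rule can be checked empirically by sampling from a normal distribution with random.gauss (fixed seed so the experiment is repeatable):

```python
import random

random.seed(42)  # fixed seed for a repeatable experiment
samples = [random.gauss(0, 1) for _ in range(10_000)]

for k in (1, 2, 3):
    within = sum(abs(x) < k for x in samples) / len(samples)
    print(f"within {k} std dev: {within:.1%}")  # ≈ 68%, 95%, 99.7%
```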
Probability Analogy
Synkoc Instructor Analogy
"Every morning you check the weather forecast. It says 70% chance of rain. That 70% is a probability — it does not guarantee rain, it tells you how confident the model is based on historical patterns. Machine learning classification works identically. Your spam detector does not say 'this is definitely spam' — it says 'I am 94% confident this is spam based on patterns from thousands of examples'. Probability is how your model expresses confidence. You set a threshold to convert that confidence into a decision."
🤖
In Every Model You Build at Synkoc
In Week 3, model.predict_proba(X_test) returns a probability for every prediction. You choose a threshold — typically 0.5 — above which you classify as positive. Raise it to 0.9 and you only flag high-confidence predictions. Lower it to 0.3 and you cast a wider net. Understanding probability means you can tune this intelligently.
Chapter 4 of 4
04
Correlation
How do two variables move together? Correlation reveals feature relationships — the foundation of feature selection in machine learning.
Understanding Correlation
Correlation measures the strength and direction of the linear relationship between two variables. Pearson r ranges from -1 to +1.
🔗
r = -1 to 0 to +1
r = +1: perfect positive — as X increases, Y increases. r = -1: perfect negative — as X increases, Y decreases. r = 0: no linear relationship. Use |r| for strength regardless of direction.
r = +0.92 → Strong positive (study hours vs score)
r = -0.78 → Strong negative (absences vs grade)
r = +0.12 → Weak — likely noise, consider removing
r = 0.00 → No relationship at all
Feature Selection: Before training, compute r between every feature and the target. As a rule of thumb, keep features with |r| > 0.5 and treat features near 0 as candidates for removal — they likely add noise, not signal. Also remove redundant features that are highly correlated with each other. Remember that r only captures linear relationships: a feature with a strong nonlinear relationship can still score near 0.
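The filter described above can be sketched from scratch — Pearson r is the covariance divided by the product of the standard deviations. The feature columns here are hypothetical example data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of std devs."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

target = [35, 60, 72, 88, 97]           # exam scores
features = {                            # hypothetical feature columns
    "study_hours": [2, 4, 6, 8, 10],
    "shoe_size":   [40, 37, 42, 39, 41],
}
for name, values in features.items():
    r = pearson_r(values, target)
    verdict = "keep" if abs(r) > 0.5 else "drop"
    print(f"{name}: r = {r:+.2f} → {verdict}")
```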
All Concepts Together — Mini EDA Report
mini_eda_synkoc.py — Complete EDA
import math
hours = [2,4,6,8,10,3,7,5,9,1]
scores = [35,60,72,88,97,45,82,68,93,30]
def mean(d): return sum(d)/len(d)
def std(d): m=mean(d); return math.sqrt(mean([(x-m)**2 for x in d]))
print("=== EDA Report ===")
print(f"Hours — Mean:{mean(hours):.1f} | Std:{std(hours):.2f}")
print(f"Scores — Mean:{mean(scores):.1f} | Std:{std(scores):.2f}")
passing = [s for s in scores if s >= 60]
print(f"Pass rate: {len(passing)/len(scores)*100:.1f}%")
This mini EDA computes mean and std dev for both variables and calculates the pass rate as a probability. In Week 2, df.describe() in Pandas produces all these statistics in one line — but now you understand exactly what each number means.
Lesson Summary
You have completed Statistics for Data Science. Here is what you can now do:
📊
Mean, Median, Mode
Calculate all three and know when to use each. Mean for symmetric data, median when outliers exist, mode for categories and class labels.
📏
Variance & Std Dev
Calculate spread from scratch. Understand that the bias-variance tradeoff in ML — overfitting vs underfitting — is built on this exact concept.
🎲
Probability
Understand P(event) = favourable/total. Know that every classifier outputs a probability, and that you choose a threshold to convert it to a decision.
🔗
Correlation
Interpret r from -1 to +1. Use correlation for feature selection. Features with |r| near 0 are noise — remove them before training your ML model.
📊
Week 1 Theory Complete!
Both modules done. Open the Statistics Practical Lab to practise. Complete the lab and quiz, then move on to Week 2: NumPy, Pandas, Data Visualisation & EDA.
✅ Video — Done
✏️ Practical Lab — Next
❓ Quiz — After Lab
Synkoc IT Services · Bangalore · support@synkoc.com · +91-9019532023