Synkoc AI/ML Internship · Week 2 · Lessons 5 & 6 of 13
Data Visualisation & EDA
A model you cannot explain is a model you cannot trust. Visualisation reveals hidden patterns before you train. EDA is how every expert starts a project.
📈 Matplotlib
📇 Seaborn
📊 EDA Workflow
🔍 Correlation
🧑‍💻
Synkoc Instructor
AI/ML Professional · Bangalore
⏳ ~55 minutes
✌ Intermediate
Why Visualise Before Training?
Numbers alone hide patterns. Charts reveal outliers, class imbalance, feature distributions, and correlations. Experienced practitioners, including top Kaggle competitors, typically start with EDA. You should too.
📈
Spot Outliers
A histogram shows if your data has extreme values that will distort model training.
Outlier detection
🔗
Correlation
A heatmap shows which features are linearly correlated with the target and which show little linear relationship and may be candidates for removal.
Feature selection
Distribution
Is your data normally distributed or skewed? This determines which preprocessing to apply.
Preprocessing choice
Class Balance
A bar chart of your target shows if classes are imbalanced, which affects model evaluation.
Metric selection
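The four checks above can also be backed by numbers before any chart is drawn. The following is a minimal sketch, not a definitive recipe: the helper name pre_training_checks, the toy columns income and label, and the choice of Tukey's 1.5 × IQR rule for outliers are all assumptions made for illustration.

```python
import pandas as pd


def pre_training_checks(df, feature, target):
    """Summarise outliers, skew, and class balance for one feature/target pair."""
    values = df[feature].dropna()

    # Tukey's rule (an assumed choice): flag points more than 1.5 * IQR
    # outside the first and third quartiles.
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

    return {
        "n_outliers": int(is_outlier.sum()),
        "skew": float(values.skew()),  # > 0 means a long right tail
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }


# Hypothetical toy data: one extreme income value and an 80/20 class split.
toy = pd.DataFrame({
    "income": [30, 32, 35, 31, 33, 34, 36, 29, 30, 250],
    "label": ["no"] * 8 + ["yes"] * 2,
})
report = pre_training_checks(toy, "income", "label")
print(report)
```

A non-zero outlier count or a strongly skewed feature is a prompt to look at the histogram. An uneven class_share suggests that plain accuracy may be a misleading metric.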
Chapter 1 of 4
01
Matplotlib Fundamentals
The foundational plotting library. Line, bar, scatter, histogram. Axes, labels, titles — the basics done right.
Matplotlib — Core Charts
Four chart types you will use in every ML project. Each reveals a different aspect of your data.
📈
Line Chart
Best for trends over time. Training loss curve, accuracy per epoch, sales over months. Shows direction of change.
import matplotlib.pyplot as plt
# x, y: equal-length sequences, e.g. epoch numbers and loss values
plt.figure(figsize=(8,4))
plt.plot(x, y, marker='o')
plt.title('Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
⚡ ML: plot training curves
📇
Bar Chart
Best for comparing categories. Class distribution, feature importance, accuracy per model. Shows quantity by group.
plt.bar(categories, values)
plt.title('Class Distribution')
plt.show()
⚡ ML: class imbalance check
Scatter Plot
Best for relationships between two variables. Shows correlation visually. Add a trend line with polyfit.
plt.scatter(x, y, alpha=0.7)
# add a linear trend line (needs: import numpy as np; x must be a NumPy array)
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x + b, 'r--')
plt.show()
⚡ ML: feature correlation check
📊
Histogram
Best for distribution of a single variable. Shows skew, modality, and spread. Essential for every numeric feature.
plt.hist(data, bins=20,
         color='steelblue',
         edgecolor='white')
plt.axvline(data.mean(), c='red')  # mark the mean; data is a NumPy array or pandas Series
plt.show()
⚡ ML: normality check
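To see the four chart types side by side, they can be drawn on one figure. This is a sketch on synthetic data: the function name four_chart_overview, the random values, and the class counts are made up for illustration and do not come from a real dataset.

```python
import matplotlib.pyplot as plt
import numpy as np


def four_chart_overview(seed=0):
    """Draw line, bar, scatter, and histogram charts on a 2x2 grid."""
    rng = np.random.default_rng(seed)

    # Synthetic stand-ins for real project data (assumptions).
    epochs = np.arange(1, 21)
    loss = 1.0 / epochs + rng.normal(0, 0.01, epochs.size)
    x = rng.uniform(0, 10, 100)
    y = 2.0 * x + rng.normal(0, 2.0, 100)
    scores = rng.normal(60, 12, 500)

    fig, axes = plt.subplots(2, 2, figsize=(10, 7))

    # Line chart: trend over time.
    axes[0, 0].plot(epochs, loss, marker="o")
    axes[0, 0].set(title="Training Loss", xlabel="Epoch", ylabel="Loss")

    # Bar chart: quantity per category.
    axes[0, 1].bar(["A", "B", "C"], [120, 45, 15])
    axes[0, 1].set(title="Class Distribution", ylabel="Count")

    # Scatter plot with a least-squares trend line.
    axes[1, 0].scatter(x, y, alpha=0.7)
    m, b = np.polyfit(x, y, 1)
    xs = np.sort(x)
    axes[1, 0].plot(xs, m * xs + b, "r--")
    axes[1, 0].set(title="Scatter with Trend Line", xlabel="x", ylabel="y")

    # Histogram with the mean marked in red.
    axes[1, 1].hist(scores, bins=20, color="steelblue", edgecolor="white")
    axes[1, 1].axvline(scores.mean(), c="red")
    axes[1, 1].set(title="Score Distribution", xlabel="Score")

    fig.tight_layout()
    return fig, m


fig, slope = four_chart_overview()
```

Call plt.show() to display the figure in a notebook or script, or fig.savefig("overview.png") to write it to disk.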
Chapter 2 of 4
02
Seaborn
Statistical visualisation on top of Matplotlib. Beautiful charts with one line. Heatmaps, pair plots, distribution plots.
Seaborn — Statistical Charts
Seaborn wraps Matplotlib with beautiful defaults and statistical chart types that are impractical to build from scratch.
📇
Key Seaborn charts for ML
These four charts answer the most important questions about your data before building any model. Run all four in your EDA notebook on every new dataset.
import seaborn as sns

sns.histplot(df['Age'], kde=True) # distribution + curve
sns.boxplot(x='Grade', y='Score', data=df) # outliers by group
sns.heatmap(df.corr(numeric_only=True), annot=True) # correlation matrix (numeric columns only)
sns.pairplot(df) # all pairs at once
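ML Pro Tip: Run sns.heatmap(df.corr(numeric_only=True), annot=True) on every dataset. As a rule of thumb, features with absolute correlation above 0.7 with the target are strong linear predictors. Features near 0 have little linear relationship with the target, but check for nonlinear patterns before dropping them.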
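The heatmap can be complemented by a ranked list of each feature's correlation with the target. The sketch below assumes a helper called rank_by_target_correlation and a synthetic table with columns hours, noise, score, and name. None of these come from the lesson dataset.

```python
import numpy as np
import pandas as pd


def rank_by_target_correlation(df, target):
    """Return Pearson correlations with the target, strongest first."""
    # numeric_only=True skips text columns, which would otherwise raise
    # an error in recent pandas versions.
    corr = df.corr(numeric_only=True)[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic data (assumption): score depends on hours, not on noise.
rng = np.random.default_rng(42)
n = 200
hours = rng.uniform(0, 10, n)
toy = pd.DataFrame({
    "hours": hours,
    "noise": rng.normal(0, 1, n),
    "score": 5 * hours + rng.normal(0, 5, n),
    "name": ["student"] * n,  # non-numeric column, ignored by corr
})
ranking = rank_by_target_correlation(toy, "score")
print(ranking)
```

Treat the 0.7 threshold as a starting point for inspection rather than a hard rule. Pearson correlation measures only linear association.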
Chapter 3 of 4
03
EDA Workflow
A systematic process for understanding a new dataset from scratch. The exact workflow used by professional data scientists.
The 4-Step EDA Workflow
Work through this sequence before building any model. Skipping EDA tends to produce models that fail in ways that are hard to debug.
🔍
1. Shape & Types
df.shape, df.info(), df.dtypes. How many rows, columns, what types? Any object columns that need encoding?
Always first step
🧹
2. Missing Values
df.isnull().sum(). Which columns have missing data? How much? The answers decide whether to fill or drop.
Preprocessing plan
📊
3. Distributions
df.describe(). Histogram per feature. Are distributions normal? Are there outliers? Extreme max values?
Feature engineering
🔗
4. Correlations
df.corr(numeric_only=True) heatmap. Which features correlate with the target? Are any features redundant (highly correlated with each other)? Drop uninformative features before training.
Feature selection
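The steps above can be bundled into one helper that returns the numbers behind each step. This is a sketch under stated assumptions: quick_eda is a hypothetical name, and the toy table reuses the column names Age, Score, and Grade from the Seaborn examples with made-up values.

```python
import numpy as np
import pandas as pd


def quick_eda(df):
    """Collect the numeric summaries behind each EDA step."""
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,                            # step 1: rows, columns
        "dtypes": df.dtypes.astype(str).to_dict(),    # step 1: column types
        "missing": df.isnull().sum().to_dict(),       # step 2: gaps per column
        "describe": numeric.describe(),               # step 3: spread, min, max
        "corr": numeric.corr(),                       # step 4: linear relations
    }


# Made-up toy table with one missing score.
toy = pd.DataFrame({
    "Age": [18, 19, 20, 21, 22],
    "Score": [55.0, np.nan, 70.0, 82.0, 90.0],
    "Grade": ["C", "B", "B", "A", "A"],
})
summary = quick_eda(toy)
print(summary["shape"])
print(summary["missing"])
```

The summaries do not replace the charts. They give a quick first pass, and the histograms and heatmap then show what the numbers cannot.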
Synkoc Instructor Analogy
"EDA is like a doctor's check-up before prescribing medicine. You would never prescribe medicine without examining the patient first. You should never train a model without examining your data first. The check-up takes 30 minutes. It saves you days of debugging a bad model."
Chapter 4 of 4
04
EDA in Practice
Putting it all together — a complete EDA example on a real dataset using both Matplotlib and Seaborn.
Complete EDA Workflow — Exam Score Dataset
eda_complete.py
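# Step 1: Load & Shape
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv('students.csv')
print(df.shape)
print(df.dtypes)
# Step 2: Missing values
print(df.isnull().sum())
df['Score'] = df['Score'].fillna(df['Score'].median())  # fill gaps with the median
# Step 3: Distributions
sns.histplot(df['Score'], kde=True)
plt.show()  # finish this figure before drawing the next one
# Step 4: Correlations (numeric columns only)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()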
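The script expects a students.csv file. If you do not have one yet, a synthetic stand-in lets you practise the same steps. Everything in this sketch is an assumption: the make_students helper, the StudyHours column, the grade cut-offs, and the 10% missing rate are invented for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd


def make_students(n=100, seed=0):
    """Build a hypothetical student table with some missing scores."""
    rng = np.random.default_rng(seed)
    hours = rng.uniform(0, 10, n)  # hypothetical feature
    score = np.clip(40 + 5 * hours + rng.normal(0, 6, n), 0, 100)
    df = pd.DataFrame({
        "Age": rng.integers(17, 23, n),
        "StudyHours": hours,
        "Score": score,
    })
    # Assumed grade bands: C up to 60, B up to 80, A above 80.
    df["Grade"] = pd.cut(df["Score"], bins=[-0.1, 60, 80, 100],
                         labels=["C", "B", "A"]).astype(str)
    # Blank out 10% of the scores so the missing-value step has work to do.
    missing_rows = rng.choice(n, size=n // 10, replace=False)
    df.loc[missing_rows, "Score"] = np.nan
    return df


df = make_students()
n_missing = int(df["Score"].isnull().sum())            # step 2: count the gaps
df["Score"] = df["Score"].fillna(df["Score"].median())  # step 2: median fill
corr = df.corr(numeric_only=True)                       # step 4: numeric columns only
print(df.shape, n_missing)
print(corr.round(2))
```

From here, sns.histplot(df['Score'], kde=True) and sns.heatmap(corr, annot=True, cmap='coolwarm') can be run exactly as in the script.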
Lesson Summary
You have completed Data Visualisation and EDA. Here is what you can now do:
📈
Matplotlib
Create line, bar, scatter, and histogram charts. Add titles, labels, and trend lines. figsize controls dimensions.
📇
Seaborn
Use histplot, boxplot, heatmap, and pairplot for statistical visualisation. annot=True shows correlation values.
🔍
EDA Workflow
4-step process: shape, missing values, distributions, correlations. Run before every model you build.
🔗
Correlation Analysis
Identify strong linear predictors with abs(r) above 0.7 as a rule of thumb. Treat features near 0 as candidates for removal after checking for nonlinear patterns. Use the results to guide feature selection.
🌟
Week 2 Complete!
NumPy, Pandas, Visualisation, and EDA are done. Open the Practical Lab, complete the exercises, and take the Quiz. Week 3 brings Machine Learning algorithms!
✓ Video — Done
✏ Practical Lab — Next
❓ Quiz — After Lab
Synkoc IT Services · Bangalore · support@synkoc.com