Data Visualisation & EDA | Week 2 | Synkoc AI/ML Internship

Synkoc AI/ML Internship · Week 2 · Lessons 5 & 6 of 13

Data Visualisation & EDA

A model you cannot explain is a model you cannot trust. Visualisation reveals hidden patterns before you train. EDA is how every expert starts a project.

📈 Matplotlib

📇 Seaborn

📊 EDA Workflow

🔍 Correlation

🧑‍💻

Synkoc Instructor

AI/ML Professional · Bangalore

⏳ ~55 minutes
✌ Intermediate

Why Visualise Before Training?

Numbers alone hide patterns. Charts reveal outliers, class imbalance, feature distributions, and correlations. Experienced practitioners, including top Kaggle competitors, typically start with EDA. You should too.

📈

Spot Outliers

A histogram shows if your data has extreme values that will distort model training.

Outlier detection
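What the histogram shows visually can also be checked numerically. The sketch below uses the common 1.5 × IQR rule, which is one convention among several and not something this lesson prescribes. The function name `iqr_outliers` and the score values are hypothetical.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask for points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical exam scores with one extreme value
scores = np.array([52, 55, 58, 60, 61, 63, 65, 67, 70, 190])
mask = iqr_outliers(scores)
print(scores[mask])  # the values flagged as outliers
```

A flagged point is a prompt to investigate, not an instruction to delete: it may be a data-entry error or a genuine extreme case.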

🔗

Correlation

A heatmap shows which features move with the target and which show little linear relationship to it, making them candidates for removal.

Feature selection

⚖

Distribution

Is your data normally distributed or skewed? This determines which preprocessing to apply.

Preprocessing choice
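Skew can be measured as well as seen. The sketch below assumes a right-skewed numeric feature and uses a log transform as one possible response. The `income` values are made up for illustration, and a log transform is only one of several options.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature (a few very large values)
income = pd.Series([20, 22, 25, 27, 30, 32, 35, 40, 120, 300], dtype=float)

skew_raw = income.skew()            # sample skewness; 0 means symmetric
skew_log = np.log1p(income).skew()  # skewness after log(1 + x)

print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A clearly positive skew suggests a long right tail. If the transformed skew is closer to 0, the transform has made the feature more symmetric.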

⚖

Class Balance

A bar chart of your target shows if classes are imbalanced, which affects model evaluation.

Metric selection
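To see why imbalance affects evaluation, it helps to compute the class shares directly. This is a minimal sketch on a hypothetical 90/10 target; the column name `Result` and the labels are assumptions.

```python
import pandas as pd

# Hypothetical binary target with a 90/10 split
target = pd.Series(["pass"] * 90 + ["fail"] * 10, name="Result")

counts = target.value_counts()                # rows per class
shares = target.value_counts(normalize=True)  # fraction per class
print(counts)
print(shares.round(2))

# Accuracy of a model that always predicts the majority class
baseline_accuracy = shares.max()
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

A model that learns nothing still reaches this baseline accuracy, which is why precision, recall, or F1 are usually more informative on imbalanced targets. Calling `counts.plot(kind="bar")` draws the bar chart described above.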

Chapter 1 of 4

Matplotlib Fundamentals

The foundational plotting library. Line, bar, scatter, histogram. Axes, labels, titles — the basics done right.

Matplotlib — Core Charts

Four chart types you will use in every ML project. Each reveals a different aspect of your data.

📈

Line Chart

Best for trends over time. Training loss curve, accuracy per epoch, sales over months. Shows direction of change.

import matplotlib.pyplot as plt

# x: epoch numbers, y: loss at each epoch
plt.figure(figsize=(8,4))
plt.plot(x, y, marker='o')
plt.title('Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

⚡ ML: plot training curves

📇

Bar Chart

Best for comparing categories. Class distribution, feature importance, accuracy per model. Shows quantity by group.

plt.bar(categories, values)
plt.title('Class Distribution')
plt.show()

⚡ ML: class imbalance check

◈

Scatter Plot

Best for relationships between two variables. Shows correlation visually. Add a trend line with polyfit.

import numpy as np

plt.scatter(x, y, alpha=0.7)
# add trend line (x and y must be NumPy arrays)
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x + b, 'r--')
plt.show()

⚡ ML: feature correlation check
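The trend line from polyfit is a least-squares straight line, so its slope and intercept can be inspected directly alongside the correlation coefficient. The sketch below uses made-up, exactly linear data so the result is easy to verify by eye.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical hours studied
y = 2.0 * x + 1.0                        # hypothetical scores, exactly linear

m, b = np.polyfit(x, y, 1)   # degree-1 fit returns [slope, intercept]
r = np.corrcoef(x, y)[0, 1]  # Pearson correlation between x and y
trend = m * x + b            # y-values of the fitted line at each x

print(f"slope={m:.2f}, intercept={b:.2f}, r={r:.2f}")
```

On real data the points scatter around the line, and `r` tells you how tightly. The `trend` array is what gets passed to `plt.plot(x, trend, 'r--')`.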

📊

Histogram

Best for distribution of a single variable. Shows skew, modality, and spread. Essential for every numeric feature.

plt.hist(data, bins=20,
         color='steelblue',
         edgecolor='white')
plt.axvline(data.mean(), c='red')  # mark the mean
plt.show()

⚡ ML: normality check

Chapter 2 of 4

Seaborn

Statistical visualisation on top of Matplotlib. Beautiful charts with one line. Heatmaps, pair plots, distribution plots.

Seaborn — Statistical Charts

Seaborn wraps Matplotlib with beautiful defaults and statistical chart types that are impractical to build from scratch.

📇

Key Seaborn charts for ML

These four charts answer the most important questions about your data before building any model. Run all four in your EDA notebook on every new dataset.

import seaborn as sns

sns.histplot(df['Age'], kde=True)                    # distribution + curve
sns.boxplot(x='Grade', y='Score', data=df)           # outliers by group
sns.heatmap(df.corr(numeric_only=True), annot=True)  # correlation matrix
sns.pairplot(df)                                     # all pairs at once

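⚡ ML Pro Tip: Run sns.heatmap(df.corr(numeric_only=True), annot=True) on every dataset. As a rule of thumb, features whose absolute correlation with the target exceeds 0.7 are strong linear predictors. Features near 0 show no linear relationship with the target, though they may still matter non-linearly, so check a scatter plot before dropping them.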
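The heatmap shows every pair at once, but for feature selection it is often enough to rank features by their correlation with the target. The sketch below uses synthetic data, and the column names `Hours`, `ShoeSize`, `Score`, and `Name` are hypothetical. It assumes a pandas version that supports `numeric_only` in `corr`. The last lines illustrate why a correlation near 0 does not prove a feature is useless.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 500

hours = rng.uniform(0, 10, n)
df = pd.DataFrame({
    "Hours": hours,                             # drives the target
    "ShoeSize": rng.normal(40, 2, n),           # unrelated to the target
    "Score": 5 * hours + rng.normal(0, 2, n),   # target, linear in Hours
    "Name": ["student"] * n,                    # non-numeric, skipped by corr
})

# Absolute correlation of each numeric feature with the target, strongest first
target_corr = (
    df.corr(numeric_only=True)["Score"]
      .drop("Score")
      .abs()
      .sort_values(ascending=False)
)
print(target_corr)

# A perfectly dependent but non-linear relationship can still have r near 0
x = rng.uniform(-3, 3, n)
r_quadratic = np.corrcoef(x, x ** 2)[0, 1]
print(f"r between x and x squared: {r_quadratic:.2f}")
```

Here `x ** 2` is fully determined by `x`, yet Pearson correlation barely registers it because the relationship is not a straight line.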

Chapter 3 of 4

EDA Workflow

A systematic process for understanding a new dataset from scratch. The exact workflow used by professional data scientists.

The 4-Step EDA Workflow

Work through this sequence before building any model. Skipping EDA often leads to bad models.

🔍

1. Shape & Types

df.shape, df.info(), df.dtypes. How many rows, columns, what types? Any object columns that need encoding?

Always first step
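A minimal sketch of this first step on a small made-up table. The column names and values are hypothetical and stand in for whatever your dataset contains.

```python
import pandas as pd

# Hypothetical student records
df = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meera"],
    "Grade": ["A", "B", "A"],
    "Age": [20, 21, 22],
    "Score": [88.5, 72.0, 91.0],
})

print(df.shape)  # (rows, columns)
df.info()        # column names, non-null counts, dtypes

# Non-numeric columns are the ones that may need encoding before modelling
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

Not every non-numeric column should be encoded. An identifier such as `Name` is usually dropped instead.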

🧹

2. Missing Values

df.isnull().sum(). Which columns have missing data? How much? The answer decides whether to fill or drop.

Preprocessing plan
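Counts are easier to act on when paired with the share of rows missing. The sketch below uses made-up data, and the 0.5 cut-off for dropping a column is only an illustrative assumption, not a rule from this lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps
df = pd.DataFrame({
    "Age": [20, 21, np.nan, 23, 24],
    "Score": [88.0, np.nan, np.nan, np.nan, 70.0],
    "Grade": ["A", "B", "A", "B", "A"],
})

missing_count = df.isnull().sum()   # missing cells per column
missing_share = df.isnull().mean()  # fraction of rows missing per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("share", ascending=False))

# Illustrative threshold: columns missing in more than half the rows
drop_candidates = missing_share[missing_share > 0.5].index.tolist()
print(drop_candidates)
```

Columns below the threshold are candidates for filling, for example with the median as in the complete workflow later in this lesson.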

📊

3. Distributions

df.describe(). Histogram per feature. Are distributions normal? Are there outliers? Extreme max values?

Feature engineering

🔗

4. Correlations

df.corr(numeric_only=True) heatmap. Which features correlate with the target? Any redundant features? Flag weak or duplicate features before training.

Feature selection
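Redundant features show up as pairs that correlate strongly with each other rather than with the target. One way to list them, sketched on made-up data, is to scan the upper triangle of the correlation matrix. The 0.95 threshold and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical features: Minutes carries the same information as Hours
df = pd.DataFrame({
    "Hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Minutes": [60.0, 120.0, 180.0, 240.0, 300.0, 360.0],
    "Sleep": [7.0, 6.0, 8.0, 5.0, 7.5, 6.5],
})

corr = df.corr(numeric_only=True).abs()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

redundant_pairs = [
    (row, col)
    for col in upper.columns
    for row in upper.index
    if upper.loc[row, col] > 0.95
]
print(redundant_pairs)
```

For each pair found, keeping one feature is usually enough. Which one to keep is a judgment call based on what the columns mean.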

Synkoc Instructor Analogy

"EDA is like a doctor's check-up before prescribing medicine. You would never prescribe medicine without examining the patient first. You should never train a model without examining your data first. The check-up takes 30 minutes. It saves you days of debugging a bad model."

Chapter 4 of 4

EDA in Practice

Putting it all together — a complete EDA example on a real dataset using both Matplotlib and Seaborn.

Complete EDA Workflow — Exam Score Dataset

eda_complete.py

# Step 1: Load & Shape
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('students.csv')
print(df.shape)
print(df.dtypes)

# Step 2: Missing values
print(df.isnull().sum())
df['Score'] = df['Score'].fillna(df['Score'].median())

# Step 3: Distributions
sns.histplot(df['Score'], kde=True)
plt.show()

# Step 4: Correlations (numeric columns only)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

Lesson Summary

You have completed Data Visualisation and EDA. Here is what you can now do:

📈

Matplotlib

Create line, bar, scatter, and histogram charts. Add titles, labels, and trend lines. figsize controls dimensions.

📇

Seaborn

Use histplot, boxplot, heatmap, and pairplot for statistical visualisation. annot=True shows correlation values.

🔍

EDA Workflow

4-step process: shape and types, missing values, distributions, correlations. Run it before every model you build.

🔗

Correlation Analysis

Flag strong linear predictors with abs(r) above 0.7 as a rule of thumb. Treat features near 0 as candidates for removal, not automatic noise. Use both to guide feature selection.

🌟

Week 2 Complete!

NumPy, Pandas, Visualisation, and EDA are done. Open the Practical Lab, complete the exercises, and take the Quiz. Week 3 brings Machine Learning algorithms!

✓ Video — Done

✏ Practical Lab — Next

❓ Quiz — After Lab

Synkoc IT Services · Bangalore · support@synkoc.com