Exploratory Data Analysis | Week 2 | Synkoc AI/ML Internship

Synkoc AI/ML Internship · Week 2 · Lesson 6 of 13

Exploratory Data Analysis

The complete Week 2 pipeline. Load → Inspect → Clean → Visualise → Feature Engineer → Export X and y. Your standard workflow before any ML model.

🔎 Inspect

🧹 Clean

📈 Visualise

⚙️ Feature Eng

🧑‍💻

Synkoc Instructor

AI/ML Professional · Bangalore

⏳ ~55 minutes
🟢 Week 2 Capstone

What is EDA?

EDA is the detective work of data science. Before you build any model, you must understand your data deeply. What is the shape? What types? What is missing? What are the distributions? What correlates with what?

🔎

Phase 1: Inspect

Load data. Check shape, dtypes, null counts. head(), info(), describe(). First 5 minutes.

Always run first

🧹

Phase 2: Clean

Handle missing values. Fix wrong types. Remove duplicates. Drop irrelevant columns.

fillna, dropna, astype

📈

Phase 3: Visualise

Histograms, heatmap, scatter, box plots. See what statistics cannot show.

matplotlib + seaborn

⚙️

Phase 4: Feature Eng

Encode categories. Normalise numerics. Create X and y. Ready for any sklearn model.

Output: X.shape, y.shape

Chapter 1 of 4

Phase 1: Inspect

Six commands that give you a first complete picture of a raw dataset. Run these before changing anything, in every ML project, every time.

Inspection Commands

Run in sequence, these six commands give you a complete picture of any dataset before you make any decisions.

📊

df.shape + df.columns

How many rows and columns? What are the feature names? First check to orient yourself.

print(df.shape)    # (1000, 8)
print(df.columns)  # Index(['age'...])
print(df.dtypes)   # age int64...

⚡ dtypes: if a numeric column shows 'object' → convert it with astype(float)

📝

df.head() + df.info()

head() shows the first 5 rows as a table. info() shows dtypes, non-null counts and memory usage.

df.head()     # first 5 rows
df.tail()     # last 5 rows
df.info()     # dtypes + non-null counts
df.sample(5)  # random 5 rows

⚡ info() Non-Null count < total = missing!

📉

df.describe()

Returns count, mean, std, min, 25%, 50%, 75%, max for every numeric column. All stats from Week 1 at once.

df.describe()
#        age     score
# mean   25.4     77.2
# std     4.1     12.8
# max    45.0     98.0

⚡ max far above the 75% value → possible outliers, confirm with a box plot
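
The "max far above 75%" check can be made concrete with the 1.5 × IQR rule, the same rule a box plot uses by default. This is a minimal sketch on made-up data; the column values and the variable names `upper_fence` and `lower_fence` are illustrative assumptions, not part of the lesson files.

```python
import pandas as pd

# Hypothetical data: one age value sits far above the rest.
df = pd.DataFrame({"age": [22, 23, 24, 25, 25, 26, 27, 28, 29, 95]})

# describe() already contains the quartiles we need.
stats = df["age"].describe()
q1, q3 = stats["25%"], stats["75%"]
iqr = q3 - q1

# Values outside these fences are flagged as potential outliers.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

outliers = df[(df["age"] < lower_fence) | (df["age"] > upper_fence)]
print(f"max={stats['max']}, 75%={q3}, upper fence={upper_fence}")
print(outliers)
```

A flagged value is only a candidate: whether to keep, cap or remove it depends on whether it is a data-entry error or a genuine observation.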

🔍

df.isnull().sum()

Count missing values per column. Reveals which columns need imputation and how much data is missing.

df.isnull().sum()
# age      0
# score   23 <-- fix this!
# city    12 <-- fix this!
# passed   0

⚡ Above 30% missing → consider dropping column
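
To apply the 30% rule of thumb, it helps to express missing values as a percentage of rows rather than a raw count. A minimal sketch, assuming a small made-up DataFrame; the name `THRESHOLD` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two columns.
df = pd.DataFrame({
    "age": [25, 31, 22, 40, 28],
    "score": [77.0, np.nan, 81.5, np.nan, 90.0],
    "city": ["Pune", None, "Delhi", "Pune", "Pune"],
})

# isnull().mean() gives the fraction of missing rows per column.
missing_pct = df.isnull().mean() * 100
print(missing_pct.round(1))

# Columns above the cut-off are candidates for dropping, not automatic drops.
THRESHOLD = 30
candidates = missing_pct[missing_pct > THRESHOLD].index.tolist()
print("Consider dropping:", candidates)
```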

Chapter 2 of 4

Phase 2: Clean

Handle every missing value. Fix data types. Remove duplicates. Garbage in, garbage out — cleaning is the single most impactful step in any ML project.

Data Cleaning Playbook

Four cleaning actions in order. Always inspect before cleaning. Never impute the target column.

📋

Impute missing values

Fill numeric NaN with mean or median. Fill categorical NaN with mode. Drop rows if target is missing.

df['score'] = df['score'].fillna(df['score'].mean())
df['city'] = df['city'].fillna(df['city'].mode()[0])
df = df.dropna(subset=['passed'])  # never impute the target

⚡ Use median instead of mean for skewed columns
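
The reason for preferring the median on skewed columns is easiest to see on a small example. This sketch uses a made-up `income` series with one extreme value; the data is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column: one very large value and one missing value.
income = pd.Series([30, 32, 35, 38, 40, np.nan, 400], name="income")

mean_value = income.mean()      # pulled upwards by the single value 400
median_value = income.median()  # stays close to the typical values

print(f"mean={mean_value:.1f}, median={median_value:.1f}")

# Filling with the median keeps the imputed value in the typical range.
filled = income.fillna(median_value)
print(filled.tolist())
```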

🔧

Fix data types

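Numeric data stored as strings is extremely common after CSV loading. Use astype() to fix it, and convert to float before computing a mean or median for imputation. Note that astype(int) raises an error if the column still contains NaN.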

df['score'] = df['score'].astype(float)
df['age'] = df['age'].astype(int)
# Dates:
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month

⚡ Always check dtypes after pd.read_csv()
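
astype(float) raises an error as soon as one value cannot be parsed. When a column may contain placeholders such as "N/A", one option is pd.to_numeric with errors="coerce", which turns unparseable entries into NaN so they can be imputed afterwards. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical column as it might arrive from a CSV: numbers stored as text,
# with stray whitespace and non-numeric placeholders.
df = pd.DataFrame({"score": ["77.5", "81", "N/A", " 90 ", "unknown"]})

# Strip whitespace, then convert; anything unparseable becomes NaN.
df["score"] = pd.to_numeric(df["score"].str.strip(), errors="coerce")

print(df["score"].dtype)
print(df["score"].isnull().sum(), "values could not be parsed")
```

Counting the NaN values before and after the conversion shows how many entries were silently coerced, which is worth checking before imputing them.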

🗑️

Remove duplicates

Duplicate rows inflate class counts and bias training. One line detects them, one line removes them.

n_dups = df.duplicated().sum()
print(f"Duplicates: {n_dups}")
df = df.drop_duplicates()
print(f"Clean rows: {len(df)}")

⚡ Always check duplicates in real-world data
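
Before dropping anything, it can be worth looking at the duplicated rows themselves. The sketch below assumes a hypothetical `student_id` column to show the difference between exact duplicate rows and rows that only share a key.

```python
import pandas as pd

# Hypothetical data: student 2 appears twice with identical values,
# student 3 appears twice with different scores.
df = pd.DataFrame({
    "student_id": [1, 2, 2, 3, 3],
    "score": [70, 85, 85, 90, 92],
})

# keep=False marks every copy of a fully identical row, not just the repeats.
exact_copies = df[df.duplicated(keep=False)]

# subset= compares only the listed columns, here the key column.
repeated_ids = df[df.duplicated(subset=["student_id"], keep=False)]

# drop_duplicates() with no arguments removes exact copies only.
deduped = df.drop_duplicates()

print(exact_copies)
print(repeated_ids)
print(f"Clean rows: {len(deduped)}")
```

Rows that share a key but differ in value (student 3 here) are not removed by drop_duplicates() and usually need a manual decision.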

🎉

Encode categories

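ML models need numbers. LabelEncoder converts text to integer codes, which implies an order between categories. For unordered categories such as city, one-hot encoding with pd.get_dummies is usually the safer choice.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['city_enc'] = le.fit_transform(df['city'])
# Or one-hot encoding (replaces 'city' with one 0/1 column per value):
df = pd.get_dummies(df, columns=['city'])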

⚡ Encode before separating X and y
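
A minimal sketch of "encode, then separate X and y", using one-hot encoding on a made-up DataFrame. The column names follow the examples above; the data itself is an assumption.

```python
import pandas as pd

# Hypothetical cleaned data: one numeric feature, one categorical, one target.
df = pd.DataFrame({
    "age": [25, 31, 22, 40],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "passed": [1, 0, 1, 1],
})

# One-hot encode the categorical column; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

# Only after encoding: split into the feature matrix X and the target y.
X = encoded.drop(columns=["passed"])
y = encoded["passed"]

print(X.columns.tolist())
print(X.shape, y.shape)
```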

eda_inspect_clean.py

Chapter 3 of 4

Phase 3: Visualise

The 6-plot EDA checklist. Run every one before training any model. Each plot answers a specific question about your data that statistics alone cannot.

The 6-Plot EDA Checklist

Six visualisations that constitute a complete EDA. Each answers one critical question before you train any ML model.

1️⃣

Histogram per numeric column

plt.hist() on every numeric feature. Is the distribution normal, skewed or bimodal? The answer guides the normalisation strategy.

2️⃣

sns.countplot() on target

Is the target class balanced? If 95% of rows are class 0 and 5% are class 1, you have class imbalance, which usually calls for resampling or class weights.

3️⃣

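sns.heatmap(df.corr(numeric_only=True))

Correlation matrix. As a rule of thumb, features with |r| > 0.5 against the target are promising, and one of any two features with |r| > 0.9 between each other is redundant. Correlation only measures linear relationships, so a low |r| alone is not a reason to drop a feature.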

4️⃣

plt.scatter() feature vs target

Confirms whether relationship is linear, non-linear, or absent. Decides which model family is appropriate.

5️⃣

sns.boxplot() per feature

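Visualise outliers. The box spans the IQR (25th to 75th percentile). Points beyond the whiskers, which by default extend 1.5 × IQR from the box, are flagged as potential outliers. Decide whether to keep, cap, or remove them.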

6️⃣

Missing value bar chart

df.isnull().sum().plot(kind='bar'). Instantly see which columns need imputation and how much data is missing.
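
The decisions that plots 2 and 3 feed into can also be checked numerically. The sketch below applies the class-balance check and the |r| > 0.5 and |r| > 0.9 rules of thumb to a made-up DataFrame; the column names `hours`, `minutes` and `shoe` are hypothetical, and the thresholds are the ones quoted in the checklist, not fixed rules.

```python
import pandas as pd

# Hypothetical data: 'minutes' is just hours * 60 (redundant),
# 'shoe' is unrelated to the target.
df = pd.DataFrame({
    "hours":   [1, 2, 3, 4, 5, 6, 7, 8],
    "minutes": [60, 120, 180, 240, 300, 360, 420, 480],
    "shoe":    [8, 6, 9, 7, 8, 6, 9, 7],
    "passed":  [0, 0, 0, 0, 1, 1, 1, 1],
})

# Plot 2 in numbers: share of each target class.
class_share = df["passed"].value_counts(normalize=True)
print(class_share)

# Plot 3 in numbers: the matrix behind the heatmap.
corr = df.corr(numeric_only=True)

# Features whose absolute correlation with the target exceeds 0.5.
target_corr = corr["passed"].drop("passed").abs()
useful = target_corr[target_corr > 0.5].index.tolist()

# Feature pairs that are almost copies of each other (|r| > 0.9).
features = corr.drop(index="passed", columns="passed").abs()
redundant_pairs = [
    (a, b)
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if features.loc[a, b] > 0.9
]

print("Correlated with target:", useful)
print("Redundant pairs:", redundant_pairs)
```

Here both `hours` and `minutes` pass the target check, but the pair check shows they carry the same information, so only one of them needs to be kept.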

Chapter 4 of 4

Phase 4: Feature Engineering

Produce the final X matrix and y labels. Normalise features. This is the handoff from data analysis to machine learning.

eda_complete_pipeline.py · Full Pipeline
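
The lesson's pipeline file is not reproduced here. As a rough illustration of the handoff this chapter describes, here is a minimal sketch of a clean → encode → normalise → X and y function. The function name `prepare`, the sample data and the choice of z-score normalisation are assumptions made for this example, not the course's implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data standing in for a CSV loaded with pd.read_csv().
raw = pd.DataFrame({
    "age":    ["25", "31", "22", "40", "31", "28"],
    "score":  [77.0, np.nan, 81.5, 64.0, np.nan, 90.0],
    "city":   ["Pune", "Delhi", None, "Pune", "Delhi", "Mumbai"],
    "passed": [1, 0, 1, 0, 0, np.nan],
})

def prepare(df, target="passed"):
    """Return (X, y) from a raw frame. Illustrative only."""
    df = df.copy()

    # Clean: fix types first, drop rows with a missing target, drop duplicates.
    df["age"] = pd.to_numeric(df["age"], errors="coerce")
    df = df.dropna(subset=[target]).drop_duplicates()

    # Clean: impute features (median for numeric, mode for categorical).
    df["score"] = df["score"].fillna(df["score"].median())
    df["city"] = df["city"].fillna(df["city"].mode()[0])

    # Encode: one-hot the categorical column, then split features and target.
    df = pd.get_dummies(df, columns=["city"], dtype=int)
    X = df.drop(columns=[target])
    y = df[target].astype(int)

    # Normalise: z-score each numeric feature (mean 0, standard deviation 1).
    for col in ["age", "score"]:
        X[col] = (X[col] - X[col].mean()) / X[col].std(ddof=0)

    return X, y

X, y = prepare(raw)
print(X.shape, y.shape)
```

In a real project the normalisation statistics should be computed on the training split only and then reused on the test split; this sketch skips the split to stay short.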

Week 2 Complete Summary

You have completed the entire Week 2 curriculum. Here is everything you can now do:

📌

NumPy

Arrays, vectorised ops, boolean masking, np.mean/std/dot. Numerical backbone of all ML. Often orders of magnitude faster than looping over Python lists.

📚

Pandas

Load CSV, select/filter, handle missing values, groupby. Standard data manipulation for every ML project.

📈

Visualisation

Matplotlib core charts, Seaborn statistical plots, 6-plot EDA checklist. See patterns statistics cannot reveal.

⚙️

Full EDA Pipeline

Inspect → Clean → Visualise → Encode → Normalise → X and y. Week 3 ML training starts exactly here.

🎉

Week 2 Complete!

All of Week 2 mastered. Complete the EDA Lab — 5 hands-on tasks building the full pipeline. After the quiz, Week 3 begins: training real ML models with scikit-learn.

✓ Week 2 — Done

✏ EDA Lab — Next

▶ Week 3 ML — After Quiz

Synkoc IT Services · Bangalore · support@synkoc.com · +91-9019532023