Synkoc AI/ML Internship · Week 2 · Lesson 6 of 13
Exploratory Data Analysis
The complete Week 2 pipeline. Load → Inspect → Clean → Visualise → Feature Engineer → Export X and y. Your standard workflow before any ML model.
🧑‍💻
Synkoc Instructor
AI/ML Professional · Bangalore
⏳ ~55 minutes
🟢 Week 2 Capstone
What is EDA?
EDA is the detective work of data science. Before you build any model, you must understand your data deeply. What is the shape? What types? What is missing? What are the distributions? What correlates with what?
🔎
Phase 1: Inspect
Load data. Check shape, dtypes, null counts. head(), info(), describe(). First 5 minutes.
Always run first
🧹
Phase 2: Clean
Handle missing values. Fix wrong types. Remove duplicates. Drop irrelevant columns.
fillna, dropna, astype
📈
Phase 3: Visualise
Histograms, heatmap, scatter, box plots. See what statistics cannot show.
matplotlib + seaborn
⚙️
Phase 4: Feature Eng
Encode categories. Normalise numerics. Create X and y. Ready for any sklearn model.
Output: X.shape, y.shape
Chapter 1 of 4
01
Phase 1: Inspect
Six commands that tell you everything about a raw dataset. Run these before touching anything. In every ML project, every time.
Inspection Commands
These six commands, run in sequence, give you a complete picture of any dataset before you make any decisions.
📊
df.shape + df.columns
How many rows and columns? What are the feature names? First check to orient yourself.
print(df.shape)    # (1000, 8)
print(df.columns)  # Index(['age'...])
print(df.dtypes)   # age int64...
⚡ dtypes: if a numeric column shows 'object' → convert it with astype(float)
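A plain astype(float) raises an error when the column contains stray text such as 'N/A'. The following is a minimal sketch, assuming a hypothetical score column with one bad entry, of using pd.to_numeric with errors="coerce" so that unparseable values become NaN and can be imputed later:

```python
import pandas as pd

# Hypothetical data: 'score' was read as text because of one bad entry.
df = pd.DataFrame({"age": [21, 25, 30], "score": ["77.5", "N/A", "91"]})
print(df.dtypes)  # score is a text dtype (object in most pandas versions)

# errors="coerce" turns anything unparseable into NaN instead of raising.
df["score"] = pd.to_numeric(df["score"], errors="coerce")
print(df.dtypes)                   # score is now float64
print(df["score"].isnull().sum())  # the bad entry became NaN
```

Coercing first and imputing second keeps the bad entries visible in the null counts instead of hiding them.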
📝
df.head() + df.info()
head() shows first 5 rows as a table. info() shows dtypes, null counts, memory usage.
df.head()   # first 5 rows
df.tail()   # last 5 rows
df.info()   # dtypes + nulls
df.sample(5) # random 5 rows
⚡ info() Non-Null count < total = missing!
📉
df.describe()
Returns count, mean, std, min, 25%, 50%, 75%, max for every numeric column. All stats from Week 1 at once.
df.describe()
#        age     score
# mean   25.4     77.2
# std     4.1     12.8
# max    45.0     98.0
⚡ max far above the 75% value = possible outlier!
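The gap between the 75% value and the max can be turned into an explicit check. This is a minimal sketch using the common 1.5 × IQR rule of thumb on hypothetical age values; the threshold is a convention, not something the lesson prescribes:

```python
import pandas as pd

# Hypothetical ages with one suspicious entry (45).
ages = pd.Series([22, 23, 24, 25, 25, 26, 27, 28, 45], name="age")

q1 = ages.quantile(0.25)  # 25% value from describe()
q3 = ages.quantile(0.75)  # 75% value from describe()
iqr = q3 - q1

# Rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())  # [45]
```

A flagged value is only a candidate. Whether to keep, cap, or remove it depends on whether it is a data error or a genuine extreme.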
🔍
df.isnull().sum()
Count missing values per column. Reveals which columns need imputation and how much data is missing.
df.isnull().sum()
# age     0
# score   23 <-- fix this!
# city    12 <-- fix this!
# passed  0
⚡ Above 30% missing → consider dropping column
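Raw counts are easier to judge as percentages. The following is a minimal sketch, on a hypothetical frame with a mostly empty notes column, of computing the missing share per column and applying the 30% rule of thumb from the tip above:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'notes' is mostly empty, 'score' has one gap.
df = pd.DataFrame({
    "age":   [21, 25, 30, 28, 24],
    "score": [77.0, np.nan, 91.0, 64.0, 80.0],
    "notes": [np.nan, np.nan, "late", np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing values per column.
missing_pct = df.isnull().mean() * 100
print(missing_pct.round(1))

# 30% is a rule of thumb, not a hard limit; adjust it per project.
threshold = 30
cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()
df = df.drop(columns=cols_to_drop)
print(cols_to_drop)  # ['notes']
```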
Chapter 2 of 4
02
Phase 2: Clean
Handle every missing value. Fix data types. Remove duplicates. Garbage in, garbage out — cleaning is the single most impactful step in any ML project.
Data Cleaning Playbook
Four cleaning actions in order. Always inspect before cleaning. Never impute the target column.
📋
Impute missing values
Fill numeric NaN with mean or median. Fill categorical NaN with mode. Drop rows if target is missing.
df['score'] = df['score'].fillna(df['score'].mean())
df['city'] = df['city'].fillna(df['city'].mode()[0])
df = df.dropna(subset=['passed'])
⚡ Use median instead of mean for skewed columns
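To see why the median is the safer fill value for skewed columns, here is a minimal sketch on a hypothetical income column where a single large value drags the mean far from a typical row:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column: one very large income and one missing value.
df = pd.DataFrame({"income": [30.0, 32.0, 35.0, 38.0, np.nan, 400.0]})

mean_val = df["income"].mean()      # 107.0, pulled up by the 400
median_val = df["income"].median()  # 35.0, close to a typical row

# Assign the result back to the column instead of using inplace=True.
df["income"] = df["income"].fillna(median_val)
print(mean_val, median_val)
```

Filling with 107.0 would insert a value larger than four of the five observed incomes, while 35.0 sits in the middle of them.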
🔧
Fix data types
Numeric data stored as strings is extremely common after CSV loading. Use astype() to fix it.
df['score'] = df['score'].astype(float)
df['age'] = df['age'].astype(int)
# Dates:
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
⚡ Always check dtypes after pd.read_csv()
🗑️
Remove duplicates
Duplicate rows inflate class counts and bias training. One line each to detect and remove them.
n_dups = df.duplicated().sum()
print(f"Duplicates: {n_dups}")
df = df.drop_duplicates()
print(f"Clean rows: {len(df)}")
⚡ Always check duplicates in real-world data
🎉
Encode categories
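ML needs numbers. LabelEncoder converts text to integers, which implies an order between categories, so it suits targets and ordinal data best. pd.get_dummies gives one-hot encoding with no implied order.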
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['city_enc'] = le.fit_transform(df['city'])
# Or one-hot encoding:
df = pd.get_dummies(df, columns=['city'])
⚡ Encode before separating X and y
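The two encodings produce differently shaped outputs. The following is a minimal sketch on a hypothetical city column. It uses pandas category codes as a dependency-free stand-in for integer label encoding (an assumption made for this example; LabelEncoder likewise numbers classes in sorted order) next to pd.get_dummies:

```python
import pandas as pd

# Hypothetical categorical feature.
df = pd.DataFrame({"city": ["Mysuru", "Bangalore", "Chennai", "Bangalore"]})

# Integer codes: one column, categories numbered in sorted order.
# The implied order (Bangalore < Chennai < Mysuru) is arbitrary.
df["city_enc"] = df["city"].astype("category").cat.codes

# One-hot: one 0/1 column per category, no implied order.
one_hot = pd.get_dummies(df["city"], prefix="city", dtype=int)

print(df)
print(one_hot)
```

Integer codes keep the frame narrow, while one-hot columns avoid suggesting to a linear model that Mysuru is "greater than" Bangalore.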
eda_inspect_clean.py (live demo)
Chapter 3 of 4
03
Phase 3: Visualise
The 6-plot EDA checklist. Run every one before training any model. Each plot answers a specific question about your data that statistics alone cannot.
The 6-Plot EDA Checklist
Six visualisations that constitute a complete EDA. Each answers one critical question before you train any ML model.
1️⃣
Histogram per numeric column
plt.hist() on every numeric feature. Is the distribution normal, skewed, or bimodal? The answer decides the normalisation strategy.
2️⃣
sns.countplot() on target
Is the target class balanced? If 95% of rows are class 0 and 5% are class 1, you have class imbalance, which typically calls for oversampling or a weighted loss.
3️⃣
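sns.heatmap(df.corr(numeric_only=True))
Correlation matrix of the numeric columns. As a rule of thumb, features with |r| > 0.5 against the target are promising. When two features have |r| > 0.9 with each other, one of them is redundant and can be dropped.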
4️⃣
plt.scatter() feature vs target
Confirms whether relationship is linear, non-linear, or absent. Decides which model family is appropriate.
5️⃣
sns.boxplot() per feature
Visualise outliers. The box shows the IQR. Points beyond the whiskers are potential outliers. Decide whether to keep, cap, or remove them.
6️⃣
Missing value bar chart
df.isnull().sum().plot(kind='bar'). Instantly see which columns need imputation and how much data is missing.
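The checklist above can be sketched as a single 2 × 3 figure. This is a minimal sketch on a synthetic dataset generated purely for illustration (the hours column and all values are hypothetical). It uses matplotlib only so it runs without extra dependencies; sns.countplot, sns.heatmap, and sns.boxplot can replace plots 2, 3, and 5:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data, generated only so the sketch runs on its own.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 45, n).astype(float),
    "hours": rng.normal(5, 2, n),
})
df["score"] = 50 + 5 * df["hours"] + rng.normal(0, 5, n)
df["passed"] = (df["score"] > 75).astype(int)
df.loc[rng.choice(n, 15, replace=False), "score"] = np.nan  # inject gaps

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# 1. Histogram of a numeric column
axes[0, 0].hist(df["score"].dropna(), bins=20)
axes[0, 0].set_title("1. Histogram: score")

# 2. Class balance of the target
df["passed"].value_counts().sort_index().plot(kind="bar", ax=axes[0, 1])
axes[0, 1].set_title("2. Target counts: passed")

# 3. Correlation matrix of numeric columns
corr = df.corr(numeric_only=True)
im = axes[0, 2].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[0, 2].set_xticks(range(len(corr)))
axes[0, 2].set_xticklabels(corr.columns, rotation=45)
axes[0, 2].set_yticks(range(len(corr)))
axes[0, 2].set_yticklabels(corr.columns)
fig.colorbar(im, ax=axes[0, 2])
axes[0, 2].set_title("3. Correlation matrix")

# 4. Feature against a numeric outcome
axes[1, 0].scatter(df["hours"], df["score"], s=10)
axes[1, 0].set_title("4. Scatter: hours vs score")

# 5. Box plot to expose outliers
axes[1, 1].boxplot(df["hours"].to_numpy())
axes[1, 1].set_title("5. Box plot: hours")

# 6. Missing values per column
df.isnull().sum().plot(kind="bar", ax=axes[1, 2])
axes[1, 2].set_title("6. Missing values")

fig.tight_layout()
# fig.savefig("eda_checklist.png")  # or plt.show() in a notebook
```

Putting all six panels on one figure makes it easy to rerun the whole checklist after each cleaning step and compare the results.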
Chapter 4 of 4
04
Phase 4: Feature Engineering
Produce the final X matrix and y labels. Normalise features. This is the handoff from data analysis to machine learning.
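The handoff can be sketched in a few lines. This is a minimal sketch, not the lesson's own script, assuming a hypothetical already-cleaned frame with the column names used in earlier examples. It standardises with plain pandas z-scores; scikit-learn's StandardScaler is a common alternative:

```python
import pandas as pd

# Hypothetical cleaned dataset; column names follow the earlier examples.
df = pd.DataFrame({
    "age":    [21.0, 25.0, 30.0, 28.0, 24.0, 35.0],
    "score":  [77.0, 58.0, 91.0, 64.0, 80.0, 70.0],
    "city":   ["Bangalore", "Chennai", "Bangalore",
               "Mysuru", "Chennai", "Mysuru"],
    "passed": [1, 0, 1, 0, 1, 1],
})

# 1. Encode categories before separating X and y.
df = pd.get_dummies(df, columns=["city"], dtype=int)

# 2. Separate the features from the target.
y = df["passed"]
X = df.drop(columns=["passed"])

# 3. Standardise numeric features to mean 0 and standard deviation 1.
#    ddof=0 is the population standard deviation, as StandardScaler uses.
num_cols = ["age", "score"]
X[num_cols] = (X[num_cols] - X[num_cols].mean()) / X[num_cols].std(ddof=0)

print(X.shape, y.shape)  # (6, 5) (6,)
```

Once a train/test split is introduced, the mean and standard deviation should be computed on the training rows only and then applied to the test rows, so that no information leaks from the test set.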
eda_complete_pipeline.py (full pipeline)
Week 2 Complete Summary
You have completed the entire Week 2 curriculum. Here is everything you can now do:
📌
NumPy
Arrays, vectorised ops, boolean masking, np.mean/std/dot. Numerical backbone of all ML. Vectorised operations are often tens of times faster than equivalent loops over Python lists.
📚
Pandas
Load CSV, select/filter, handle missing values, groupby. Standard data manipulation for every ML project.
📈
Visualisation
Matplotlib core charts, Seaborn statistical plots, 6-plot EDA checklist. See patterns statistics cannot reveal.
⚙️
Full EDA Pipeline
Inspect → Clean → Visualise → Encode → Normalise → X and y. Week 3 ML training starts exactly here.
🎉
Week 2 Complete!
All of Week 2 mastered. Complete the EDA Lab — 5 hands-on tasks building the full pipeline. After the quiz, Week 3 begins: training real ML models with scikit-learn.
Synkoc IT Services · Bangalore · support@synkoc.com · +91-9019532023