Synkoc AI/ML Internship · Week 2 · Lesson 4 of 13
Pandas
DataFrames & Data Cleaning
Alongside NumPy, one of the core data science libraries in Python. Load CSV files, clean messy data, filter rows, group by categories, and prepare a dataset for ML.
📚 DataFrame
💾 Load CSV
🧹 Clean Data
📉 GroupBy
🧑‍💻
Synkoc Instructor
AI/ML Professional · Bangalore
⏳ ~55 minutes
🔵 Core Tool
Why Pandas is Essential
Real-world data lives in CSV files, Excel sheets, and databases. Pandas gives every column a name, every row an index, and provides tools to clean, transform, and analyse tabular data before training any ML model.
📚
DataFrame
A 2D table with labelled rows and columns. Like a spreadsheet but in Python code.
df['age'] not X[:,0]
💾
Read CSV
One line to load any CSV file. Pandas infers column names and data types automatically.
pd.read_csv("data.csv")
🧹
Missing Values
Real datasets have gaps. Pandas detects and fills missing values. Cleaning is often said to take up most of an ML project's time.
fillna(), dropna()
📉
GroupBy
Compute statistics per category. Average score per class. Total sales per region. One line.
df.groupby('class')['score'].mean()
Chapter 1 of 4
01
Creating DataFrames
From dictionaries, lists, and CSV files. Understanding index, columns, dtypes, and the describe function that summarises your entire dataset in one call.
Creating & Loading DataFrames
Three ways to create a DataFrame. The CSV method is how you will load every real dataset you work with.
📚
From a dictionary
Each key becomes a column name. Each list of values becomes the column data. Same length required.
df = pd.DataFrame({
"name": ["Priya","Rahul","Anjali"],
"score": [92, 75, 96],
"grade": ["A","B","A"]
})
⚡ Perfect for testing and small datasets
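As a minimal sketch, the dictionary example can be run and inspected like this. It reuses the three sample students from the card above and only looks at the index, columns, and dtypes mentioned in the chapter introduction.

```python
import pandas as pd

# Same toy data as the dictionary example
df = pd.DataFrame({
    "name": ["Priya", "Rahul", "Anjali"],
    "score": [92, 75, 96],
    "grade": ["A", "B", "A"],
})

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column names, in dictionary order
print(list(df.index))    # default integer row labels
print(df.dtypes)         # one data type per column
```

Because no index was passed, pandas assigns the default row labels 0, 1, 2.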
💾
From CSV file
One line loads any CSV. Pandas reads the header row as column names automatically.
df = pd.read_csv("students.csv")
print(df.shape)    # (rows, cols)
print(df.columns)  # column names
print(df.dtypes)   # data types
print(df.head())   # first 5 rows
⚡ Every Kaggle dataset starts with pd.read_csv()
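To try the CSV route without a file on disk, one option is to pass read_csv a file-like object. The CSV text below is a hypothetical stand-in for students.csv, not the actual course file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for students.csv
csv_text = """name,score,grade
Priya,92,A
Rahul,75,B
Anjali,96,A
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)             # (rows, cols)
print(df.columns.tolist())  # header row became the column names
print(df.dtypes)            # score is inferred as an integer column
print(df.head())            # first rows (at most 5)
```

With a real file, the call would simply be pd.read_csv("students.csv") as shown in the card.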
📉
df.describe()
One call returns count, mean, std, min, quartiles, and max for every numeric column. The statistics from Week 1 — all at once.
df.describe()
# count 3.0
# mean 87.7
# std 11.2
# min 75.0
# 25% 83.5
# 50% 92.0
# 75% 94.0
# max 96.0
⚡ First thing to run on any new dataset
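A short sketch of how the describe() result can be used in code, assuming the same three-student sample. The result is itself a DataFrame, so individual statistics can be looked up by row label.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Priya", "Rahul", "Anjali"],
    "score": [92, 75, 96],
    "grade": ["A", "B", "A"],
})

# By default describe() summarises numeric columns only
stats = df.describe()
print(stats.round(1))

# Rows are labelled count, mean, std, min, 25%, 50%, 75%, max
median_score = stats.loc["50%", "score"]
mean_score = stats.loc["mean", "score"]
```

Note that std here is the sample standard deviation (dividing by n - 1), which is the pandas default.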
📊
df.info()
Shows column names, data types, non-null count, and memory usage. Essential for detecting missing values.
df.info()
# Column Non-Null Dtype
# name 3 non-null object
# score 3 non-null int64
# grade 3 non-null object
⚡ Non-Null count reveals missing data immediately
Chapter 2 of 4
02
Selecting & Filtering
Access columns by name. Filter rows by condition. Select specific subsets of your data. The Pandas equivalents of NumPy indexing — but with readable column names.
Selecting & Filtering Data
Select columns by name, not by number. Filter rows by readable conditions. Combine filters with & and |. This is how you prepare ML features and labels.
📉
df['column'] and df[condition]
Square brackets with a column name return a Series, which is one column of data. Square brackets with a boolean condition filter rows. Use double brackets to select multiple columns. The loc accessor selects by label; iloc selects by integer position.
df['score']             # one column (Series)
df[['name','score']]     # multiple columns
df[df['score'] >= 90]    # rows where score>=90
df[df['grade'] == 'A']  # rows where grade=A
# Combine conditions:
df[(df['score']>75) & (df['grade']=='A')]
ML Connection: X = df[feature_cols] extracts your feature matrix. y = df['target'] extracts your labels. This is the standard way to separate X and y before calling sklearn's train_test_split.
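The selection patterns above can be sketched together as follows. The choice of score as the only feature and grade as the target is a hypothetical illustration, not a recommended model setup.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Priya", "Rahul", "Anjali"],
    "score": [92, 75, 96],
    "grade": ["A", "B", "A"],
})

# Boolean mask: each condition needs its own parentheses around it
top_a = df[(df["score"] > 75) & (df["grade"] == "A")]

# loc selects by label, iloc by integer position
first_score_by_label = df.loc[0, "score"]
first_score_by_position = df.iloc[0, 1]

# Hypothetical feature/target split
feature_cols = ["score"]
X = df[feature_cols]  # list of names gives a DataFrame (2D)
y = df["grade"]       # single name gives a Series (1D)

print(top_a)
print(X.shape, y.shape)
```

Passing a list of column names keeps X two-dimensional even with a single feature, which is the shape scikit-learn estimators expect.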
Chapter 3 of 4
03
Cleaning Missing Data
Real datasets almost always have missing values. Detecting and handling them is often said to take up most of a data scientist's time. Three strategies: drop, fill with mean or median, fill with mode.
Handling Missing Values
NaN stands for Not a Number and marks a missing value. Most ML models cannot handle NaN, so detect and handle every missing value before training.
🔍
Detect missing values
isnull() returns True where values are missing. Chaining sum() counts the missing values per column.
df.isnull().sum()
# name 0
# score 2 <-- 2 missing!
# grade 1 <-- 1 missing!
⚡ Always run this on any new dataset
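A minimal sketch that reproduces the counts shown above. The four-row dataset, including the extra student, is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing scores and one missing grade
df = pd.DataFrame({
    "name": ["Priya", "Rahul", "Anjali", "Vikram"],
    "score": [92, np.nan, 96, np.nan],
    "grade": ["A", "B", None, "A"],
})

# True counts as 1, so sum() gives the number of missing values per column
missing_per_column = df.isnull().sum()

# mean() of the same mask gives the fraction missing per column
missing_share = df.isnull().mean()

print(missing_per_column)
print(missing_share)
```

The fraction is often more useful than the raw count when deciding between dropping and filling.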
🗑️
Drop missing rows
dropna() removes any row containing a NaN. Use when missing data is rare and random.
df_clean = df.dropna()
# Removes rows with ANY missing
df.dropna(subset=['score'])
# Only drop if 'score' is missing
⚡ Dangerous if data is scarce — use carefully
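The difference between the two dropna() calls can be seen on a small made-up dataset. This sketch assumes the same hypothetical four students as a typical missing-value example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only the first row is fully complete
df = pd.DataFrame({
    "name": ["Priya", "Rahul", "Anjali", "Vikram"],
    "score": [92, np.nan, 96, np.nan],
    "grade": ["A", "B", None, "A"],
})

# Drop rows with a missing value in ANY column
df_any = df.dropna()

# Drop rows only when 'score' is missing
df_score = df.dropna(subset=["score"])

print(len(df), len(df_any), len(df_score))
```

Here dropping on any column keeps one row out of four, which illustrates why dropna() can be costly on small datasets. Both calls return new DataFrames and leave df unchanged.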
📋
Fill with mean/median
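fillna() replaces missing values with a value you supply, typically the column mean or median for numeric data. Median is better when outliers exist. fillna() returns a new Series, so assign the result back.
mean_score = df['score'].mean()
df['score'] = df['score'].fillna(mean_score)
# Or for skewed data:
df['score'] = df['score'].fillna(df['score'].median())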
⚡ Standard imputation strategy in ML pipelines
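To see why the median can be the safer choice, here is a small sketch with one extreme value. The scores are invented for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: one missing value and one extreme low score
df = pd.DataFrame({"score": [92.0, 75.0, 96.0, np.nan, 5.0]})

mean_score = df["score"].mean()      # NaN is skipped; pulled down by the 5.0
median_score = df["score"].median()  # less affected by the outlier

filled_mean = df["score"].fillna(mean_score)
filled_median = df["score"].fillna(median_score)

# fillna returns a new Series, so assign it back to keep the result
df["score"] = filled_median

print(mean_score, median_score)
```

With these numbers the mean is 67.0 while the median is 83.5, so the mean-filled value would sit well below three of the four observed scores.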
🆕
Fill with mode
For categorical columns, fill missing values with the most frequent category (mode).
mode_grade = df['grade'].mode()[0]
df['grade'] = df['grade'].fillna(mode_grade)
# mode() returns a Series (there can be ties)
# [0] takes the first mode value
⚡ Standard for categorical features
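A runnable sketch of the mode fill, using a made-up grade column with one gap.

```python
import pandas as pd

# Hypothetical data with one missing grade
df = pd.DataFrame({
    "name": ["Priya", "Rahul", "Anjali", "Vikram"],
    "grade": ["A", "B", None, "A"],
})

# mode() ignores missing values and returns a Series, sorted if there are ties
mode_grade = df["grade"].mode()[0]

# Assign back, since fillna returns a new Series
df["grade"] = df["grade"].fillna(mode_grade)

print(mode_grade)
print(df)
```

If two categories tie, [0] picks whichever sorts first, so it is worth checking the full mode() result on real data.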
Chapter 4 of 4
04
GroupBy & Aggregation
Compute statistics by group. Average score per class, total sales per region, count of students per grade. The foundation of feature engineering in ML.
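As a minimal sketch of these ideas, the example below groups the three sample students by grade. The grade_mean_score column name is a hypothetical choice for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Priya", "Rahul", "Anjali"],
    "score": [92, 75, 96],
    "grade": ["A", "B", "A"],
})

# Select the numeric column before aggregating
mean_by_grade = df.groupby("grade")["score"].mean()

# Several statistics per group in one call
summary = df.groupby("grade")["score"].agg(["mean", "max", "count"])

# Count of students per grade
grade_counts = df["grade"].value_counts()

# Feature-engineering style: attach each group's mean to every row
df["grade_mean_score"] = df.groupby("grade")["score"].transform("mean")

print(mean_by_grade)
print(summary)
print(grade_counts)
```

Selecting the score column first avoids asking pandas to average text columns such as name, which raises an error in recent pandas versions.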
Lesson Summary
You have completed Pandas. Here is what you can now do with any dataset:
📚
Create & Load DataFrames
pd.DataFrame() from dict, pd.read_csv() from files. df.head(), df.info(), df.describe() for instant analysis.
📉
Select & Filter
df['col'] for a column. df[condition] to filter rows. df[feature_cols] for X. df['target'] for y. Standard ML prep.
🧹
Clean Missing Data
df.isnull().sum() to detect. dropna() or fillna(mean) to fix. Always clean before training any ML model.
📉
GroupBy & Aggregate
df.groupby('col')['value'].mean() for group statistics. value_counts() for category counts. Foundation of feature engineering.
📚
Pandas Complete!
DataFrames mastered. Complete the Pandas Practical Lab with 5 tasks. Next: Data Visualisation — turning your Pandas data into charts with Matplotlib and Seaborn.
✓ Video — Done
✏ Practical Lab — Next
❓ Quiz — After Lab
Synkoc IT Services · Bangalore · support@synkoc.com · +91-9019532023