Synkoc AI/ML Internship · Week 2 · Lesson 4 of 13
Pandas
DataFrames & Data Cleaning
The most important data science library after NumPy. Load CSV files, clean messy data, filter rows, group by categories, and prepare any dataset for ML in minutes.
📚 DataFrame
💾 Load CSV
🧹 Clean Data
📉 GroupBy
🧑‍💻
Synkoc Instructor
AI/ML Professional · Bangalore
⏳ ~55 minutes
🔵 Core Tool
Why Pandas is Essential
Real-world data lives in CSV files, Excel sheets, and databases. Pandas gives every column a name, every row an index, and provides tools to clean, transform, and analyse tabular data before training any ML model.
📚
DataFrame
A 2D table with labelled rows and columns. Like a spreadsheet but in Python code.
df['age'] not X[:,0]
💾
Read CSV
One line to load any CSV file. Pandas infers column names and data types automatically.
pd.read_csv("data.csv")
🧹
Missing Values
Real datasets have gaps. Pandas detects and fills missing values. 80% of ML project time is here.
fillna(), dropna()
📉
GroupBy
Compute statistics per category. Average score per class. Total sales per region. One line.
df.groupby('class').mean()
Chapter 1 of 4
01
Creating DataFrames
From dictionaries, lists, and CSV files. Understanding index, columns, dtypes, and the describe function that summarises your entire dataset in one call.
Creating & Loading DataFrames
Three ways to create a DataFrame. The CSV method is how you will load every real dataset you work with.
📚
From a dictionary
Each key becomes a column name. Each list of values becomes the column data. Same length required.
df = pd.DataFrame({
"name": ["Priya","Rahul","Anjali"],
"score": [92, 75, 96],
"grade": ["A","B","A"]
})
⚡ Perfect for testing and small datasets
💾
From CSV file
One line loads any CSV. Pandas reads the header row as column names automatically.
df = pd.read_csv("students.csv")
print(df.shape) # (rows, cols)
print(df.columns) # column names
print(df.dtypes) # data types
print(df.head()) # first 5 rows
⚡ Every Kaggle dataset starts with pd.read_csv()
📉
df.describe()
One call returns count, mean, std, min, quartiles, and max for every numeric column. The statistics from Week 1 — all at once.
df.describe()
# count 3.0
# mean 87.7
# std 11.1
# min 75.0
# max 96.0
⚡ First thing to run on any new dataset
📊
df.info()
Shows column names, data types, non-null count, and memory usage. Essential for detecting missing values.
df.info()
# Column Non-Null Dtype
# name 3 non-null object
# score 3 non-null int64
# grade 3 non-null object
⚡ Non-Null count reveals missing data immediately
Chapter 2 of 4
02
Selecting & Filtering
Access columns by name. Filter rows by condition. Select specific subsets of your data. The Pandas equivalents of NumPy indexing — but with readable column names.
Selecting & Filtering Data
Select columns by name, not by number. Filter rows by readable conditions. Combine filters with & and |. This is how you prepare ML features and labels.
📉
df['column'] and df[condition]
Square brackets with a column name returns a Series — one column of data. Square brackets with a boolean condition filters rows. Use double brackets for multiple columns. The loc accessor uses labels, iloc uses integers.
df['score'] # one column (Series)
df[['name','score']] # multiple columns
df[df['score'] >= 90] # rows where score>=90
df[df['grade'] == 'A'] # rows where grade=A
# Combine conditions:
df[(df['score']>75) & (df['grade']=='A')]
⚡ ML Connection: X = df[feature_cols] extracts your feature matrix. y = df['target'] extracts your labels. This is the standard way to separate X and y before calling sklearn's train_test_split.
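The card above mentions that loc selects by label while iloc selects by integer position, but shows no code for either. A minimal sketch, using a small hypothetical frame for illustration:

```python
import pandas as pd

# Hypothetical sample data matching the lesson's examples
df = pd.DataFrame({
    "name": ["Priya", "Rahul", "Anjali"],
    "score": [92, 75, 96],
    "grade": ["A", "B", "A"],
})

# loc: select by label — boolean mask for rows, column names for columns
top = df.loc[df["score"] >= 90, ["name", "score"]]

# iloc: select by integer position — first two rows, first two columns
first_two = df.iloc[:2, :2]
```

Note that loc accepts the same boolean conditions as plain square brackets, plus a column selection in one call, which is often cleaner than chaining two filters.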
Chapter 3 of 4
03
Cleaning Missing Data
Real datasets always have missing values. Detecting and handling them is 80% of a data scientist's work. Three strategies — drop, fill with mean, fill with mode.
Handling Missing Values
NaN means Not a Number — a missing value. ML models cannot handle NaN. You must detect and handle every missing value before training.
🔍
Detect missing values
isnull() returns True where values are missing. Sum counts missing values per column.
df.isnull().sum()
# name 0
# score 2 <-- 2 missing!
# grade 1 <-- 1 missing!
⚡ Always run this on any new dataset
🗑️
Drop missing rows
dropna() removes any row containing a NaN. Use when missing data is rare and random.
df_clean = df.dropna()
# Removes rows with ANY missing
df.dropna(subset=['score'])
# Only drop if 'score' is missing
⚡ Dangerous if data is scarce — use carefully
📋
Fill with mean/median
fillna fills missing numeric values with the mean or median. Median is better when outliers exist.
mean_score = df['score'].mean()
df['score'] = df['score'].fillna(mean_score)
# Or for skewed data:
df['score'] = df['score'].fillna(df['score'].median())
⚡ Standard imputation strategy in ML pipelines
🆕
Fill with mode
For categorical columns, fill missing values with the most frequent category (mode).
mode_grade = df['grade'].mode()[0]
df['grade'] = df['grade'].fillna(mode_grade)
# mode() returns a Series;
# [0] gets the top mode value
⚡ Standard for categorical features
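The three strategies above combine into a short detect-then-fill pipeline. A sketch on a hypothetical dataset with gaps in both a numeric and a categorical column:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values (NaN / None)
df = pd.DataFrame({
    "name": ["Priya", "Rahul", "Anjali", "Vikram"],
    "score": [92.0, np.nan, 96.0, np.nan],
    "grade": ["A", "B", None, "A"],
})

# 1. Detect: count missing values per column
missing = df.isnull().sum()   # score: 2, grade: 1

# 2. Fill numeric column with the mean, categorical with the mode
df["score"] = df["score"].fillna(df["score"].mean())
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])

# 3. Verify: no NaN remains before training
assert df.isnull().sum().sum() == 0
```

The mean of the two known scores (92 and 96) is 94, so both gaps become 94.0; the most frequent grade is "A", so the missing grade becomes "A".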
Chapter 4 of 4
04
GroupBy & Aggregation
Compute statistics by group. Average score per class, total sales per region, count of students per grade. The foundation of feature engineering in ML.
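The chapter intro promises "average score per class" and "count of students per grade" in one line each. A minimal sketch with hypothetical class data:

```python
import pandas as pd

# Hypothetical class data for illustration
df = pd.DataFrame({
    "grade": ["A", "B", "A", "B", "A"],
    "score": [92, 75, 96, 70, 88],
})

# Mean score per grade — one row per group
avg = df.groupby("grade")["score"].mean()
# grade
# A    92.0
# B    72.5

# Count of students per grade
counts = df["grade"].value_counts()
```

groupby splits the rows by the grouping column, applies the aggregation to each group, and returns one row per group, exactly the per-category statistics used as engineered features in ML.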
Lesson Summary
You have completed Pandas. Here is what you can now do with any dataset:
📚
Create & Load DataFrames
pd.DataFrame() from dict, pd.read_csv() from files. df.head(), df.info(), df.describe() for instant analysis.
📉
Select & Filter
df['col'] for a column. df[condition] to filter rows. df[feature_cols] for X. df['target'] for y. Standard ML prep.
🧹
Clean Missing Data
df.isnull().sum() to detect. dropna() or fillna(mean) to fix. Always clean before training any ML model.
📉
GroupBy & Aggregate
df.groupby('col').mean() for group statistics. value_counts() for category counts. Foundation of feature engineering.
📚
Pandas Complete!
DataFrames mastered. Complete the Pandas Practical Lab with 5 tasks. Next: Data Visualisation — turning your Pandas data into charts with Matplotlib and Seaborn.
✓ Video — Done
✏ Practical Lab — Next
❓ Quiz — After Lab
Synkoc IT Services · Bangalore · support@synkoc.com · +91-9019532023