Synkoc AI/ML Internship · Week 2 · Lesson 4 of 13
Pandas
DataFrames & Data Cleaning
The most important data science library after NumPy. Load CSV files, clean messy data, filter rows, group by categories, and prepare any dataset for ML in minutes.
📚 DataFrame
💾 Load CSV
🧹 Clean Data
📉 GroupBy
🧑‍💻
Synkoc Instructor
AI/ML Professional · Bangalore
⏳ ~55 minutes
🔵 Core Tool
Why Pandas is Essential
Real-world data lives in CSV files, Excel sheets, and databases. Pandas gives every column a name, every row an index, and provides tools to clean, transform, and analyse tabular data before training any ML model.
📚
DataFrame
A 2D table with labelled rows and columns. Like a spreadsheet but in Python code.
df['age'] not X[:,0]
💾
Read CSV
One line to load any CSV file. Pandas infers column names and data types automatically.
pd.read_csv("data.csv")
🧹
Missing Values
Real datasets have gaps. Pandas detects and fills missing values. 80% of ML project time is here.
fillna(), dropna()
📉
GroupBy
Compute statistics per category. Average score per class. Total sales per region. One line.
df.groupby('class').mean()
Chapter 1 of 4
01
Creating DataFrames
From dictionaries, lists, and CSV files. Understanding index, columns, dtypes, and the describe function that summarises your entire dataset in one call.
Creating & Loading DataFrames
Three ways to create a DataFrame. The CSV method is how you will load every real dataset you work with.
📚
From a dictionary
Each key becomes a column name. Each list of values becomes the column data. Same length required.
df = pd.DataFrame({
"name": ["Priya","Rahul","Anjali"],
"score": [92, 75, 96],
"grade": ["A","B","A"]
})
⚡ Perfect for testing and small datasets
💾
From CSV file
One line loads any CSV. Pandas reads the header row as column names automatically.
df = pd.read_csv("students.csv")
print(df.shape) # (rows, cols)
print(df.columns) # column names
print(df.dtypes) # data types
print(df.head()) # first 5 rows
⚡ Every Kaggle dataset starts with pd.read_csv()
📉
df.describe()
One call returns count, mean, std, min, quartiles, and max for every numeric column. The statistics from Week 1 — all at once.
df.describe()
# count 3.0
# mean 87.7
# std 11.1
# min 75.0
# max 96.0
⚡ First thing to run on any new dataset
📊
df.info()
Shows column names, data types, non-null count, and memory usage. Essential for detecting missing values.
df.info()
# Column Non-Null Dtype
# name 3 non-null object
# score 3 non-null int64
# grade 3 non-null object
⚡ Non-Null count reveals missing data immediately
Chapter 2 of 4
02
Selecting & Filtering
Access columns by name. Filter rows by condition. Select specific subsets of your data. The Pandas equivalents of NumPy indexing — but with readable column names.
Selecting & Filtering Data
Select columns by name, not by number. Filter rows by readable conditions. Combine filters with & and |. This is how you prepare ML features and labels.
📉
df['column'] and df[condition]
Square brackets with a column name returns a Series — one column of data. Square brackets with a boolean condition filters rows. Use double brackets for multiple columns. The loc accessor uses labels, iloc uses integers.
df['score'] # one column (Series)
df[['name','score']] # multiple columns
df[df['score'] >= 90] # rows where score>=90
df[df['grade'] == 'A'] # rows where grade=A
# Combine conditions:
df[(df['score']>75) & (df['grade']=='A')]
⚡ ML Connection: X = df[feature_cols] extracts your feature matrix. y = df['target'] extracts your labels. This is the standard way to separate X and y before calling sklearn's train_test_split.
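The card above mentions that loc selects by label while iloc selects by integer position, but shows no code for either. A minimal sketch, using a small hypothetical frame for illustration:

```python
import pandas as pd

# Hypothetical sample data matching the lesson's examples
df = pd.DataFrame({
    "name": ["Priya", "Rahul", "Anjali"],
    "score": [92, 75, 96],
    "grade": ["A", "B", "A"],
})

# loc: select by label — boolean mask for rows, column names for columns
top = df.loc[df["score"] >= 90, ["name", "score"]]

# iloc: select by integer position — first two rows, first two columns
first_two = df.iloc[:2, :2]
```

Note that loc accepts the same boolean conditions as plain square brackets, plus a column selection in one call, which is often cleaner than chaining two filters.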
Chapter 3 of 4
03
Cleaning Missing Data
Real datasets always have missing values. Detecting and handling them is 80% of a data scientist's work. Three strategies — drop, fill with mean, fill with mode.
Handling Missing Values
NaN means Not a Number — a missing value. ML models cannot handle NaN. You must detect and handle every missing value before training.
🔍
Detect missing values
isnull() returns True where values are missing. Sum counts missing values per column.
df.isnull().sum()
# name 0
# score 2 <-- 2 missing!
# grade 1 <-- 1 missing!
⚡ Always run this on any new dataset
🗑️
Drop missing rows
dropna() removes any row containing a NaN. Use when missing data is rare and random.
df_clean = df.dropna()
# Removes rows with ANY missing
df.dropna(subset=['score'])
# Only drop if 'score' is missing
⚡ Dangerous if data is scarce — use carefully
📋
Fill with mean/median
fillna fills missing numeric values with the mean or median. Median is better when outliers exist.
mean_score = df['score'].mean()
df['score'] = df['score'].fillna(mean_score)
# Or for skewed data:
df['score'] = df['score'].fillna(df['score'].median())
⚡ Standard imputation strategy in ML pipelines
🆕
Fill with mode
For categorical columns, fill missing values with the most frequent category (mode).
mode_grade = df['grade'].mode()[0]
df['grade'] = df['grade'].fillna(mode_grade)
# mode() returns a Series;
# [0] gets the top mode value
⚡ Standard for categorical features
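The three strategies above combine into a short detect-then-fill pipeline. A sketch on a hypothetical dataset with gaps in both a numeric and a categorical column:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values (NaN / None)
df = pd.DataFrame({
    "name": ["Priya", "Rahul", "Anjali", "Vikram"],
    "score": [92.0, np.nan, 96.0, np.nan],
    "grade": ["A", "B", None, "A"],
})

# 1. Detect: count missing values per column
missing = df.isnull().sum()   # score: 2, grade: 1

# 2. Fill numeric column with the mean, categorical with the mode
df["score"] = df["score"].fillna(df["score"].mean())
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])

# 3. Verify: no NaN remains before training
assert df.isnull().sum().sum() == 0
```

The mean of the two known scores (92 and 96) is 94, so both gaps become 94.0; the most frequent grade is "A", so the missing grade becomes "A".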
Chapter 4 of 4
04
GroupBy & Aggregation
Compute statistics by group. Average score per class, total sales per region, count of students per grade. The foundation of feature engineering in ML.
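The chapter intro promises "average score per class" and "count of students per grade" in one line each. A minimal sketch with hypothetical class data:

```python
import pandas as pd

# Hypothetical class data for illustration
df = pd.DataFrame({
    "grade": ["A", "B", "A", "B", "A"],
    "score": [92, 75, 96, 70, 88],
})

# Mean score per grade — one row per group
avg = df.groupby("grade")["score"].mean()
# grade
# A    92.0
# B    72.5

# Count of students per grade
counts = df["grade"].value_counts()
```

groupby splits the rows by the grouping column, applies the aggregation to each group, and returns one row per group, exactly the per-category statistics used as engineered features in ML.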
Lesson Summary
You have completed Pandas. Here is what you can now do with any dataset:
📚
Create & Load DataFrames
pd.DataFrame() from dict, pd.read_csv() from files. df.head(), df.info(), df.describe() for instant analysis.
📉
Select & Filter
df['col'] for a column. df[condition] to filter rows. df[feature_cols] for X. df['target'] for y. Standard ML prep.
🧹
Clean Missing Data
df.isnull().sum() to detect. dropna() or fillna(mean) to fix. Always clean before training any ML model.
📉
GroupBy & Aggregate
df.groupby('col').mean() for group statistics. value_counts() for category counts. Foundation of feature engineering.
📚
Pandas Complete!
DataFrames mastered. Complete the Pandas Practical Lab with 5 tasks. Next: Data Visualisation — turning your Pandas data into charts with Matplotlib and Seaborn.
✓ Video — Done
✏ Practical Lab — Next
❓ Quiz — After Lab
Synkoc IT Services · Bangalore · support@synkoc.com · +91-9019532023