Topic 7 of 12

Pandas — Data Cleaning & Manipulation

Pandas is Excel on steroids. Handle millions of rows with just a few lines of code.


What is Pandas?

Pandas is Python's most powerful library for data manipulation and analysis. Think of it as Excel, but:

✅ Orders of magnitude faster than Excel for large datasets
✅ Handles millions of rows without crashing
✅ Fully automated (no manual clicking)
✅ More powerful operations than Excel formulas
✅ Industry standard used by data teams worldwide

import pandas as pd
import numpy as np

Real-world note: Pandas is the de facto standard for data analysis in Python — analysts who work with data in Python reach for it daily.


DataFrames — The Core Structure

A DataFrame is like an Excel spreadsheet in Python — rows and columns of data.

Creating a DataFrame

# From a dictionary
data = {
    'name': ['Rahul', 'Priya', 'Amit', 'Sneha'],
    'age': [25, 28, 24, 26],
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai'],
    'salary': [50000, 60000, 45000, 55000]
}

df = pd.DataFrame(data)
print(df)

Output:

    name  age       city  salary
0  Rahul   25     Mumbai   50000
1  Priya   28      Delhi   60000
2   Amit   24  Bangalore   45000
3  Sneha   26     Mumbai   55000


Reading Data from Files

Read CSV

# Basic read
df = pd.read_csv('sales_data.csv')

# With options
df = pd.read_csv('data.csv',
                 sep=',',           # Delimiter
                 encoding='utf-8',  # Encoding
                 na_values=['NA', 'Missing'])  # Treat these as null

Read Excel

# Single sheet
df = pd.read_excel('data.xlsx', sheet_name='Sales')

# Multiple sheets
sales_df = pd.read_excel('data.xlsx', sheet_name='Sales')
products_df = pd.read_excel('data.xlsx', sheet_name='Products')

Read from SQL Database

import sqlite3

conn = sqlite3.connect('database.db')
df = pd.read_sql("SELECT * FROM orders WHERE date >= '2026-01-01'", conn)
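When the filter value comes from a variable, it is safer to let the database driver substitute it via the `params` argument than to build the SQL string by hand. A self-contained sketch using an in-memory SQLite database (the table name and columns here are made up for illustration):

```python
import sqlite3
import pandas as pd

# Build a tiny throwaway database to query against
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (id INTEGER, date TEXT, amount REAL)')
conn.executemany('INSERT INTO orders VALUES (?, ?, ?)', [
    (1, '2025-12-20', 100.0),
    (2, '2026-01-15', 250.0),
])
conn.commit()

# params keeps SQL and values separate (no manual string formatting)
df = pd.read_sql('SELECT * FROM orders WHERE date >= ?',
                 conn, params=('2026-01-01',))
print(df)
```

The `?` placeholder is SQLite's parameter style; other drivers use `%s` or named parameters, so check your driver's docs.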

Exploring Your Data

Quick Overview

# First 5 rows
df.head()

# Last 5 rows
df.tail()

# Random 5 rows
df.sample(5)

# Shape (rows, columns)
print(df.shape)  # e.g. (1000, 4)

# Column names
print(df.columns)

# Data types
print(df.dtypes)

# Summary statistics
print(df.describe())

# Detailed info
df.info()

Example Output of df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    1000 non-null   object
 1   age     998 non-null    float64
 2   city    995 non-null    object
 3   salary  1000 non-null   int64
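Another quick check that pairs well with `info()` is `value_counts()`, which tabulates how often each value appears in a column. A small sketch, reusing the column names from the earlier example:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Rahul', 'Priya', 'Amit', 'Sneha'],
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai'],
})

# Frequency of each unique value, most common first
counts = df['city'].value_counts()
print(counts)  # Mumbai appears twice, Delhi and Bangalore once each

# Proportions instead of raw counts
print(df['city'].value_counts(normalize=True))
```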

Selecting Data

Select Columns

# Single column (returns Series)
ages = df['age']

# Multiple columns (returns DataFrame)
subset = df[['name', 'salary']]

# All columns except one (returns a new copy; df itself is unchanged)
subset = df.drop('age', axis=1)

Select Rows by Position

# First row
df.iloc[0]

# First 10 rows
df.iloc[0:10]

# Specific rows
df.iloc[[0, 5, 10]]

# Rows 10-20, columns 2-4
df.iloc[10:20, 2:4]

Select Rows by Label

# Set index first
df_indexed = df.set_index('name')

# Select by index
df_indexed.loc['Rahul']

Filtering Data

Single Condition

# High earners
high_earners = df[df['salary'] > 50000]

# From Mumbai
mumbai_people = df[df['city'] == 'Mumbai']

# Age below 25
young = df[df['age'] < 25]

Multiple Conditions (AND)

# Young AND high earners
young_rich = df[(df['age'] < 30) & (df['salary'] > 50000)]

Multiple Conditions (OR)

# Mumbai OR Delhi
target_cities = df[(df['city'] == 'Mumbai') | (df['city'] == 'Delhi')]

Using .isin()

# Multiple cities
cities = ['Mumbai', 'Delhi', 'Bangalore']
metro_people = df[df['city'].isin(cities)]

String Contains

# Names starting with 'R'
r_names = df[df['name'].str.startswith('R')]

# Cities containing 'ore'
ore_cities = df[df['city'].str.contains('ore')]
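One caveat: the `.str` matchers are case-sensitive by default, and they propagate missing values instead of returning False for them. The `case` and `na` parameters handle both — a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Mumbai', 'DELHI', None, 'bangalore']})

# case=False matches regardless of capitalisation;
# na=False treats missing values as "no match" rather than NaN
mask = df['city'].str.contains('delhi', case=False, na=False)
print(df[mask])
```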

Data Cleaning

1. Handling Missing Values

# Check for missing values
print(df.isnull().sum())

# Percentage of missing values per column
print(df.isnull().sum() / len(df) * 100)

# Drop rows with ANY missing value
df_clean = df.dropna()

# Drop rows where specific column is null
df_clean = df.dropna(subset=['email'])

# Drop columns with too many nulls
df_clean = df.dropna(axis=1, thresh=500)  # Keep cols with 500+ non-null

# Fill missing values (assign back; chained inplace fills are deprecated
# in modern pandas)
df['age'] = df['age'].fillna(df['age'].mean())  # Fill with mean
df['city'] = df['city'].fillna('Unknown')       # Fill with constant
df['salary'] = df['salary'].ffill()             # Forward fill

2. Removing Duplicates

# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")

# Remove fully duplicated rows (keeps the first occurrence)
df_unique = df.drop_duplicates()

# Remove duplicates based on specific columns
df_unique = df.drop_duplicates(subset=['email'], keep='first')

3. Cleaning Text Data

# Convert to lowercase
df['name'] = df['name'].str.lower()

# Remove leading/trailing whitespace
df['name'] = df['name'].str.strip()

# Replace text
df['city'] = df['city'].str.replace('Banglore', 'Bangalore')

# Remove special characters
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)  # Keep only digits

4. Converting Data Types

# Convert to numeric
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Convert to categorical (saves memory)
df['city'] = df['city'].astype('category')
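The memory saving from `category` is easy to verify with `memory_usage(deep=True)`. A quick sketch comparing the two dtypes on a repetitive column:

```python
import pandas as pd

# A column with few unique values repeated many times
cities = pd.Series(['Mumbai', 'Delhi', 'Bangalore'] * 10_000)

as_object = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

The saving comes from storing each unique string once and replacing the column with small integer codes, so it helps most when the number of unique values is small relative to the number of rows.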

Sorting Data

# Sort by one column
df_sorted = df.sort_values('salary', ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(['city', 'salary'], ascending=[True, False])

# Reset index after sorting
df_sorted = df_sorted.reset_index(drop=True)
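When you only need the top or bottom few rows, `nlargest()` and `nsmallest()` are a handy shortcut for `sort_values(...).head()`. A small sketch with the sample salaries:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Rahul', 'Priya', 'Amit', 'Sneha'],
    'salary': [50000, 60000, 45000, 55000],
})

# Top 2 salaries, highest first
top2 = df.nlargest(2, 'salary')
print(top2)

# And the lowest earner
print(df.nsmallest(1, 'salary'))
```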

Creating New Columns

Simple Calculations

# Add new column
df['salary_with_gst'] = df['salary'] * 1.18

# Multiple columns
df['annual_salary'] = df['salary'] * 12

Conditional Columns

# Using np.where (if-else)
df['category'] = np.where(df['salary'] > 50000, 'High', 'Low')

# Using apply with lambda
df['age_group'] = df['age'].apply(
    lambda x: 'Young' if x < 25 else ('Mid' if x < 35 else 'Senior')
)

# Using conditions
conditions = [
    df['salary'] > 70000,
    df['salary'] > 50000,
    df['salary'] > 30000
]
choices = ['Platinum', 'Gold', 'Silver']
df['tier'] = np.select(conditions, choices, default='Bronze')

GroupBy — Aggregations

Basic GroupBy

# Average salary by city
city_avg = df.groupby('city')['salary'].mean()

# Count by city
city_count = df.groupby('city').size()

# Multiple aggregations
city_stats = df.groupby('city')['salary'].agg(['mean', 'min', 'max', 'count'])

Advanced GroupBy

# Multiple columns
multi_group = df.groupby(['city', 'age_group'])['salary'].mean()

# Multiple aggregations on different columns
summary = df.groupby('city').agg({
    'salary': ['mean', 'sum'],
    'age': 'mean',
    'name': 'count'
}).reset_index()

# Custom aggregation
df.groupby('city')['salary'].agg(
    avg_salary='mean',
    total_salary='sum',
    salary_range=lambda x: x.max() - x.min()
)
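Note that `agg()` collapses each group down to one row. Its sibling `transform()` returns one value per original row, which is handy for comparing each row against its own group. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
    'salary': [50000, 60000, 55000, 40000],
})

# transform() broadcasts the group result back onto every row
df['city_avg'] = df.groupby('city')['salary'].transform('mean')
df['vs_city_avg'] = df['salary'] - df['city_avg']
print(df)
```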

Merging DataFrames

Sample Data

# Customers
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Rahul', 'Priya', 'Amit']
})

# Orders
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 2],
    'amount': [5000, 3000, 2000]
})

Inner Join

merged = pd.merge(customers, orders, on='customer_id', how='inner')

Left Join

merged = pd.merge(customers, orders, on='customer_id', how='left')
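It helps to see what a left join does with unmatched rows: every left-side row is kept, and missing right-side values come back as NaN. A sketch using the sample frames above:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Rahul', 'Priya', 'Amit'],
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 2],
    'amount': [5000, 3000, 2000],
})

# Rahul appears twice (two orders); Amit has no orders,
# so his order columns are NaN
merged = pd.merge(customers, orders, on='customer_id', how='left')
print(merged)
print(merged['amount'].isna().sum())  # customers without a matching order
```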

Multiple Keys

merged = pd.merge(df1, df2, on=['customer_id', 'date'], how='inner')

Pivot Tables

# Create pivot table
pivot = df.pivot_table(
    values='salary',
    index='city',
    columns='age_group',
    aggfunc='mean'
)

# Multiple aggregations
pivot_multi = df.pivot_table(
    values='salary',
    index='city',
    aggfunc=['mean', 'count', 'sum']
)
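Two optional parameters worth knowing: `fill_value` replaces combinations that have no data, and `margins=True` appends an "All" row and column of totals. A small sketch on a hypothetical mini dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Mumbai', 'Mumbai', 'Delhi'],
    'age_group': ['Young', 'Mid', 'Young'],
    'salary': [45000, 55000, 60000],
})

# Delhi has no 'Mid' rows, so that cell would be NaN without fill_value;
# margins=True adds an 'All' row/column of grand aggregates
pivot = df.pivot_table(
    values='salary',
    index='city',
    columns='age_group',
    aggfunc='mean',
    fill_value=0,
    margins=True,
)
print(pivot)
```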

Real-World Example: Sales Analysis

import pandas as pd

# Read sales data
sales = pd.read_csv('sales_2026.csv')

# 1. Data Cleaning
sales = sales.dropna(subset=['amount', 'customer_id'])
sales = sales.drop_duplicates()
sales['date'] = pd.to_datetime(sales['date'])
sales['city'] = sales['city'].str.strip().str.title()

# 2. Feature Engineering
sales['month'] = sales['date'].dt.month
sales['quarter'] = sales['date'].dt.quarter
sales['is_high_value'] = sales['amount'] > 10000

# 3. Filter for Q1 2026
q1_sales = sales[
    (sales['date'] >= '2026-01-01') &
    (sales['date'] <= '2026-03-31')
]

# 4. Aggregations
revenue_by_city = q1_sales.groupby('city').agg({
    'amount': 'sum',
    'order_id': 'count'
}).rename(columns={'amount': 'total_revenue', 'order_id': 'order_count'})

# 5. Sort and get top 10 cities
top_cities = revenue_by_city.sort_values('total_revenue', ascending=False).head(10)

# 6. Export results
top_cities.to_csv('top_cities_q1_2026.csv')
print("Analysis complete!")

Exporting Data

# To CSV
df.to_csv('output.csv', index=False)

# To Excel
df.to_excel('output.xlsx', sheet_name='Sales', index=False)

# To SQL
df.to_sql('sales_table', conn, if_exists='replace', index=False)

# To JSON
df.to_json('output.json', orient='records')

Common Mistakes & Solutions

❌ Mistake 1: SettingWithCopyWarning

Problem:

df[df['age'] > 25]['salary'] = 60000  # Warning!

Solution:

df.loc[df['age'] > 25, 'salary'] = 60000  # Correct
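If the goal is to keep working with a filtered subset, an explicit `.copy()` makes the intent unambiguous and avoids the warning entirely. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 30, 28], 'salary': [40000, 50000, 60000]})

# .copy() makes the subset independent of df, so assigning new
# columns to it is safe and does not touch the original
seniors = df[df['age'] > 25].copy()
seniors['bonus'] = seniors['salary'] * 0.1
print(seniors)

print(df.columns.tolist())  # original df is unchanged
```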

❌ Mistake 2: Forgetting to Reassign the Result

Problem:

df.dropna()  # Doesn't change df

Solution:

df = df.dropna()  # Reassign the result
# (dropna(inplace=True) also works, but reassignment is the
# preferred style in modern pandas)

❌ Mistake 3: Wrong Operator for Conditions

Problem:

df[df['age'] > 25 and df['salary'] > 50000]  # Error!

Solution:

df[(df['age'] > 25) & (df['salary'] > 50000)]  # Correct
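For readable multi-condition filters there is also `df.query()`, which accepts a string expression where plain `and`/`or` are allowed. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 30, 28], 'salary': [40000, 60000, 52000]})

# Same filter as the bracket syntax, but as one readable expression
result = df.query('age > 25 and salary > 50000')
print(result)
```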

Summary

✅ DataFrames are like Excel tables in Python
✅ Read data from CSV, Excel, SQL with pd.read_*
✅ Filter rows: df[df['col'] > value]
✅ Clean data: dropna(), drop_duplicates(), fillna()
✅ Create columns with calculations and conditions
✅ GroupBy for aggregations: df.groupby('col').agg()
✅ Merge DataFrames with pd.merge()
✅ Export results to CSV, Excel, SQL

Next Topic: Data Visualization Principles

Ready to turn your data into beautiful charts? Let's go! 📊