Topic 7 of 12

Pandas — Data Cleaning & Manipulation

Pandas is Excel on steroids. Handle millions of rows with just a few lines of code.


What is Pandas?

Pandas is Python's most powerful library for data manipulation and analysis. Think of it as Excel, but:

✅ Orders of magnitude faster than Excel for large datasets
✅ Handles millions of rows without crashing
✅ Fully automated (no manual clicking)
✅ More powerful operations than Excel formulas
✅ Industry standard used by data teams worldwide

import pandas as pd
import numpy as np

Real-world note: Pandas is the de facto standard for data analysis in Python — analysts who work with data in Python reach for it daily.


DataFrames — The Core Structure

A DataFrame is like an Excel spreadsheet in Python — rows and columns of data.

Creating a DataFrame

# From a dictionary
data = {
    'name': ['Rahul', 'Priya', 'Amit', 'Sneha'],
    'age': [25, 28, 24, 26],
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai'],
    'salary': [50000, 60000, 45000, 55000]
}

df = pd.DataFrame(data)
print(df)

Output:

    name  age       city  salary
0  Rahul   25     Mumbai   50000
1  Priya   28      Delhi   60000
2   Amit   24  Bangalore   45000
3  Sneha   26     Mumbai   55000


Reading Data from Files

Read CSV

# Basic read
df = pd.read_csv('sales_data.csv')

# With options
df = pd.read_csv('data.csv',
                 sep=',',           # Delimiter
                 encoding='utf-8',  # Encoding
                 na_values=['NA', 'Missing'])  # Treat these as null

Read Excel

# Single sheet
df = pd.read_excel('data.xlsx', sheet_name='Sales')

# Multiple sheets
sales_df = pd.read_excel('data.xlsx', sheet_name='Sales')
products_df = pd.read_excel('data.xlsx', sheet_name='Products')

Read from SQL Database

import sqlite3

conn = sqlite3.connect('database.db')
df = pd.read_sql("SELECT * FROM orders WHERE date >= '2026-01-01'", conn)
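When the filter value comes from a variable, it is safer to let the database driver substitute it via the `params` argument than to build the SQL string by hand. A self-contained sketch using an in-memory SQLite database (the table name and columns here are made up for illustration):

```python
import sqlite3
import pandas as pd

# Build a tiny throwaway database to query against
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE orders (id INTEGER, date TEXT, amount REAL)')
conn.executemany('INSERT INTO orders VALUES (?, ?, ?)', [
    (1, '2025-12-20', 100.0),
    (2, '2026-01-15', 250.0),
])
conn.commit()

# params keeps SQL and values separate (no manual string formatting)
df = pd.read_sql('SELECT * FROM orders WHERE date >= ?',
                 conn, params=('2026-01-01',))
print(df)
```

The `?` placeholder is SQLite's parameter style; other drivers use `%s` or named parameters, so check your driver's docs.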

Exploring Your Data

Quick Overview

# First 5 rows
df.head()

# Last 5 rows
df.tail()

# Random 5 rows
df.sample(5)

# Shape (rows, columns)
print(df.shape)  # e.g. (1000, 4)

# Column names
print(df.columns)

# Data types
print(df.dtypes)

# Summary statistics
print(df.describe())

# Detailed info
df.info()

Example Output of df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    1000 non-null   object
 1   age     998 non-null    float64
 2   city    995 non-null    object
 3   salary  1000 non-null   int64
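Another quick check that pairs well with `info()` is `value_counts()`, which tabulates how often each value appears in a column. A small sketch, reusing the column names from the earlier example:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Rahul', 'Priya', 'Amit', 'Sneha'],
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai'],
})

# Frequency of each unique value, most common first
counts = df['city'].value_counts()
print(counts)  # Mumbai appears twice, Delhi and Bangalore once each

# Proportions instead of raw counts
print(df['city'].value_counts(normalize=True))
```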

Selecting Data

Select Columns

# Single column (returns Series)
ages = df['age']

# Multiple columns (returns DataFrame)
subset = df[['name', 'salary']]

# All columns except one (returns a new copy; df itself is unchanged)
subset = df.drop('age', axis=1)

Select Rows by Position

# First row
df.iloc[0]

# First 10 rows
df.iloc[0:10]

# Specific rows
df.iloc[[0, 5, 10]]

# Rows 10-20, columns 2-4
df.iloc[10:20, 2:4]

Select Rows by Label

# Set index first
df_indexed = df.set_index('name')

# Select by index
df_indexed.loc['Rahul']

Filtering Data

Single Condition

# High earners
high_earners = df[df['salary'] > 50000]

# From Mumbai
mumbai_people = df[df['city'] == 'Mumbai']

# Age below 25
young = df[df['age'] < 25]

Multiple Conditions (AND)

# Young AND high earners
young_rich = df[(df['age'] < 30) & (df['salary'] > 50000)]

Multiple Conditions (OR)

# Mumbai OR Delhi
target_cities = df[(df['city'] == 'Mumbai') | (df['city'] == 'Delhi')]

Using .isin()

# Multiple cities
cities = ['Mumbai', 'Delhi', 'Bangalore']
metro_people = df[df['city'].isin(cities)]

String Contains

# Names starting with 'R'
r_names = df[df['name'].str.startswith('R')]

# Cities containing 'ore'
ore_cities = df[df['city'].str.contains('ore')]
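One caveat: the `.str` matchers are case-sensitive by default, and they propagate missing values instead of returning False for them. The `case` and `na` parameters handle both — a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Mumbai', 'DELHI', None, 'bangalore']})

# case=False matches regardless of capitalisation;
# na=False treats missing values as "no match" rather than NaN
mask = df['city'].str.contains('delhi', case=False, na=False)
print(df[mask])
```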

Data Cleaning

1. Handling Missing Values

# Check for missing values
print(df.isnull().sum())

# Percentage of missing values per column
print(df.isnull().sum() / len(df) * 100)

# Drop rows with ANY missing value
df_clean = df.dropna()

# Drop rows where specific column is null
df_clean = df.dropna(subset=['email'])

# Drop columns with too many nulls
df_clean = df.dropna(axis=1, thresh=500)  # Keep cols with 500+ non-null

# Fill missing values (assign back; chained inplace fills are deprecated
# in modern pandas)
df['age'] = df['age'].fillna(df['age'].mean())  # Fill with mean
df['city'] = df['city'].fillna('Unknown')       # Fill with constant
df['salary'] = df['salary'].ffill()             # Forward fill

2. Removing Duplicates

# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")

# Remove fully duplicated rows (keeps the first occurrence)
df_unique = df.drop_duplicates()

# Remove duplicates based on specific columns
df_unique = df.drop_duplicates(subset=['email'], keep='first')

3. Cleaning Text Data

# Convert to lowercase
df['name'] = df['name'].str.lower()

# Remove leading/trailing whitespace
df['name'] = df['name'].str.strip()

# Replace text
df['city'] = df['city'].str.replace('Banglore', 'Bangalore')

# Remove special characters
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)  # Keep only digits

4. Converting Data Types

# Convert to numeric
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Convert to categorical (saves memory)
df['city'] = df['city'].astype('category')
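The memory saving from `category` is easy to verify with `memory_usage(deep=True)`. A quick sketch comparing the two dtypes on a repetitive column:

```python
import pandas as pd

# A column with few unique values repeated many times
cities = pd.Series(['Mumbai', 'Delhi', 'Bangalore'] * 10_000)

as_object = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

The saving comes from storing each unique string once and replacing the column with small integer codes, so it helps most when the number of unique values is small relative to the number of rows.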

Sorting Data

# Sort by one column
df_sorted = df.sort_values('salary', ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(['city', 'salary'], ascending=[True, False])

# Reset index after sorting
df_sorted = df_sorted.reset_index(drop=True)
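When you only need the top or bottom few rows, `nlargest()` and `nsmallest()` are a handy shortcut for `sort_values(...).head()`. A small sketch with the sample salaries:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Rahul', 'Priya', 'Amit', 'Sneha'],
    'salary': [50000, 60000, 45000, 55000],
})

# Top 2 salaries, highest first
top2 = df.nlargest(2, 'salary')
print(top2)

# And the lowest earner
print(df.nsmallest(1, 'salary'))
```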

Creating New Columns

Simple Calculations

# Add new column
df['salary_with_gst'] = df['salary'] * 1.18

# Multiple columns
df['annual_salary'] = df['salary'] * 12

Conditional Columns

# Using np.where (if-else)
df['category'] = np.where(df['salary'] > 50000, 'High', 'Low')

# Using apply with lambda
df['age_group'] = df['age'].apply(
    lambda x: 'Young' if x < 25 else ('Mid' if x < 35 else 'Senior')
)

# Using conditions
conditions = [
    df['salary'] > 70000,
    df['salary'] > 50000,
    df['salary'] > 30000
]
choices = ['Platinum', 'Gold', 'Silver']
df['tier'] = np.select(conditions, choices, default='Bronze')

GroupBy — Aggregations

Basic GroupBy

# Average salary by city
city_avg = df.groupby('city')['salary'].mean()

# Count by city
city_count = df.groupby('city').size()

# Multiple aggregations
city_stats = df.groupby('city')['salary'].agg(['mean', 'min', 'max', 'count'])

Advanced GroupBy

# Multiple columns
multi_group = df.groupby(['city', 'age_group'])['salary'].mean()

# Multiple aggregations on different columns
summary = df.groupby('city').agg({
    'salary': ['mean', 'sum'],
    'age': 'mean',
    'name': 'count'
}).reset_index()

# Custom aggregation
df.groupby('city')['salary'].agg(
    avg_salary='mean',
    total_salary='sum',
    salary_range=lambda x: x.max() - x.min()
)
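Note that `agg()` collapses each group down to one row. Its sibling `transform()` returns one value per original row, which is handy for comparing each row against its own group. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
    'salary': [50000, 60000, 55000, 40000],
})

# transform() broadcasts the group result back onto every row
df['city_avg'] = df.groupby('city')['salary'].transform('mean')
df['vs_city_avg'] = df['salary'] - df['city_avg']
print(df)
```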

Merging DataFrames

Sample Data

# Customers
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Rahul', 'Priya', 'Amit']
})

# Orders
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 2],
    'amount': [5000, 3000, 2000]
})

Inner Join

merged = pd.merge(customers, orders, on='customer_id', how='inner')

Left Join

merged = pd.merge(customers, orders, on='customer_id', how='left')
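It helps to see what a left join does with unmatched rows: every left-side row is kept, and missing right-side values come back as NaN. A sketch using the sample frames above:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Rahul', 'Priya', 'Amit'],
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 2],
    'amount': [5000, 3000, 2000],
})

# Rahul appears twice (two orders); Amit has no orders,
# so his order columns are NaN
merged = pd.merge(customers, orders, on='customer_id', how='left')
print(merged)
print(merged['amount'].isna().sum())  # customers without a matching order
```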

Multiple Keys

merged = pd.merge(df1, df2, on=['customer_id', 'date'], how='inner')

Pivot Tables

# Create pivot table
pivot = df.pivot_table(
    values='salary',
    index='city',
    columns='age_group',
    aggfunc='mean'
)

# Multiple aggregations
pivot_multi = df.pivot_table(
    values='salary',
    index='city',
    aggfunc=['mean', 'count', 'sum']
)
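Two optional parameters worth knowing: `fill_value` replaces combinations that have no data, and `margins=True` appends an "All" row and column of totals. A small sketch on a hypothetical mini dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Mumbai', 'Mumbai', 'Delhi'],
    'age_group': ['Young', 'Mid', 'Young'],
    'salary': [45000, 55000, 60000],
})

# Delhi has no 'Mid' rows, so that cell would be NaN without fill_value;
# margins=True adds an 'All' row/column of grand aggregates
pivot = df.pivot_table(
    values='salary',
    index='city',
    columns='age_group',
    aggfunc='mean',
    fill_value=0,
    margins=True,
)
print(pivot)
```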

Real-World Example: Sales Analysis

import pandas as pd

# Read sales data
sales = pd.read_csv('sales_2026.csv')

# 1. Data Cleaning
sales = sales.dropna(subset=['amount', 'customer_id'])
sales = sales.drop_duplicates()
sales['date'] = pd.to_datetime(sales['date'])
sales['city'] = sales['city'].str.strip().str.title()

# 2. Feature Engineering
sales['month'] = sales['date'].dt.month
sales['quarter'] = sales['date'].dt.quarter
sales['is_high_value'] = sales['amount'] > 10000

# 3. Filter for Q1 2026
q1_sales = sales[
    (sales['date'] >= '2026-01-01') &
    (sales['date'] <= '2026-03-31')
]

# 4. Aggregations
revenue_by_city = q1_sales.groupby('city').agg({
    'amount': 'sum',
    'order_id': 'count'
}).rename(columns={'amount': 'total_revenue', 'order_id': 'order_count'})

# 5. Sort and get top 10 cities
top_cities = revenue_by_city.sort_values('total_revenue', ascending=False).head(10)

# 6. Export results
top_cities.to_csv('top_cities_q1_2026.csv')
print("Analysis complete!")

Exporting Data

# To CSV
df.to_csv('output.csv', index=False)

# To Excel
df.to_excel('output.xlsx', sheet_name='Sales', index=False)

# To SQL
df.to_sql('sales_table', conn, if_exists='replace', index=False)

# To JSON
df.to_json('output.json', orient='records')

Common Mistakes & Solutions

❌ Mistake 1: SettingWithCopyWarning

Problem:

df[df['age'] > 25]['salary'] = 60000  # Warning!

Solution:

df.loc[df['age'] > 25, 'salary'] = 60000  # Correct
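If the goal is to keep working with a filtered subset, an explicit `.copy()` makes the intent unambiguous and avoids the warning entirely. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 30, 28], 'salary': [40000, 50000, 60000]})

# .copy() makes the subset independent of df, so assigning new
# columns to it is safe and does not touch the original
seniors = df[df['age'] > 25].copy()
seniors['bonus'] = seniors['salary'] * 0.1
print(seniors)

print(df.columns.tolist())  # original df is unchanged
```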

❌ Mistake 2: Forgetting to Reassign the Result

Problem:

df.dropna()  # Doesn't change df

Solution:

df = df.dropna()  # Reassign the result
# (dropna(inplace=True) also works, but reassignment is the
# preferred style in modern pandas)

❌ Mistake 3: Wrong Operator for Conditions

Problem:

df[df['age'] > 25 and df['salary'] > 50000]  # Error!

Solution:

df[(df['age'] > 25) & (df['salary'] > 50000)]  # Correct
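For readable multi-condition filters there is also `df.query()`, which accepts a string expression where plain `and`/`or` are allowed. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 30, 28], 'salary': [40000, 60000, 52000]})

# Same filter as the bracket syntax, but as one readable expression
result = df.query('age > 25 and salary > 50000')
print(result)
```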

Summary

✅ DataFrames are like Excel tables in Python
✅ Read data from CSV, Excel, SQL with pd.read_*
✅ Filter rows: df[df['col'] > value]
✅ Clean data: dropna(), drop_duplicates(), fillna()
✅ Create columns with calculations and conditions
✅ GroupBy for aggregations: df.groupby('col').agg()
✅ Merge DataFrames with pd.merge()
✅ Export results to CSV, Excel, SQL

Next Topic: Data Visualization Principles

Ready to turn your data into beautiful charts? Let's go! 📊