groupby — Aggregating Data Like SQL GROUP BY
Pandas groupby() is like SQL's GROUP BY — it splits data into groups based on column values, applies aggregation functions, and combines results.
Basic Grouping and Aggregation
import pandas as pd
# Swiggy orders dataset
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'restaurant': ['Dominos', 'KFC', 'Dominos', 'McDonalds', 'KFC', 'Subway'],
    'amount': [450, 650, 380, 720, 590, 420],
    'delivery_time': [35, 45, 30, 50, 40, 32]
})
# Total revenue by city (like SQL: SELECT city, SUM(amount) FROM orders GROUP BY city)
revenue_by_city = orders.groupby('city')['amount'].sum()
print(revenue_by_city)
# city
# Bangalore 720
# Delhi 1240
# Mumbai 1250
Multiple Aggregations
# Multiple metrics per group
city_stats = orders.groupby('city')['amount'].agg(['sum', 'mean', 'count'])
print(city_stats)
Output:
sum mean count
city
Bangalore 720 720.000000 1
Delhi 1240 620.000000 2
Mumbai 1250 416.666667 3
Aggregating Multiple Columns
# Different functions for different columns
city_stats = orders.groupby('city').agg({
    'amount': ['sum', 'mean'],
    'delivery_time': ['mean', 'max'],
    'order_id': 'count'
})
print(city_stats)
Named Aggregations (Cleaner Column Names)
# More readable output with custom column names
city_stats = orders.groupby('city').agg(
    total_revenue=('amount', 'sum'),
    avg_order_value=('amount', 'mean'),
    avg_delivery=('delivery_time', 'mean'),
    order_count=('order_id', 'count')
).reset_index()
print(city_stats)
Output:
city total_revenue avg_order_value avg_delivery order_count
0 Bangalore 720 720.000000 50.000000 1
1 Delhi 1240 620.000000 42.500000 2
2 Mumbai 1250 416.666667 32.333333 3
Grouping by Multiple Columns
# Revenue by city AND restaurant
city_restaurant = orders.groupby(['city', 'restaurant'])['amount'].sum()
print(city_restaurant)
# city restaurant
# Bangalore McDonalds 720
# Delhi KFC 1240
# Mumbai Dominos 830
# Subway 420
Zomato analyzes restaurant performance by city to decide where to expand. They group by city to find total orders, average order value, and delivery times. Mumbai might have high volume but low AOV, while Bangalore might have lower volume but twice the AOV.
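That kind of expansion analysis is a short chain in pandas. A minimal sketch, reusing the order columns from the dataset above (restaurant and delivery_time omitted for brevity):

```python
import pandas as pd

# Subset of the orders dataset from the start of this section
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'amount': [450, 650, 380, 720, 590, 420]
})

# Rank cities by average order value (AOV), keeping order volume alongside
ranked = (orders.groupby('city')
          .agg(order_count=('order_id', 'count'),
               aov=('amount', 'mean'))
          .sort_values('aov', ascending=False))
print(ranked)
# Bangalore tops AOV with a single order; Mumbai leads volume with the lowest AOV
```

Sorting the aggregated frame is what turns a plain groupby into a ranking you can act on.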
Advanced groupby Techniques
Beyond basic aggregations, groupby unlocks powerful patterns for complex analysis.
Custom Aggregation Functions
# Define custom function
def price_range(series):
    return series.max() - series.min()

city_stats = orders.groupby('city')['amount'].agg([
    ('mean', 'mean'),
    ('range', price_range),  # Custom function
    ('std', 'std')  # Standard deviation
])
Filter Groups After Aggregation
# Only cities with more than 2 orders
active_cities = orders.groupby('city').filter(lambda x: len(x) > 2)
print(active_cities)
# Returns full rows, not aggregated — only for Mumbai (3 orders)
Transform (Keep Original Shape)
transform() returns a result with the same shape as the original DataFrame — useful for adding group statistics to individual rows:
# Add city average to each row
orders['city_avg_amount'] = orders.groupby('city')['amount'].transform('mean')
print(orders[['order_id', 'city', 'amount', 'city_avg_amount']])
Output:
order_id city amount city_avg_amount
0 1001 Mumbai 450 416.666667
1 1002 Delhi 650 620.000000
2 1003 Mumbai 380 416.666667
3 1004 Bangalore 720 720.000000
4 1005 Delhi 590 620.000000
5 1006 Mumbai 420 416.666667
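Since transform() broadcasts the group statistic back onto every row, it also enables row-level filtering against a group benchmark. A sketch using the same orders frame, keeping only orders above their own city's average:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
    'amount': [450, 650, 380, 720, 590, 420]
})

# Compare each row's amount against its own city's mean
city_mean = orders.groupby('city')['amount'].transform('mean')
above_avg = orders[orders['amount'] > city_mean]
print(above_avg[['order_id', 'city', 'amount']])
# Orders 1001, 1002, 1006 beat their city average; Bangalore's lone order equals its mean
```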
Apply (Most Flexible)
apply() runs any function on each group:
# Get top 2 orders per city
top_2_per_city = orders.groupby('city').apply(lambda x: x.nlargest(2, 'amount')).reset_index(drop=True)
print(top_2_per_city)
Cumulative Aggregations
# Running total within each city
orders = orders.sort_values(['city', 'order_id'])
orders['cumulative_revenue'] = orders.groupby('city')['amount'].cumsum()
print(orders[['order_id', 'city', 'amount', 'cumulative_revenue']])
Performance Tip: For simple aggregations, use built-in functions ('sum', 'mean') instead of lambdas. Built-in functions are Cython-optimized and 10-100x faster on large datasets.
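The gap between the two paths can be measured directly. A rough benchmark sketch on a synthetic frame (the sizes and timings here are illustrative assumptions, not numbers from the text):

```python
import time
import numpy as np
import pandas as pd

# Synthetic data: 200k rows spread across 100 groups
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'group': rng.integers(0, 100, size=200_000),
    'value': rng.random(200_000)
})

def timed(fn):
    start = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - start

# Cython-optimized built-in vs Python-level lambda
fast, t_fast = timed(lambda: df.groupby('group')['value'].agg('sum'))
slow, t_slow = timed(lambda: df.groupby('group')['value'].agg(lambda s: s.sum()))

print(f"built-in: {t_fast:.4f}s  lambda: {t_slow:.4f}s")
assert np.allclose(fast, slow)  # same results, very different speed
```

The exact ratio depends on group count and data size, but the built-in path avoids calling back into Python once per group.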
merge and join — Combining DataFrames
Pandas merge() combines DataFrames based on common columns — like SQL JOINs. Essential for combining customer data, orders, products, etc.
Inner Join (Only Matching Rows)
# Customers DataFrame
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104],
    'name': ['Priya', 'Rahul', 'Anjali', 'Vikas'],
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai']
})
# Orders DataFrame
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003],
    'customer_id': [101, 102, 101],
    'amount': [2500, 3200, 1800]
})
# Inner join: only customers who have orders
result = pd.merge(customers, orders, on='customer_id', how='inner')
print(result)
Output:
customer_id name city order_id amount
0 101 Priya Mumbai 1001 2500
1 101 Priya Mumbai 1003 1800
2 102 Rahul Delhi 1002 3200
Left Join (All Rows from Left Table)
# Left join: all customers, with orders if they exist
result = pd.merge(customers, orders, on='customer_id', how='left')
print(result)
Output:
customer_id name city order_id amount
0 101 Priya Mumbai 1001.0 2500.0
1 101 Priya Mumbai 1003.0 1800.0
2 102 Rahul Delhi 1002.0 3200.0
3 103 Anjali Bangalore NaN NaN
4 104 Vikas Chennai NaN NaN
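When a left join turns up unexpected NaNs, merge's indicator=True parameter labels where each row came from, which makes unmatched keys easy to isolate. A sketch reusing the frames above:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104],
    'name': ['Priya', 'Rahul', 'Anjali', 'Vikas']
})
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003],
    'customer_id': [101, 102, 101],
    'amount': [2500, 3200, 1800]
})

# The extra _merge column is 'both', 'left_only', or 'right_only'
result = pd.merge(customers, orders, on='customer_id', how='left', indicator=True)
print(result[['customer_id', 'name', '_merge']])

# Customers with no orders at all
no_orders = result[result['_merge'] == 'left_only']
print(no_orders['name'].tolist())  # ['Anjali', 'Vikas']
```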
Right Join and Outer Join
# Right join: all orders, with customer info if available
result = pd.merge(customers, orders, on='customer_id', how='right')
# Outer join: all customers and all orders (union)
result = pd.merge(customers, orders, on='customer_id', how='outer')
Merging on Different Column Names
# If join columns have different names
customers = pd.DataFrame({
    'cust_id': [101, 102, 103],
    'name': ['Priya', 'Rahul', 'Anjali']
})
orders = pd.DataFrame({
    'order_id': [1001, 1002],
    'customer_id': [101, 102],
    'amount': [2500, 3200]
})
result = pd.merge(customers, orders, left_on='cust_id', right_on='customer_id', how='inner')
# Both cust_id and customer_id appear in the result — drop one if redundant
Merging on Multiple Columns
# Join on city AND date
df1 = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
    'date': ['2026-03-01', '2026-03-01', '2026-03-02'],
    'orders': [120, 95, 130]
})
df2 = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
    'date': ['2026-03-01', '2026-03-01', '2026-03-02'],
    'revenue': [45000, 38000, 52000]
})
result = pd.merge(df1, df2, on=['city', 'date'], how='inner')
print(result)
Using .join() Method
.join() is a shortcut when joining on index:
# Set index and join
customers_indexed = customers.set_index('cust_id')  # this frame's key column is cust_id
orders_indexed = orders.set_index('customer_id')
result = customers_indexed.join(orders_indexed, how='left')  # aligns on index values
SQL: SELECT * FROM customers c INNER JOIN orders o ON c.customer_id = o.customer_id
Pandas: pd.merge(customers, orders, on='customer_id', how='inner')
The logic is identical — only the syntax changes.
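One caveat the SQL analogy carries over: if the join key repeats on both sides, a merge silently multiplies rows. The validate parameter makes the expected key relationship explicit and raises if it is violated. A sketch (the one_to_one failure below is deliberate):

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [101, 102],
    'name': ['Priya', 'Rahul']
})
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003],
    'customer_id': [101, 102, 101],
    'amount': [2500, 3200, 1800]
})

# customer_id is unique on the left and may repeat on the right: passes
ok = pd.merge(customers, orders, on='customer_id', validate='one_to_many')
print(len(ok))  # 3

# customer_id 101 repeats in orders, so one_to_one fails loudly
try:
    pd.merge(orders, customers, on='customer_id', validate='one_to_one')
except pd.errors.MergeError as exc:
    print('caught:', exc)
```

Validating joins up front is much cheaper than debugging a silently duplicated downstream aggregate.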
Pivot Tables — Reshaping Data
Pivot tables transform row-based data into a cross-tabulated format — perfect for comparing groups across multiple dimensions.
Basic Pivot Table
# Sample data: orders by city and day
data = {
    'date': ['2026-03-20', '2026-03-20', '2026-03-21', '2026-03-21', '2026-03-22', '2026-03-22'],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
    'orders': [120, 95, 130, 102, 115, 98],
    'revenue': [45000, 38000, 52000, 41000, 48000, 39000]
}
df = pd.DataFrame(data)
# Pivot: rows=city, columns=date, values=revenue
pivot = df.pivot(index='city', columns='date', values='revenue')
print(pivot)
Output:
date 2026-03-20 2026-03-21 2026-03-22
city
Delhi 38000 41000 39000
Mumbai 45000 52000 48000
Pivot Table with Aggregation
pivot() requires unique index-column combinations. For aggregation, use pivot_table():
# If multiple entries per city-date, aggregate with sum
pivot = df.pivot_table(index='city', columns='date', values='revenue', aggfunc='sum')
# Multiple aggregations
pivot = df.pivot_table(index='city', columns='date', values='revenue', aggfunc=['sum', 'mean'])
Multi-Level Pivots
# Rows=city, Columns=date, Values=orders AND revenue
pivot = df.pivot_table(index='city', columns='date', values=['orders', 'revenue'])
print(pivot)
Filling Missing Values
# Fill NaN with 0 in pivot table
pivot = df.pivot_table(index='city', columns='date', values='revenue', fill_value=0)
Adding Totals (Margins)
# Add row and column totals
pivot = df.pivot_table(index='city', columns='date', values='revenue', aggfunc='sum', margins=True)
# Adds an 'All' row and column with totals
Unpivoting with melt()
melt() is the reverse of pivot — converts wide format to long format:
# Wide format
wide = pd.DataFrame({
    'city': ['Mumbai', 'Delhi'],
    'Jan': [45000, 38000],
    'Feb': [52000, 41000],
    'Mar': [48000, 39000]
})
# Convert to long format
long = wide.melt(id_vars='city', var_name='month', value_name='revenue')
print(long)
Output:
city month revenue
0 Mumbai Jan 45000
1 Delhi Jan 38000
2 Mumbai Feb 52000
3 Delhi Feb 41000
4 Mumbai Mar 48000
5 Delhi Mar 39000
When to Use Pivot vs Groupby: Use groupby when you want a flat aggregation (one result per group). Use pivot when you want to reshape data into a cross-tab format (compare groups across two dimensions).
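The relationship is tighter than it looks: a pivot_table with an aggfunc is equivalent to a groupby on both dimensions followed by unstack(). A quick sketch with a trimmed version of the df from above:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2026-03-20', '2026-03-20', '2026-03-21', '2026-03-21'],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Delhi'],
    'revenue': [45000, 38000, 52000, 41000]
})

via_pivot = df.pivot_table(index='city', columns='date', values='revenue', aggfunc='sum')
via_groupby = df.groupby(['city', 'date'])['revenue'].sum().unstack()

# Same numbers either way; pivot_table is the reshaping shortcut
print((via_pivot == via_groupby).all().all())
```

Knowing the equivalence helps when you need groupby-only features (like transform or filter) but still want a cross-tab at the end.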