Introduction
Pandas groupby and merge are the building blocks of data analysis in Python. They're how you answer questions like:
- Flipkart: "What's the average order value per category?"
- Swiggy: "How many unique customers ordered from each restaurant?"
- Zomato: "Join customer data with order data to analyze spending patterns"
If you know SQL's GROUP BY and JOIN, you already understand 80% of what groupby() and merge() do. If not, don't worry—we'll build from the ground up.
GroupBy: The Split-Apply-Combine Pattern
groupby() follows a simple pattern:
- Split data into groups based on column values
- Apply a function to each group (sum, mean, count, etc.)
- Combine results into a new DataFrame
Think of groupby like sorting laundry. You split clothes into groups (whites, colors, delicates). Then you apply the same operation to each group (wash). Finally, you combine the clean clothes back together. groupby() does this with data instead of clothes.
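The laundry analogy maps directly onto code: a groupby object is iterable, so you can watch the split and apply steps happen by hand. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Toy sales data
sales = pd.DataFrame({
    'category': ['Phone', 'Laptop', 'Phone', 'Tablet'],
    'sales': [50000, 80000, 48000, 35000]
})

# Split: iterating over a groupby object yields (group_name, sub-DataFrame) pairs
for name, group in sales.groupby('category'):
    # Apply: the same function runs on every group
    print(name, group['sales'].sum())

# Combine: groupby() plus an aggregation does all three steps in one call
totals = sales.groupby('category')['sales'].sum()
```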
Basic Example:
import pandas as pd
# Flipkart sales data
sales = pd.DataFrame({
'product': ['iPhone', 'MacBook', 'iPhone', 'iPad', 'MacBook', 'iPad'],
'category': ['Phone', 'Laptop', 'Phone', 'Tablet', 'Laptop', 'Tablet'],
'sales': [50000, 80000, 48000, 35000, 75000, 32000]
})
# Calculate total sales per category
category_sales = sales.groupby('category')['sales'].sum()
print(category_sales)
Output:
category
Laptop 155000
Phone 98000
Tablet 67000
Name: sales, dtype: int64
Common GroupBy Aggregations
Single Aggregation:
# Total sales per category
df.groupby('category')['sales'].sum()
# Average price per brand
df.groupby('brand')['price'].mean()
# Count orders per customer
df.groupby('customer_id')['order_id'].count()
# Count unique products per category
df.groupby('category')['product_id'].nunique()
Multiple Aggregations with agg():
# Multiple metrics for sales
df.groupby('category')['sales'].agg(['sum', 'mean', 'count', 'min', 'max'])
# Different functions for different columns
df.groupby('category').agg({
'sales': 'sum',
'quantity': 'mean',
'customer_id': 'nunique'
})
Example - Swiggy Restaurant Analysis:
# Restaurant performance by city
restaurant_stats = df.groupby(['city', 'restaurant']).agg({
    'order_id': 'count',             # Total orders
    'order_value': ['mean', 'sum'],  # Avg and total revenue
    'delivery_time': 'mean',         # Avg delivery time
    'customer_id': 'nunique'         # Unique customers
})
# Passing a list of functions creates MultiIndex columns;
# flatten them into single-level names like 'order_value_mean'
restaurant_stats.columns = ['_'.join(col) for col in restaurant_stats.columns]
restaurant_stats = restaurant_stats.reset_index()
print(restaurant_stats.head())
Output:
city restaurant order_id_count order_value_mean order_value_sum delivery_time_mean customer_id_nunique
0 Bangalore Truffles 1500 350 525000 25 890
1 Bangalore Empire 2000 280 560000 30 1200
2 Delhi Karim's 1800 420 756000 35 1100
Pro Tip: Use reset_index() after groupby to convert the grouped columns from index back to regular columns. This makes the DataFrame easier to work with for further analysis or visualization.
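A quick self-contained illustration of the difference (with made-up numbers):

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Delhi', 'Delhi', 'Mumbai'],
    'sales': [100, 200, 300]
})

# Without reset_index(): the group keys become the result's index
grouped = df.groupby('city')['sales'].sum()
print(grouped.index.tolist())   # ['Delhi', 'Mumbai']

# With reset_index(): the group keys are ordinary columns again
flat = grouped.reset_index()
print(flat.columns.tolist())    # ['city', 'sales']
```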
GroupBy with Multiple Columns
Group by multiple columns to create hierarchical groups.
Example - Sales by City and Category:
# Group by city, then category within each city
city_category_sales = df.groupby(['city', 'category'])['sales'].sum()
print(city_category_sales)
Output:
city category
Bangalore Laptop 250000
Phone 180000
Tablet 95000
Delhi Laptop 320000
Phone 220000
Tablet 110000
Unstack for Better Visualization:
# Convert to pivot table format
pivot = city_category_sales.unstack(fill_value=0)
print(pivot)
Output:
category Laptop Phone Tablet
city
Bangalore 250000 180000 95000
Delhi 320000 220000 110000
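The groupby + unstack combination produces the same table as pandas' pivot_table, which does the grouping and reshaping in one call. A small sketch with made-up numbers showing the equivalence:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Bangalore', 'Bangalore', 'Delhi', 'Delhi'],
    'category': ['Laptop', 'Phone', 'Laptop', 'Phone'],
    'sales': [250000, 180000, 320000, 220000]
})

# groupby + unstack ...
pivot_a = df.groupby(['city', 'category'])['sales'].sum().unstack(fill_value=0)

# ... gives the same table as pivot_table in a single call
pivot_b = df.pivot_table(index='city', columns='category',
                         values='sales', aggfunc='sum', fill_value=0)
```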
Filtering Groups with filter()
Keep only groups that meet a condition.
Example - Zomato High-Performing Restaurants:
# Keep only restaurants with >100 orders
high_volume = df.groupby('restaurant').filter(lambda x: len(x) > 100)
# Or using a more complex condition
profitable = df.groupby('restaurant').filter(
lambda x: x['revenue'].sum() > 50000 and x['rating'].mean() >= 4.0
)
Merge: Joining DataFrames
merge() combines DataFrames based on common columns, just like SQL JOINs.
Types of Joins:
- inner: Keep only matching rows (intersection)
- left: Keep all left rows, match from right
- right: Keep all right rows, match from left
- outer: Keep all rows from both (union)
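The four join types are easiest to see in the row counts they produce. A minimal sketch with two toy DataFrames that share only some keys:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'a': ['x', 'y', 'z']})
right = pd.DataFrame({'id': [2, 3, 4], 'b': ['p', 'q', 'r']})

inner = left.merge(right, on='id', how='inner')      # ids 2, 3 → 2 rows
left_join = left.merge(right, on='id', how='left')   # ids 1, 2, 3 → 3 rows (b is NaN for id 1)
outer = left.merge(right, on='id', how='outer')      # ids 1, 2, 3, 4 → 4 rows
```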
Basic Syntax:
# Merge on single column
result = df1.merge(df2, on='user_id', how='inner')
# Merge on multiple columns
result = df1.merge(df2, on=['user_id', 'product_id'], how='left')
# Merge on different column names
result = df1.merge(df2, left_on='customer_id', right_on='user_id', how='inner')
Example: Flipkart Order + Customer Analysis
Scenario: You have separate DataFrames for orders and customers. Merge them to analyze spending by customer segment.
# Orders data
orders = pd.DataFrame({
'order_id': [1, 2, 3, 4, 5],
'customer_id': [101, 102, 101, 103, 102],
'order_value': [5000, 3000, 7000, 2000, 4500],
'order_date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19']
})
# Customer data
customers = pd.DataFrame({
'customer_id': [101, 102, 103, 104],
'customer_name': ['Rahul', 'Priya', 'Amit', 'Sneha'],
'city': ['Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
'segment': ['Premium', 'Regular', 'Premium', 'Regular']
})
# Merge to add customer details to orders
merged = orders.merge(customers, on='customer_id', how='left')
print(merged)
Output:
order_id customer_id order_value order_date customer_name city segment
0 1 101 5000 2024-01-15 Rahul Mumbai Premium
1 2 102 3000 2024-01-16 Priya Bangalore Regular
2 3 101 7000 2024-01-17 Rahul Mumbai Premium
3 4 103 2000 2024-01-18 Amit Delhi Premium
4 5 102 4500 2024-01-19 Priya Bangalore Regular
Now analyze by segment:
# Total spending by customer segment
segment_spending = merged.groupby('segment')['order_value'].agg(['sum', 'mean', 'count'])
print(segment_spending)
Output:
sum mean count
segment
Premium 14000 4666.67 3
Regular 7500 3750.00 2
Join Type Matters!
- how='left' keeps all orders even if customer data is missing (NaN for customer columns)
- how='inner' only keeps orders with matching customer records
- For most analytics, use how='left' to avoid accidentally dropping data
Indicator for Join Diagnostics
Use indicator=True to see which rows came from left, right, or both.
merged = orders.merge(customers, on='customer_id', how='outer', indicator=True)
print(merged['_merge'].value_counts())
Output:
both 5 # Matched in both DataFrames
left_only 0 # Only in orders
right_only 1 # Only in customers (customer 104 had no orders)
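Merge also accepts a validate parameter, which raises pandas.errors.MergeError when a key-uniqueness assumption is violated. This catches accidental duplicate keys before they silently multiply rows. A small sketch with made-up data:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [101, 102], 'city': ['Mumbai', 'Delhi']})
orders = pd.DataFrame({'customer_id': [101, 101, 102], 'order_value': [5000, 7000, 3000]})

# one_to_many: customer_id must be unique on the left side
merged = customers.merge(orders, on='customer_id', validate='one_to_many')

# Duplicate keys on the left now fail loudly instead of silently fanning out rows
dup_customers = pd.DataFrame({'customer_id': [101, 101], 'city': ['Mumbai', 'Pune']})
try:
    dup_customers.merge(orders, on='customer_id', validate='one_to_many')
except pd.errors.MergeError:
    print('left keys are not unique')
```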
Real-World Pattern: GroupBy + Merge
Scenario: Calculate each customer's total spending and merge it back with customer details.
# Step 1: Aggregate order data
customer_spending = orders.groupby('customer_id').agg({
'order_value': 'sum',
'order_id': 'count'
}).rename(columns={'order_value': 'total_spent', 'order_id': 'order_count'}).reset_index()
print(customer_spending)
Output:
customer_id total_spent order_count
0 101 12000 2
1 102 7500 2
2 103 2000 1
# Step 2: Merge with customer data
customer_profile = customers.merge(customer_spending, on='customer_id', how='left')
customer_profile['total_spent'] = customer_profile['total_spent'].fillna(0).astype(int)  # merge leaves NaN → float; cast back to int
customer_profile['order_count'] = customer_profile['order_count'].fillna(0).astype(int)
print(customer_profile)
Output:
customer_id customer_name city segment total_spent order_count
0 101 Rahul Mumbai Premium 12000 2
1 102 Priya Bangalore Regular 7500 2
2 103 Amit Delhi Premium 2000 1
3 104 Sneha Mumbai Regular 0 0
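The agg-then-rename step above can be shortened with named aggregation (available since pandas 0.25), which assigns final column names directly. A sketch using the same orders data:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'customer_id': [101, 102, 101, 103, 102],
    'order_value': [5000, 3000, 7000, 2000, 4500]
})

# Named aggregation: new_column=(source_column, function)
# No rename step and no MultiIndex columns
customer_spending = orders.groupby('customer_id').agg(
    total_spent=('order_value', 'sum'),
    order_count=('order_id', 'count')
).reset_index()
```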
Performance Tips
1. Use Categorical Data Types for Grouping Columns:
# Convert to category for faster groupby
df['category'] = df['category'].astype('category')
df['city'] = df['city'].astype('category')
# Now groupby is faster
df.groupby(['city', 'category'])['sales'].sum()
2. Avoid Chaining Multiple groupby() Calls:
# ❌ Slow: Multiple groupby operations
total_sales = df.groupby('category')['sales'].sum()
avg_price = df.groupby('category')['price'].mean()
# ✅ Fast: Single groupby with agg()
stats = df.groupby('category').agg({'sales': 'sum', 'price': 'mean'})
3. Use merge() Instead of Multiple Lookups:
# ❌ Slow: Loop and lookup per row
for idx, row in df.iterrows():
    customer_city = customers.loc[customers['customer_id'] == row['customer_id'], 'city'].values[0]
# ✅ Fast: Single merge operation
df = df.merge(customers[['customer_id', 'city']], on='customer_id', how='left')
Common Patterns
1. Top N per Group:
# Top 3 products by sales in each category
df.sort_values('sales', ascending=False).groupby('category').head(3)
2. Rank within Groups:
# Rank products by sales within each category
df['category_rank'] = df.groupby('category')['sales'].rank(ascending=False, method='dense')
3. Cumulative Sum per Group:
# Running total of sales per customer
df['cumulative_sales'] = df.groupby('customer_id')['sales'].cumsum()
4. Percentage of Total per Group:
# Each order as % of customer's total spending
df['pct_of_customer_total'] = df['order_value'] / df.groupby('customer_id')['order_value'].transform('sum') * 100
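Pattern 4 works because transform() returns a result aligned to the original rows, so each row is divided by its own group's total. A small self-contained check with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 101, 102],
    'order_value': [5000.0, 7000.0, 3000.0]
})

# transform('sum') broadcasts each group's total back to every row of that group
totals = df.groupby('customer_id')['order_value'].transform('sum')
df['pct_of_customer_total'] = df['order_value'] / totals * 100
```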