Introduction
Pandas groupby and merge are the building blocks of data analysis in Python. They're how you answer questions like:
- Flipkart: "What's the average order value per category?"
- Swiggy: "How many unique customers ordered from each restaurant?"
- Zomato: "Join customer data with order data to analyze spending patterns"
If you know SQL's GROUP BY and JOIN, you already understand 80% of what groupby() and merge() do. If not, don't worry—we'll build from the ground up.
GroupBy: The Split-Apply-Combine Pattern
groupby() follows a simple pattern:
- Split data into groups based on column values
- Apply a function to each group (sum, mean, count, etc.)
- Combine results into a new DataFrame
Think of groupby like sorting laundry. You split clothes into groups (whites, colors, delicates). Then you apply the same operation to each group (wash). Finally, you combine the clean clothes back together. groupby() does this with data instead of clothes.
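The laundry analogy maps directly onto code: a groupby object is iterable, so you can watch the split and apply steps happen by hand. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Toy sales data
sales = pd.DataFrame({
    'category': ['Phone', 'Laptop', 'Phone', 'Tablet'],
    'sales': [50000, 80000, 48000, 35000]
})

# Split: iterating over a groupby object yields (group_name, sub-DataFrame) pairs
for name, group in sales.groupby('category'):
    # Apply: the same function runs on every group
    print(name, group['sales'].sum())

# Combine: groupby() plus an aggregation does all three steps in one call
totals = sales.groupby('category')['sales'].sum()
```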
Basic Example:
import pandas as pd
# Flipkart sales data
sales = pd.DataFrame({
'product': ['iPhone', 'MacBook', 'iPhone', 'iPad', 'MacBook', 'iPad'],
'category': ['Phone', 'Laptop', 'Phone', 'Tablet', 'Laptop', 'Tablet'],
'sales': [50000, 80000, 48000, 35000, 75000, 32000]
})
# Calculate total sales per category
category_sales = sales.groupby('category')['sales'].sum()
print(category_sales)
Output:
category
Laptop 155000
Phone 98000
Tablet 67000
Name: sales, dtype: int64
Common GroupBy Aggregations
Single Aggregation:
# Total sales per category
df.groupby('category')['sales'].sum()
# Average price per brand
df.groupby('brand')['price'].mean()
# Count orders per customer
df.groupby('customer_id')['order_id'].count()
# Count unique products per category
df.groupby('category')['product_id'].nunique()
Multiple Aggregations with agg():
# Multiple metrics for sales
df.groupby('category')['sales'].agg(['sum', 'mean', 'count', 'min', 'max'])
# Different functions for different columns
df.groupby('category').agg({
'sales': 'sum',
'quantity': 'mean',
'customer_id': 'nunique'
})
Example - Swiggy Restaurant Analysis:
# Restaurant performance by city
restaurant_stats = df.groupby(['city', 'restaurant']).agg({
    'order_id': 'count',             # Total orders
    'order_value': ['mean', 'sum'],  # Avg and total revenue
    'delivery_time': 'mean',         # Avg delivery time
    'customer_id': 'nunique'         # Unique customers
})
# Passing a list of functions creates MultiIndex columns;
# flatten them into single-level names like 'order_value_mean'
restaurant_stats.columns = ['_'.join(col) for col in restaurant_stats.columns]
restaurant_stats = restaurant_stats.reset_index()
print(restaurant_stats.head())
Output:
city restaurant order_id_count order_value_mean order_value_sum delivery_time_mean customer_id_nunique
0 Bangalore Truffles 1500 350 525000 25 890
1 Bangalore Empire 2000 280 560000 30 1200
2 Delhi Karim's 1800 420 756000 35 1100
Pro Tip: Use reset_index() after groupby to convert the grouped columns from index back to regular columns. This makes the DataFrame easier to work with for further analysis or visualization.
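A quick self-contained illustration of the difference (with made-up numbers):

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Delhi', 'Delhi', 'Mumbai'],
    'sales': [100, 200, 300]
})

# Without reset_index(): the group keys become the result's index
grouped = df.groupby('city')['sales'].sum()
print(grouped.index.tolist())   # ['Delhi', 'Mumbai']

# With reset_index(): the group keys are ordinary columns again
flat = grouped.reset_index()
print(flat.columns.tolist())    # ['city', 'sales']
```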
GroupBy with Multiple Columns
Group by multiple columns to create hierarchical groups.
Example - Sales by City and Category:
# Group by city, then category within each city
city_category_sales = df.groupby(['city', 'category'])['sales'].sum()
print(city_category_sales)
Output:
city category
Bangalore Laptop 250000
Phone 180000
Tablet 95000
Delhi Laptop 320000
Phone 220000
Tablet 110000
Unstack for Better Visualization:
# Convert to pivot table format
pivot = city_category_sales.unstack(fill_value=0)
print(pivot)
Output:
category Laptop Phone Tablet
city
Bangalore 250000 180000 95000
Delhi 320000 220000 110000
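The groupby + unstack combination produces the same table as pandas' pivot_table, which does the grouping and reshaping in one call. A small sketch with made-up numbers showing the equivalence:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Bangalore', 'Bangalore', 'Delhi', 'Delhi'],
    'category': ['Laptop', 'Phone', 'Laptop', 'Phone'],
    'sales': [250000, 180000, 320000, 220000]
})

# groupby + unstack ...
pivot_a = df.groupby(['city', 'category'])['sales'].sum().unstack(fill_value=0)

# ... gives the same table as pivot_table in a single call
pivot_b = df.pivot_table(index='city', columns='category',
                         values='sales', aggfunc='sum', fill_value=0)
```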
Filtering Groups with filter()
Keep only groups that meet a condition.
Example - Zomato High-Performing Restaurants:
# Keep only restaurants with >100 orders
high_volume = df.groupby('restaurant').filter(lambda x: len(x) > 100)
# Or using a more complex condition
profitable = df.groupby('restaurant').filter(
lambda x: x['revenue'].sum() > 50000 and x['rating'].mean() >= 4.0
)
Merge: Joining DataFrames
merge() combines DataFrames based on common columns, just like SQL JOINs.
Types of Joins:
- inner: Keep only matching rows (intersection)
- left: Keep all left rows, match from right
- right: Keep all right rows, match from left
- outer: Keep all rows from both (union)
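The four join types are easiest to see in the row counts they produce. A minimal sketch with two toy DataFrames that share only some keys:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'a': ['x', 'y', 'z']})
right = pd.DataFrame({'id': [2, 3, 4], 'b': ['p', 'q', 'r']})

inner = left.merge(right, on='id', how='inner')      # ids 2, 3 → 2 rows
left_join = left.merge(right, on='id', how='left')   # ids 1, 2, 3 → 3 rows (b is NaN for id 1)
outer = left.merge(right, on='id', how='outer')      # ids 1, 2, 3, 4 → 4 rows
```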
Basic Syntax:
# Merge on single column
result = df1.merge(df2, on='user_id', how='inner')
# Merge on multiple columns
result = df1.merge(df2, on=['user_id', 'product_id'], how='left')
# Merge on different column names
result = df1.merge(df2, left_on='customer_id', right_on='user_id', how='inner')
Example: Flipkart Order + Customer Analysis
Scenario: You have separate DataFrames for orders and customers. Merge them to analyze spending by customer segment.
# Orders data
orders = pd.DataFrame({
'order_id': [1, 2, 3, 4, 5],
'customer_id': [101, 102, 101, 103, 102],
'order_value': [5000, 3000, 7000, 2000, 4500],
'order_date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19']
})
# Customer data
customers = pd.DataFrame({
'customer_id': [101, 102, 103, 104],
'customer_name': ['Rahul', 'Priya', 'Amit', 'Sneha'],
'city': ['Mumbai', 'Bangalore', 'Delhi', 'Mumbai'],
'segment': ['Premium', 'Regular', 'Premium', 'Regular']
})
# Merge to add customer details to orders
merged = orders.merge(customers, on='customer_id', how='left')
print(merged)
Output:
order_id customer_id order_value order_date customer_name city segment
0 1 101 5000 2024-01-15 Rahul Mumbai Premium
1 2 102 3000 2024-01-16 Priya Bangalore Regular
2 3 101 7000 2024-01-17 Rahul Mumbai Premium
3 4 103 2000 2024-01-18 Amit Delhi Premium
4 5 102 4500 2024-01-19 Priya Bangalore Regular
Now analyze by segment:
# Total spending by customer segment
segment_spending = merged.groupby('segment')['order_value'].agg(['sum', 'mean', 'count'])
print(segment_spending)
Output:
sum mean count
segment
Premium 14000 4666.67 3
Regular 7500 3750.00 2
Join Type Matters!
- how='left' keeps all orders even if customer data is missing (NaN for customer columns)
- how='inner' only keeps orders with matching customer records
- For most analytics, use how='left' to avoid accidentally dropping data
Indicator for Join Diagnostics
Use indicator=True to see which rows came from left, right, or both.
merged = orders.merge(customers, on='customer_id', how='outer', indicator=True)
print(merged['_merge'].value_counts())
Output:
both 5 # Matched in both DataFrames
left_only 0 # Only in orders
right_only 1 # Only in customers (customer 104 had no orders)
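Merge also accepts a validate parameter, which raises pandas.errors.MergeError when a key-uniqueness assumption is violated. This catches accidental duplicate keys before they silently multiply rows. A small sketch with made-up data:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [101, 102], 'city': ['Mumbai', 'Delhi']})
orders = pd.DataFrame({'customer_id': [101, 101, 102], 'order_value': [5000, 7000, 3000]})

# one_to_many: customer_id must be unique on the left side
merged = customers.merge(orders, on='customer_id', validate='one_to_many')

# Duplicate keys on the left now fail loudly instead of silently fanning out rows
dup_customers = pd.DataFrame({'customer_id': [101, 101], 'city': ['Mumbai', 'Pune']})
try:
    dup_customers.merge(orders, on='customer_id', validate='one_to_many')
except pd.errors.MergeError:
    print('left keys are not unique')
```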
Real-World Pattern: GroupBy + Merge
Scenario: Calculate each customer's total spending and merge it back with customer details.
# Step 1: Aggregate order data
customer_spending = orders.groupby('customer_id').agg({
'order_value': 'sum',
'order_id': 'count'
}).rename(columns={'order_value': 'total_spent', 'order_id': 'order_count'}).reset_index()
print(customer_spending)
Output:
customer_id total_spent order_count
0 101 12000 2
1 102 7500 2
2 103 2000 1
# Step 2: Merge with customer data
customer_profile = customers.merge(customer_spending, on='customer_id', how='left')
customer_profile['total_spent'] = customer_profile['total_spent'].fillna(0).astype(int)  # merge leaves NaN → float; cast back to int
customer_profile['order_count'] = customer_profile['order_count'].fillna(0).astype(int)
print(customer_profile)
Output:
customer_id customer_name city segment total_spent order_count
0 101 Rahul Mumbai Premium 12000 2
1 102 Priya Bangalore Regular 7500 2
2 103 Amit Delhi Premium 2000 1
3 104 Sneha Mumbai Regular 0 0
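The agg-then-rename step above can be shortened with named aggregation (available since pandas 0.25), which assigns final column names directly. A sketch using the same orders data:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'customer_id': [101, 102, 101, 103, 102],
    'order_value': [5000, 3000, 7000, 2000, 4500]
})

# Named aggregation: new_column=(source_column, function)
# No rename step and no MultiIndex columns
customer_spending = orders.groupby('customer_id').agg(
    total_spent=('order_value', 'sum'),
    order_count=('order_id', 'count')
).reset_index()
```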
Performance Tips
1. Use Categorical Data Types for Grouping Columns:
# Convert to category for faster groupby
df['category'] = df['category'].astype('category')
df['city'] = df['city'].astype('category')
# Now groupby is faster
df.groupby(['city', 'category'])['sales'].sum()
2. Avoid Chaining Multiple groupby() Calls:
# ❌ Slow: Multiple groupby operations
total_sales = df.groupby('category')['sales'].sum()
avg_price = df.groupby('category')['price'].mean()
# ✅ Fast: Single groupby with agg()
stats = df.groupby('category').agg({'sales': 'sum', 'price': 'mean'})
3. Use merge() Instead of Multiple Lookups:
# ❌ Slow: Loop and lookup per row
for idx, row in df.iterrows():
    customer_city = customers.loc[customers['customer_id'] == row['customer_id'], 'city'].values[0]
# ✅ Fast: Single merge operation
df = df.merge(customers[['customer_id', 'city']], on='customer_id', how='left')
Common Patterns
1. Top N per Group:
# Top 3 products by sales in each category
df.sort_values('sales', ascending=False).groupby('category').head(3)
2. Rank within Groups:
# Rank products by sales within each category
df['category_rank'] = df.groupby('category')['sales'].rank(ascending=False, method='dense')
3. Cumulative Sum per Group:
# Running total of sales per customer
df['cumulative_sales'] = df.groupby('customer_id')['sales'].cumsum()
4. Percentage of Total per Group:
# Each order as % of customer's total spending
df['pct_of_customer_total'] = df['order_value'] / df.groupby('customer_id')['order_value'].transform('sum') * 100
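Pattern 4 works because transform() returns a result aligned to the original rows, so each row is divided by its own group's total. A small self-contained check with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 101, 102],
    'order_value': [5000.0, 7000.0, 3000.0]
})

# transform('sum') broadcasts each group's total back to every row of that group
totals = df.groupby('customer_id')['order_value'].transform('sum')
df['pct_of_customer_total'] = df['order_value'] / totals * 100
```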