Setup & Dataset
Installation:
pip install pandas numpy matplotlib seabornSample Dataset:
import pandas as pd
import numpy as np
# Create sample sales data
data = {
'date': pd.date_range('2025-01-01', periods=100),
'product': np.random.choice(['A', 'B', 'C'], 100),
'category': np.random.choice(['Electronics', 'Fashion'], 100),
'quantity': np.random.randint(1, 20, 100),
'price': np.random.randint(100, 1000, 100),
'city': np.random.choice(['Mumbai', 'Delhi', 'Bangalore'], 100)
}
df = pd.DataFrame(data)
df['revenue'] = df['quantity'] * df['price']Beginner Problems (1-8)
Problem 1: Basic filtering
Question: Filter rows where revenue > 5000 and city is Mumbai.
Solution:
result = df[(df['revenue'] > 5000) & (df['city'] == 'Mumbai')]
print(result)Problem 2: Group by and aggregate
Question: Calculate total revenue and average quantity by product.
Solution:
result = df.groupby('product').agg({
'revenue': 'sum',
'quantity': 'mean'
}).round(2)
print(result)Problem 3: Handle missing values
Question: Replace missing values in 'quantity' with column median.
Solution:
df['quantity'].fillna(df['quantity'].median(), inplace=True)
# Alternative: drop rows with ANY null
df_clean = df.dropna()Problem 4: Create new column
Question: Create 'price_category' column: Low (<500), Medium (500-800), High (>800).
Solution:
df['price_category'] = pd.cut(
df['price'],
bins=[0, 500, 800, float('inf')],
labels=['Low', 'Medium', 'High']
)
# Alternative using np.where or apply
df['price_category'] = df['price'].apply(
lambda x: 'Low' if x < 500 else ('Medium' if x < 800 else 'High')
)⚠️ CheckpointQuiz error: Missing or invalid options array
Intermediate Problems (9-15)
Problem 9: Pivot table
Question: Create pivot table showing total revenue by product (rows) and city (columns).
Solution:
pivot = df.pivot_table(
values='revenue',
index='product',
columns='city',
aggfunc='sum',
fill_value=0
)
print(pivot)Problem 10: Merge datasets
Question: Merge sales data with product info on product ID.
Solution:
# Sample product info
product_info = pd.DataFrame({
'product': ['A', 'B', 'C'],
'product_name': ['Widget', 'Gadget', 'Tool'],
'cost': [50, 75, 100]
})
merged = df.merge(
product_info,
on='product',
how='left' # Keep all sales records
)Problem 11: Time series resampling
Question: Calculate monthly total revenue from daily data.
Solution:
df_monthly = df.set_index('date').resample('M')['revenue'].sum()
print(df_monthly)
# With additional stats
monthly_stats = df.set_index('date').resample('M').agg({
'revenue': 'sum',
'quantity': 'mean',
'product': 'count'
})Problem 12: Running total
Question: Calculate cumulative revenue over time.
Solution:
df_sorted = df.sort_values('date')
df_sorted['cumulative_revenue'] = df_sorted['revenue'].cumsum()Problem 13: Remove duplicates
Question: Remove duplicate rows based on date + product combination.
Solution:
df_unique = df.drop_duplicates(subset=['date', 'product'], keep='first')
print(f"Removed {len(df) - len(df_unique)} duplicates")Advanced Problems (16-20)
Problem 16: Complex aggregation
Question: For each city, find top 3 products by revenue.
Solution:
top_products = (
df.groupby(['city', 'product'])['revenue']
.sum()
.reset_index()
.sort_values(['city', 'revenue'], ascending=[True, False])
.groupby('city')
.head(3)
)
print(top_products)Problem 17: Apply custom function
Question: Calculate profit margin (assuming 30% cost).
Solution:
def calculate_profit_margin(row):
cost = row['revenue'] * 0.30
profit = row['revenue'] - cost
return (profit / row['revenue']) * 100
df['profit_margin'] = df.apply(calculate_profit_margin, axis=1)
# Vectorized (faster) alternative
df['profit_margin'] = ((df['revenue'] - df['revenue'] * 0.30) / df['revenue']) * 100Problem 18: Handle outliers
Question: Cap revenue at 99th percentile to remove outliers.
Solution:
upper_limit = df['revenue'].quantile(0.99)
df['revenue_capped'] = df['revenue'].clip(upper=upper_limit)
# Or remove outliers using IQR
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1
df_no_outliers = df[
(df['revenue'] >= Q1 - 1.5 * IQR) &
(df['revenue'] <= Q3 + 1.5 * IQR)
]Problem 19: Percentage change
Question: Calculate month-over-month revenue growth rate.
Solution:
monthly_rev = df.set_index('date').resample('M')['revenue'].sum()
monthly_rev_growth = monthly_rev.pct_change() * 100
print(f"Month-over-month growth:\n{monthly_rev_growth}")Problem 20: Correlation analysis
Question: Find correlation between price and quantity sold.
Solution:
correlation = df['price'].corr(df['quantity'])
print(f"Correlation: {correlation:.3f}")
# Correlation matrix for multiple columns
corr_matrix = df[['price', 'quantity', 'revenue']].corr()
print(corr_matrix)
# Visualize
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()Python Practice Tips
Common pandas Operations:
| Task | Method | Example |
|------|--------|---------|
| Filter rows | Boolean indexing | df[df['revenue'] > 1000] |
| Select columns | Double brackets | df[['product', 'revenue']] |
| Group & aggregate | groupby + agg | df.groupby('city').agg({'revenue': 'sum'}) |
| Sort | sort_values | df.sort_values('revenue', ascending=False) |
| Handle nulls | fillna, dropna | df['quantity'].fillna(0) |
| Merge tables | merge | df1.merge(df2, on='id', how='left') |
| Pivot | pivot_table | df.pivot_table(values='revenue', index='product') |
| Apply function | apply | df['new_col'] = df.apply(func, axis=1) |
Performance Tips:
✅ Vectorize instead of loops:
# Slow (loop)
for i, row in df.iterrows():
df.at[i, 'total'] = row['price'] * row['quantity']
# Fast (vectorized)
df['total'] = df['price'] * df['quantity']✅ Use query() for readability:
# Traditional
df[(df['revenue'] > 5000) & (df['city'] == 'Mumbai')]
# Query (cleaner for complex conditions)
df.query('revenue > 5000 and city == "Mumbai"')✅ Chain operations:
result = (
df
.query('revenue > 1000')
.groupby('product')['revenue']
.sum()
.sort_values(ascending=False)
.head(5)
)⚠️ FinalQuiz error: Missing or invalid questions array
⚠️ SummarySection error: Missing or invalid items array
Received: {"hasItems":false,"isArray":false}