Project Overview
What You'll Build:
A complete e-commerce analytics project analyzing 500K+ transactions from an online retail store, answering business questions like:
- Customer Segmentation: Who are our best customers? (RFM analysis)
- Cohort Retention: How many customers return each month?
- Sales Trends: What drives revenue growth?
- Product Analysis: Which products are frequently bought together?
- Geographic Insights: Which countries generate most revenue?
Skills Demonstrated:
- ✅ Data cleaning (handling nulls, duplicates, outliers)
- ✅ SQL queries (aggregations, JOINs, window functions)
- ✅ Python (pandas, matplotlib, seaborn)
- ✅ RFM analysis & customer segmentation
- ✅ Cohort retention analysis
- ✅ Power BI dashboard (interactive visualizations)
Dataset:
Online Retail Dataset (Kaggle)
- Size: 541,909 transactions
- Period: Dec 2010 - Dec 2011 (12 months)
- Features: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country
- Download: Kaggle Link
Project Structure:
ecommerce-analysis/
├── data/
│ ├── online_retail.csv
│ └── cleaned_data.csv
├── notebooks/
│ ├── 01_data_cleaning.ipynb
│ ├── 02_exploratory_analysis.ipynb
│ ├── 03_rfm_segmentation.ipynb
│ └── 04_cohort_analysis.ipynb
├── sql/
│ └── queries.sql
├── dashboard/
│ └── sales_dashboard.pbix
└── README.md
Step 1: Data Cleaning
Load and Inspect Data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
# Load dataset
df = pd.read_csv('online_retail.csv', encoding='latin1')
# Basic inspection
print(f"Dataset shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nFirst few rows:\n{df.head()}")
Output:
Dataset shape: (541909, 8)
Missing values:
InvoiceNo 0
StockCode 0
Description 1454
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 135080 ← 25% missing!
Country 0
Data Cleaning Steps:
# 1. Remove rows with missing CustomerID (can't do customer analysis without it)
print(f"Before: {len(df)} rows")
df = df[df['CustomerID'].notna()]
print(f"After removing nulls: {len(df)} rows")
# 2. Remove canceled orders (InvoiceNo starts with 'C')
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
print(f"After removing cancellations: {len(df)} rows")
# 3. Remove invalid quantities and prices
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]
print(f"After removing invalid values: {len(df)} rows")
# 4. Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
# 5. Create calculated columns
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month
df['YearMonth'] = df['InvoiceDate'].dt.to_period('M')
# 6. Remove outliers (orders > 99th percentile)
upper_limit = df['TotalPrice'].quantile(0.99)
df = df[df['TotalPrice'] <= upper_limit]
print(f"\nFinal dataset: {len(df)} rows")
print(f"Date range: {df['InvoiceDate'].min()} to {df['InvoiceDate'].max()}")
print(f"Unique customers: {df['CustomerID'].nunique()}")
print(f"Unique products: {df['StockCode'].nunique()}")
Save Cleaned Data:
df.to_csv('cleaned_data.csv', index=False)
print("Cleaned data saved!")
Step 2: RFM Analysis & Customer Segmentation
What is RFM?
RFM stands for:
- Recency: Days since last purchase
- Frequency: Number of purchases
- Monetary: Total amount spent
Calculate RFM Metrics:
# Set analysis date (day after last transaction)
analysis_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)
# Calculate RFM for each customer
rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (analysis_date - x.max()).days,  # Recency
    'InvoiceNo': 'nunique',                                   # Frequency
    'TotalPrice': 'sum'                                       # Monetary
}).reset_index()
rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']
print(rfm.head())
print(f"\nRFM Summary:\n{rfm.describe()}")
Create RFM Scores (1-5 scale):
# Score each metric (5 = best, 1 = worst)
# Recency: lower is better (recent purchase = 5)
# Frequency & Monetary: higher is better
rfm['R_Score'] = pd.qcut(rfm['Recency'], q=5, labels=[5,4,3,2,1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), q=5, labels=[1,2,3,4,5])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], q=5, labels=[1,2,3,4,5])
# Combine into RFM Score
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)
# Simplified segment (high/medium/low)
rfm['RFM_Total'] = rfm[['R_Score', 'F_Score', 'M_Score']].astype(int).sum(axis=1)
def segment_customer(score):
    if score >= 12:
        return 'Champions'
    elif score >= 9:
        return 'Loyal'
    elif score >= 6:
        return 'Potential'
    else:
        return 'At Risk'
rfm['Segment'] = rfm['RFM_Total'].apply(segment_customer)
print(rfm['Segment'].value_counts())
print(f"\nSegment Revenue:\n{rfm.groupby('Segment')['Monetary'].agg(['count', 'sum', 'mean'])}")
Visualize RFM Segments:
# Segment distribution
plt.figure(figsize=(10, 6))
segment_counts = rfm['Segment'].value_counts()
plt.bar(segment_counts.index, segment_counts.values, color=['#2ecc71', '#3498db', '#f39c12', '#e74c3c'])
plt.title('Customer Segmentation by RFM', fontsize=14, fontweight='bold')
plt.xlabel('Segment')
plt.ylabel('Number of Customers')
plt.xticks(rotation=45)
for i, v in enumerate(segment_counts.values):
    plt.text(i, v + 50, str(v), ha='center', fontweight='bold')
plt.tight_layout()
plt.savefig('rfm_segments.png', dpi=300)
plt.show()
# Revenue by segment
segment_revenue = rfm.groupby('Segment')['Monetary'].sum().sort_values(ascending=False)
plt.figure(figsize=(10, 6))
plt.bar(segment_revenue.index, segment_revenue.values, color=['#2ecc71', '#3498db', '#f39c12', '#e74c3c'])
plt.title('Total Revenue by Customer Segment', fontsize=14, fontweight='bold')
plt.xlabel('Segment')
plt.ylabel('Revenue (£)')
plt.xticks(rotation=45)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.savefig('segment_revenue.png', dpi=300)
plt.show()
Insight Example:
"Champions (top 15% of customers) generate 45% of total revenue despite being only 627 customers. Focus retention efforts here."
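The revenue-share figure behind an insight like this comes straight from the RFM table. A minimal sketch (the segment totals below are made up for illustration; in the real project, use the `rfm` frame built above):

```python
import pandas as pd

# Toy segment totals standing in for the real rfm frame
rfm = pd.DataFrame({
    'Segment':  ['Champions', 'Loyal', 'Potential', 'At Risk'],
    'Monetary': [450000.0, 300000.0, 180000.0, 70000.0],
})

# Each segment's share of total revenue, in percent
share = rfm.groupby('Segment')['Monetary'].sum()
share = (share / share.sum() * 100).round(1).sort_values(ascending=False)
print(share)
```

With the real data, the top line of this output is the percentage you quote in the insight.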
Step 3: Cohort Retention Analysis
What is Cohort Analysis?
Track how customers from each signup month behave over time. Shows retention patterns.
Create Cohorts:
# Identify first purchase date for each customer
df['CohortMonth'] = df.groupby('CustomerID')['InvoiceDate'].transform('min').dt.to_period('M')
# Calculate months since first purchase
df['CohortIndex'] = (df['YearMonth'] - df['CohortMonth']).apply(lambda x: x.n)
# Create cohort table
cohort_data = df.groupby(['CohortMonth', 'CohortIndex'])['CustomerID'].nunique().reset_index()
cohort_pivot = cohort_data.pivot(index='CohortMonth', columns='CohortIndex', values='CustomerID')
print("Cohort Table (Customer Counts):")
print(cohort_pivot)
Calculate Retention Rates:
# Retention rate = customers in month N / customers in month 0
cohort_size = cohort_pivot.iloc[:, 0]
retention = cohort_pivot.divide(cohort_size, axis=0) * 100
print("\nRetention Rates (%):")
print(retention.round(1))
Visualize Cohort Heatmap:
plt.figure(figsize=(12, 8))
sns.heatmap(retention, annot=True, fmt='.0f', cmap='RdYlGn', vmin=0, vmax=100, cbar_kws={'label': 'Retention %'})
plt.title('Cohort Retention Heatmap', fontsize=16, fontweight='bold')
plt.xlabel('Months Since First Purchase')
plt.ylabel('Cohort Month')
plt.tight_layout()
plt.savefig('cohort_retention.png', dpi=300)
plt.show()
Insight Example:
"Month 0 → Month 1 retention averages 35%. By Month 3, only 20% of customers return. Implement win-back campaign at 45-day mark."
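The averages quoted here come from collapsing the retention matrix across cohorts. A sketch with a toy matrix standing in for the `retention` pivot built above:

```python
import pandas as pd

# Toy retention matrix (% of each cohort still active N months after
# first purchase); in the real project this is the `retention` pivot
retention = pd.DataFrame(
    {0: [100.0, 100.0], 1: [36.0, 34.0], 2: [25.0, 27.0], 3: [21.0, 19.0]},
    index=['2010-12', '2011-01'],
)

# Average retention per month-since-first-purchase, across all cohorts
avg_retention = retention.mean(axis=0).round(1)
print(avg_retention)
```

Reading off `avg_retention[1]` and `avg_retention[3]` gives the Month 1 and Month 3 figures for the insight.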
Step 4: SQL Analysis (Alternative Approach)
If you have the data in a SQL database, here are equivalent queries (written in PostgreSQL syntax; adapt functions like DATE_TRUNC for other engines):
Top Customers by Revenue:
-- Top 20 customers by total spend
SELECT
    CustomerID,
    COUNT(DISTINCT InvoiceNo) AS total_orders,
    SUM(Quantity) AS total_items,
    SUM(Quantity * UnitPrice) AS total_revenue
FROM online_retail
WHERE CustomerID IS NOT NULL
GROUP BY CustomerID
ORDER BY total_revenue DESC
LIMIT 20;

Monthly Sales Trend:
-- Revenue and orders by month
SELECT
    DATE_TRUNC('month', InvoiceDate) AS month,
    COUNT(DISTINCT InvoiceNo) AS orders,
    COUNT(DISTINCT CustomerID) AS unique_customers,
    SUM(Quantity * UnitPrice) AS revenue,
    SUM(Quantity * UnitPrice) / COUNT(DISTINCT InvoiceNo) AS avg_order_value
FROM online_retail
WHERE CustomerID IS NOT NULL
  AND Quantity > 0
  AND UnitPrice > 0
GROUP BY month
ORDER BY month;

Product Affinity (Market Basket):
-- Products frequently bought together
WITH product_pairs AS (
    SELECT
        a.InvoiceNo,
        a.Description AS product_a,
        b.Description AS product_b
    FROM online_retail a
    JOIN online_retail b
      ON a.InvoiceNo = b.InvoiceNo
     AND a.StockCode < b.StockCode
    WHERE a.CustomerID IS NOT NULL
)
SELECT
    product_a,
    product_b,
    COUNT(*) AS pair_count
FROM product_pairs
GROUP BY product_a, product_b
HAVING COUNT(*) > 50
ORDER BY pair_count DESC
LIMIT 20;

Step 5: Power BI Dashboard
Dashboard Components:
1. KPI Cards (top row):
   - Total Revenue: £9.8M
   - Total Orders: 25,900
   - Unique Customers: 4,372
   - Avg Order Value: £378
2. Sales Trend (line chart):
   - X-axis: Month
   - Y-axis: Revenue
   - Show trend line
3. Top Countries (bar chart):
   - UK: £8.2M (84%)
   - Germany: £0.9M (9%)
   - France: £0.4M (4%)
4. Customer Segments (pie chart):
   - Champions: 45% revenue
   - Loyal: 30%
   - Potential: 18%
   - At Risk: 7%
5. Cohort Retention (matrix visual):
   - Import retention table from Python
   - Conditional formatting (green → red)
6. Filters (slicers):
   - Date range
   - Country
   - Customer segment
Power BI Steps:
- Import Data: Get Data → Text/CSV → cleaned_data.csv
- Create Measures:
Total Revenue = SUM(online_retail[TotalPrice])
Total Orders = DISTINCTCOUNT(online_retail[InvoiceNo])
Unique Customers = DISTINCTCOUNT(online_retail[CustomerID])
Avg Order Value = [Total Revenue] / [Total Orders]
- Build Visuals: Drag fields to canvas
- Format: Apply theme, adjust colors
- Publish: Save as .pbix and publish to the Power BI Service, or share screenshots in your portfolio
Extension Challenges
Take this project further:
1. Advanced Analytics:
- Predict Customer Churn: Build logistic regression model (scikit-learn) to predict which customers won't return
- Product Recommendation: Implement collaborative filtering for "customers who bought X also bought Y"
- Lifetime Value (LTV): Calculate expected revenue per customer based on cohort data
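For the churn model, the first step is defining labels: the dataset has no explicit churn flag, so a common heuristic is to treat customers whose Recency exceeds some cutoff as churned. A sketch (the 90-day cutoff and the sample rows are assumptions, not from the dataset):

```python
import pandas as pd

# Toy RFM table; in the real project this is the rfm frame from Step 2
rfm = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Recency':    [5, 120, 30, 200],   # days since last purchase
    'Frequency':  [10, 1, 4, 2],
    'Monetary':   [500.0, 20.0, 150.0, 60.0],
})

CHURN_CUTOFF_DAYS = 90  # hypothetical threshold; tune against your data

# Label: 1 = churned (no purchase within the cutoff window), 0 = active
rfm['Churned'] = (rfm['Recency'] > CHURN_CUTOFF_DAYS).astype(int)
print(rfm[['CustomerID', 'Churned']])
```

The Frequency and Monetary columns then become features for a scikit-learn `LogisticRegression` fit on these labels.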
2. Additional Analyses:
- Seasonality: Does revenue spike during holidays? (Christmas in UK)
- Day-of-week patterns: Which days have highest sales?
- Product categories: Create product taxonomy, analyze category trends
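The day-of-week question above reduces to a single groupby on the cleaned data. A sketch, with made-up rows standing in for the real `df`:

```python
import pandas as pd

# Toy transactions; in the real project use the cleaned df from Step 1
df = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2011-03-07', '2011-03-08', '2011-03-14']),
    'TotalPrice':  [100.0, 40.0, 60.0],
})

# Revenue by weekday name, highest first
df['DayOfWeek'] = df['InvoiceDate'].dt.day_name()
daily = df.groupby('DayOfWeek')['TotalPrice'].sum().sort_values(ascending=False)
print(daily)
```

The same pattern with `dt.month` answers the seasonality question.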
3. Data Engineering:
- Automate ETL: Create Python script that cleans data automatically
- Schedule dashboard refresh: Set up daily data refresh in Power BI Service
- Database setup: Load data into PostgreSQL, practice SQL queries
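For the ETL automation idea, the Step 1 cleaning rules can be wrapped in one reusable function (a sketch; the column names match the dataset, but the sample rows are invented):

```python
import pandas as pd

def clean_retail(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the Step 1 cleaning rules to a raw online_retail frame."""
    df = df[df['CustomerID'].notna()]                              # need a customer
    df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]      # drop cancellations
    df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()   # drop invalid rows
    df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
    df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
    return df

# Tiny smoke test with made-up rows
raw = pd.DataFrame({
    'InvoiceNo':   ['536365', 'C536366', '536367'],
    'Quantity':    [6, 2, -1],
    'InvoiceDate': ['2010-12-01 08:26', '2010-12-01 08:28', '2010-12-01 08:34'],
    'UnitPrice':   [2.55, 1.85, 1.69],
    'CustomerID':  [17850.0, 17850.0, 13047.0],
})
clean = clean_retail(raw)
print(len(clean))  # only the first row survives
```

Dropping this into a scheduled script (cron, Task Scheduler) gives you the automated ETL step; the same function can feed the PostgreSQL load.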
4. Presentation:
- Write blog post: "5 Insights from Analyzing 500K E-commerce Transactions"
- Create video walkthrough: 5-minute YouTube demo of your analysis
- LinkedIn post: Share 3 key insights with visualizations