Project Overview
What You'll Build:
A complete e-commerce analytics project analyzing 500K+ transactions from an online retail store, answering business questions like:
- Customer Segmentation: Who are our best customers? (RFM analysis)
- Cohort Retention: How many customers return each month?
- Sales Trends: What drives revenue growth?
- Product Analysis: Which products are frequently bought together?
- Geographic Insights: Which countries generate most revenue?
Skills Demonstrated:
- ✅ Data cleaning (handling nulls, duplicates, outliers)
- ✅ SQL queries (aggregations, JOINs, window functions)
- ✅ Python (pandas, matplotlib, seaborn)
- ✅ RFM analysis & customer segmentation
- ✅ Cohort retention analysis
- ✅ Power BI dashboard (interactive visualizations)
Dataset:
Online Retail Dataset (Kaggle)
- Size: 541,909 transactions
- Period: Dec 2010 - Dec 2011 (12 months)
- Features: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country
- Download: Kaggle Link
Project Structure:
ecommerce-analysis/
├── data/
│ ├── online_retail.csv
│ └── cleaned_data.csv
├── notebooks/
│ ├── 01_data_cleaning.ipynb
│ ├── 02_exploratory_analysis.ipynb
│ ├── 03_rfm_segmentation.ipynb
│ └── 04_cohort_analysis.ipynb
├── sql/
│ └── queries.sql
├── dashboard/
│ └── sales_dashboard.pbix
└── README.md
Step 1: Data Cleaning
Load and Inspect Data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
# Load dataset
df = pd.read_csv('online_retail.csv', encoding='latin1')
# Basic inspection
print(f"Dataset shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nFirst few rows:\n{df.head()}")
Output:
Dataset shape: (541909, 8)
Missing values:
InvoiceNo 0
StockCode 0
Description 1454
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 135080 ← 25% missing!
Country 0
Data Cleaning Steps:
# 1. Remove rows with missing CustomerID (can't do customer analysis without it)
print(f"Before: {len(df)} rows")
df = df[df['CustomerID'].notna()]
print(f"After removing nulls: {len(df)} rows")
# 2. Remove canceled orders (InvoiceNo starts with 'C')
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
print(f"After removing cancellations: {len(df)} rows")
# 3. Remove invalid quantities and prices
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]
print(f"After removing invalid values: {len(df)} rows")
# 4. Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
# 5. Create calculated columns
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month
df['YearMonth'] = df['InvoiceDate'].dt.to_period('M')
# 6. Remove outliers (orders > 99th percentile)
upper_limit = df['TotalPrice'].quantile(0.99)
df = df[df['TotalPrice'] <= upper_limit]
print(f"\nFinal dataset: {len(df)} rows")
print(f"Date range: {df['InvoiceDate'].min()} to {df['InvoiceDate'].max()}")
print(f"Unique customers: {df['CustomerID'].nunique()}")
print(f"Unique products: {df['StockCode'].nunique()}")
Save Cleaned Data:
df.to_csv('cleaned_data.csv', index=False)
print("Cleaned data saved!")
Step 2: RFM Analysis & Customer Segmentation
What is RFM?
RFM stands for:
- Recency: Days since last purchase
- Frequency: Number of purchases
- Monetary: Total amount spent
Calculate RFM Metrics:
# Set analysis date (day after last transaction)
analysis_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)
# Calculate RFM for each customer
rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (analysis_date - x.max()).days,  # Recency
    'InvoiceNo': 'nunique',                                   # Frequency
    'TotalPrice': 'sum'                                       # Monetary
}).reset_index()
rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']
print(rfm.head())
print(f"\nRFM Summary:\n{rfm.describe()}")
Create RFM Scores (1-5 scale):
# Score each metric (5 = best, 1 = worst)
# Recency: lower is better (recent purchase = 5)
# Frequency & Monetary: higher is better
rfm['R_Score'] = pd.qcut(rfm['Recency'], q=5, labels=[5,4,3,2,1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), q=5, labels=[1,2,3,4,5])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], q=5, labels=[1,2,3,4,5])
# Combine into RFM Score
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)
# Simplified segment (high/medium/low)
rfm['RFM_Total'] = rfm[['R_Score', 'F_Score', 'M_Score']].astype(int).sum(axis=1)
def segment_customer(score):
    if score >= 12:
        return 'Champions'
    elif score >= 9:
        return 'Loyal'
    elif score >= 6:
        return 'Potential'
    else:
        return 'At Risk'
rfm['Segment'] = rfm['RFM_Total'].apply(segment_customer)
print(rfm['Segment'].value_counts())
print(f"\nSegment Revenue:\n{rfm.groupby('Segment')['Monetary'].agg(['count', 'sum', 'mean'])}")
Visualize RFM Segments:
# Segment distribution
plt.figure(figsize=(10, 6))
segment_counts = rfm['Segment'].value_counts()
plt.bar(segment_counts.index, segment_counts.values, color=['#2ecc71', '#3498db', '#f39c12', '#e74c3c'])
plt.title('Customer Segmentation by RFM', fontsize=14, fontweight='bold')
plt.xlabel('Segment')
plt.ylabel('Number of Customers')
plt.xticks(rotation=45)
for i, v in enumerate(segment_counts.values):
    plt.text(i, v + 50, str(v), ha='center', fontweight='bold')
plt.tight_layout()
plt.savefig('rfm_segments.png', dpi=300)
plt.show()
# Revenue by segment
segment_revenue = rfm.groupby('Segment')['Monetary'].sum().sort_values(ascending=False)
plt.figure(figsize=(10, 6))
plt.bar(segment_revenue.index, segment_revenue.values, color=['#2ecc71', '#3498db', '#f39c12', '#e74c3c'])
plt.title('Total Revenue by Customer Segment', fontsize=14, fontweight='bold')
plt.xlabel('Segment')
plt.ylabel('Revenue (£)')
plt.xticks(rotation=45)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.savefig('segment_revenue.png', dpi=300)
plt.show()
Insight Example:
"Champions (top 15% of customers) generate 45% of total revenue despite being only 627 customers. Focus retention efforts here."
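The revenue-share figure behind an insight like this comes straight from the RFM table. A minimal sketch (the segment totals below are made up for illustration; in the real project, use the `rfm` frame built above):

```python
import pandas as pd

# Toy segment totals standing in for the real rfm frame
rfm = pd.DataFrame({
    'Segment':  ['Champions', 'Loyal', 'Potential', 'At Risk'],
    'Monetary': [450000.0, 300000.0, 180000.0, 70000.0],
})

# Each segment's share of total revenue, in percent
share = rfm.groupby('Segment')['Monetary'].sum()
share = (share / share.sum() * 100).round(1).sort_values(ascending=False)
print(share)
```

With the real data, the top line of this output is the percentage you quote in the insight.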
Step 3: Cohort Retention Analysis
What is Cohort Analysis?
Track how customers from each signup month behave over time. Shows retention patterns.
Create Cohorts:
# Identify first purchase date for each customer
df['CohortMonth'] = df.groupby('CustomerID')['InvoiceDate'].transform('min').dt.to_period('M')
# Calculate months since first purchase
df['CohortIndex'] = (df['YearMonth'] - df['CohortMonth']).apply(lambda x: x.n)
# Create cohort table
cohort_data = df.groupby(['CohortMonth', 'CohortIndex'])['CustomerID'].nunique().reset_index()
cohort_pivot = cohort_data.pivot(index='CohortMonth', columns='CohortIndex', values='CustomerID')
print("Cohort Table (Customer Counts):")
print(cohort_pivot)
Calculate Retention Rates:
# Retention rate = customers in month N / customers in month 0
cohort_size = cohort_pivot.iloc[:, 0]
retention = cohort_pivot.divide(cohort_size, axis=0) * 100
print("\nRetention Rates (%):")
print(retention.round(1))
Visualize Cohort Heatmap:
plt.figure(figsize=(12, 8))
sns.heatmap(retention, annot=True, fmt='.0f', cmap='RdYlGn', vmin=0, vmax=100, cbar_kws={'label': 'Retention %'})
plt.title('Cohort Retention Heatmap', fontsize=16, fontweight='bold')
plt.xlabel('Months Since First Purchase')
plt.ylabel('Cohort Month')
plt.tight_layout()
plt.savefig('cohort_retention.png', dpi=300)
plt.show()
Insight Example:
"Month 0 → Month 1 retention averages 35%. By Month 3, only 20% of customers return. Implement win-back campaign at 45-day mark."
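The averages quoted here come from collapsing the retention matrix across cohorts. A sketch with a toy matrix standing in for the `retention` pivot built above:

```python
import pandas as pd

# Toy retention matrix (% of each cohort still active N months after
# first purchase); in the real project this is the `retention` pivot
retention = pd.DataFrame(
    {0: [100.0, 100.0], 1: [36.0, 34.0], 2: [25.0, 27.0], 3: [21.0, 19.0]},
    index=['2010-12', '2011-01'],
)

# Average retention per month-since-first-purchase, across all cohorts
avg_retention = retention.mean(axis=0).round(1)
print(avg_retention)
```

Reading off `avg_retention[1]` and `avg_retention[3]` gives the Month 1 and Month 3 figures for the insight.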
Step 4: SQL Analysis (Alternative Approach)
If you have the data in a SQL database, here are equivalent queries (written in PostgreSQL syntax; adapt functions like DATE_TRUNC for other engines):
Top Customers by Revenue:
-- Top 20 customers by total spend
SELECT
    CustomerID,
    COUNT(DISTINCT InvoiceNo) AS total_orders,
    SUM(Quantity) AS total_items,
    SUM(Quantity * UnitPrice) AS total_revenue
FROM online_retail
WHERE CustomerID IS NOT NULL
GROUP BY CustomerID
ORDER BY total_revenue DESC
LIMIT 20;

Monthly Sales Trend:
-- Revenue and orders by month
SELECT
    DATE_TRUNC('month', InvoiceDate) AS month,
    COUNT(DISTINCT InvoiceNo) AS orders,
    COUNT(DISTINCT CustomerID) AS unique_customers,
    SUM(Quantity * UnitPrice) AS revenue,
    SUM(Quantity * UnitPrice) / COUNT(DISTINCT InvoiceNo) AS avg_order_value
FROM online_retail
WHERE CustomerID IS NOT NULL
  AND Quantity > 0
  AND UnitPrice > 0
GROUP BY month
ORDER BY month;

Product Affinity (Market Basket):
-- Products frequently bought together
WITH product_pairs AS (
    SELECT
        a.InvoiceNo,
        a.Description AS product_a,
        b.Description AS product_b
    FROM online_retail a
    JOIN online_retail b
      ON a.InvoiceNo = b.InvoiceNo
     AND a.StockCode < b.StockCode
    WHERE a.CustomerID IS NOT NULL
)
SELECT
    product_a,
    product_b,
    COUNT(*) AS pair_count
FROM product_pairs
GROUP BY product_a, product_b
HAVING COUNT(*) > 50
ORDER BY pair_count DESC
LIMIT 20;

Step 5: Power BI Dashboard
Dashboard Components:
1. KPI Cards (top row):
   - Total Revenue: £9.8M
   - Total Orders: 25,900
   - Unique Customers: 4,372
   - Avg Order Value: £378
2. Sales Trend (line chart):
   - X-axis: Month
   - Y-axis: Revenue
   - Show trend line
3. Top Countries (bar chart):
   - UK: £8.2M (84%)
   - Germany: £0.9M (9%)
   - France: £0.4M (4%)
4. Customer Segments (pie chart):
   - Champions: 45% revenue
   - Loyal: 30%
   - Potential: 18%
   - At Risk: 7%
5. Cohort Retention (matrix visual):
   - Import retention table from Python
   - Conditional formatting (green → red)
6. Filters (slicers):
   - Date range
   - Country
   - Customer segment
Power BI Steps:
- Import Data: Get Data → Text/CSV → cleaned_data.csv
- Create Measures:
Total Revenue = SUM(online_retail[TotalPrice])
Total Orders = DISTINCTCOUNT(online_retail[InvoiceNo])
Unique Customers = DISTINCTCOUNT(online_retail[CustomerID])
Avg Order Value = [Total Revenue] / [Total Orders]
- Build Visuals: Drag fields to canvas
- Format: Apply theme, adjust colors
- Publish: Save as .pbix and publish to the Power BI Service, or share screenshots in your portfolio
Extension Challenges
Take this project further:
1. Advanced Analytics:
- Predict Customer Churn: Build logistic regression model (scikit-learn) to predict which customers won't return
- Product Recommendation: Implement collaborative filtering for "customers who bought X also bought Y"
- Lifetime Value (LTV): Calculate expected revenue per customer based on cohort data
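For the churn model, the first step is defining labels: the dataset has no explicit churn flag, so a common heuristic is to treat customers whose Recency exceeds some cutoff as churned. A sketch (the 90-day cutoff and the sample rows are assumptions, not from the dataset):

```python
import pandas as pd

# Toy RFM table; in the real project this is the rfm frame from Step 2
rfm = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Recency':    [5, 120, 30, 200],   # days since last purchase
    'Frequency':  [10, 1, 4, 2],
    'Monetary':   [500.0, 20.0, 150.0, 60.0],
})

CHURN_CUTOFF_DAYS = 90  # hypothetical threshold; tune against your data

# Label: 1 = churned (no purchase within the cutoff window), 0 = active
rfm['Churned'] = (rfm['Recency'] > CHURN_CUTOFF_DAYS).astype(int)
print(rfm[['CustomerID', 'Churned']])
```

The Frequency and Monetary columns then become features for a scikit-learn `LogisticRegression` fit on these labels.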
2. Additional Analyses:
- Seasonality: Does revenue spike during holidays? (Christmas in UK)
- Day-of-week patterns: Which days have highest sales?
- Product categories: Create product taxonomy, analyze category trends
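The day-of-week question above reduces to a single groupby on the cleaned data. A sketch, with made-up rows standing in for the real `df`:

```python
import pandas as pd

# Toy transactions; in the real project use the cleaned df from Step 1
df = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2011-03-07', '2011-03-08', '2011-03-14']),
    'TotalPrice':  [100.0, 40.0, 60.0],
})

# Revenue by weekday name, highest first
df['DayOfWeek'] = df['InvoiceDate'].dt.day_name()
daily = df.groupby('DayOfWeek')['TotalPrice'].sum().sort_values(ascending=False)
print(daily)
```

The same pattern with `dt.month` answers the seasonality question.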
3. Data Engineering:
- Automate ETL: Create Python script that cleans data automatically
- Schedule dashboard refresh: Set up daily data refresh in Power BI Service
- Database setup: Load data into PostgreSQL, practice SQL queries
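For the ETL automation idea, the Step 1 cleaning rules can be wrapped in one reusable function (a sketch; the column names match the dataset, but the sample rows are invented):

```python
import pandas as pd

def clean_retail(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the Step 1 cleaning rules to a raw online_retail frame."""
    df = df[df['CustomerID'].notna()]                              # need a customer
    df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]      # drop cancellations
    df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()   # drop invalid rows
    df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
    df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
    return df

# Tiny smoke test with made-up rows
raw = pd.DataFrame({
    'InvoiceNo':   ['536365', 'C536366', '536367'],
    'Quantity':    [6, 2, -1],
    'InvoiceDate': ['2010-12-01 08:26', '2010-12-01 08:28', '2010-12-01 08:34'],
    'UnitPrice':   [2.55, 1.85, 1.69],
    'CustomerID':  [17850.0, 17850.0, 13047.0],
})
clean = clean_retail(raw)
print(len(clean))  # only the first row survives
```

Dropping this into a scheduled script (cron, Task Scheduler) gives you the automated ETL step; the same function can feed the PostgreSQL load.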
4. Presentation:
- Write blog post: "5 Insights from Analyzing 500K E-commerce Transactions"
- Create video walkthrough: 5-minute YouTube demo of your analysis
- LinkedIn post: Share 3 key insights with visualizations