Topic 12 of 12

Capstone: End-to-End Analytics Project

This is where everything comes together. Build a complete analytics project from scratch to showcase your skills.

πŸ“šAdvanced
⏱️25 min
βœ…0 quizzes

Project Overview

Scenario: You're a data analyst at "ShopKart", an e-commerce company. Leadership wants insights on Q1 2026 sales performance to guide Q2 strategy.

Your mission: Analyze sales data, identify trends, and make actionable recommendations.

The Analytics Workflow

Phase 1: Define the Problem

Phase 2: Collect Data

Phase 3: Clean Data

Phase 4: Explore & Analyze

Phase 5: Visualize Insights

Phase 6: Communicate Findings


Phase 1: Define the Problem

Business Questions

  1. What was our Q1 revenue and how does it compare to Q4 2025?
  2. Which products drove the most revenue?
  3. Which regions are performing best/worst?
  4. What's our customer retention rate?
  5. Are there seasonal patterns in sales?

Success Metrics

  • Total revenue
  • Month-over-month growth
  • Top 10 products by revenue
  • Revenue by region
  • Repeat customer rate

Phase 2: Collect Data

Data Sources

Tables to pull:

  1. orders - order_id, customer_id, product_id, amount, date, status
  2. customers - customer_id, name, email, city, region, signup_date
  3. products - product_id, name, category, price

SQL Query:

query.sqlSQL
SELECT
    o.order_id,
    o.customer_id,
    o.product_id,
    o.amount,
    o.order_date,
    o.status,
    c.name AS customer_name,
    c.city,
    c.region,
    p.product_name,
    p.category
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date BETWEEN '2026-01-01' AND '2026-03-31'
    AND o.status = 'Delivered';

Export: Save as q1_sales.csv


Phase 3: Clean Data

code.pyPython
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('q1_sales.csv')

# Check shape and info
print(f"Rows: {len(df)}, Columns: {len(df.columns)}")
print(df.info())

# 1. Handle missing values
print(df.isnull().sum())
df = df.dropna(subset=['amount', 'customer_id'])  # Drop critical nulls
df['city'].fillna('Unknown', inplace=True)        # Fill non-critical

# 2. Remove duplicates
print(f"Duplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()

# 3. Fix data types
df['order_date'] = pd.to_datetime(df['order_date'])
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# 4. Clean text
df['city'] = df['city'].str.strip().str.title()
df['region'] = df['region'].str.upper()

# 5. Remove outliers (orders > β‚Ή1M likely errors)
df = df[df['amount'] < 1000000]

# 6. Create new columns
df['month'] = df['order_date'].dt.month
df['week'] = df['order_date'].dt.isocalendar().week

print(f"After cleaning: {len(df)} rows")

Phase 4: Explore & Analyze

1. Overall Revenue

code.pyPython
total_revenue = df['amount'].sum()
print(f"Q1 Total Revenue: β‚Ή{total_revenue:,.0f}")

# Monthly breakdown
monthly_revenue = df.groupby('month')['amount'].sum()
print(monthly_revenue)

Output:

Q1 Total Revenue: β‚Ή12,450,000 month 1 3,800,000 2 4,200,000 3 4,450,000

Insight: Revenue growing month-over-month (+10% Jan→Feb, +6% Feb→Mar)

2. Top Products

code.pyPython
top_products = df.groupby('product_name')['amount'].sum().sort_values(ascending=False).head(10)
print(top_products)

Output:

Laptop Pro 2,500,000 Wireless Earbuds 1,800,000 Smart Watch 1,200,000 ...

Insight: Electronics dominate top 10 (70% of revenue)

3. Regional Performance

code.pyPython
regional_revenue = df.groupby('region').agg({
    'amount': 'sum',
    'order_id': 'count'
}).rename(columns={'amount': 'revenue', 'order_id': 'orders'})

regional_revenue['avg_order_value'] = regional_revenue['revenue'] / regional_revenue['orders']
print(regional_revenue.sort_values('revenue', ascending=False))

Output: | Region | Revenue | Orders | AOV | |--------|---------|--------|-----| | WEST | β‚Ή4.5M | 2,100 | β‚Ή2,143 | | NORTH | β‚Ή3.8M | 1,900 | β‚Ή2,000 | | SOUTH | β‚Ή2.6M | 1,500 | β‚Ή1,733 | | EAST | β‚Ή1.5M | 1,000 | β‚Ή1,500 |

Insight: West leads in revenue, but East has lowest AOV (opportunity!)

4. Customer Retention

code.pyPython
# Customers who ordered in Q1
q1_customers = df['customer_id'].unique()

# Load Q4 2025 data
q4_df = pd.read_csv('q4_sales.csv')
q4_customers = q4_df['customer_id'].unique()

# Repeat customers
repeat_customers = np.intersect1d(q1_customers, q4_customers)

retention_rate = len(repeat_customers) / len(q4_customers) * 100
print(f"Retention Rate: {retention_rate:.1f}%")

Output: Retention Rate: 42%

Insight: Room for improvement (industry benchmark: 50-60%)

5. Trend Analysis

code.pyPython
daily_sales = df.groupby(df['order_date'].dt.date)['amount'].sum()

# 7-day moving average
daily_sales_ma = daily_sales.rolling(window=7).mean()

import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.plot(daily_sales.index, daily_sales, label='Daily Sales', alpha=0.5)
plt.plot(daily_sales_ma.index, daily_sales_ma, label='7-day MA', linewidth=2)
plt.title('Q1 Daily Sales Trend')
plt.legend()
plt.savefig('sales_trend.png')

Insight: Clear weekly pattern (weekends higher), gradual upward trend


Phase 5: Visualize Insights

Dashboard Layout

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Total β”‚ Growth β”‚ Avg Order β”‚ β”‚ Revenue β”‚ % MoM β”‚ Value β”‚ β”‚ β‚Ή12.45M β”‚ +8% β”‚ β‚Ή1,950 β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ Monthly Revenue Trend (Line Chart) β”‚ β”‚ [Jan, Feb, Mar trending up] β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ Top Products β”‚ Regional Distribution β”‚ β”‚ (Bar Chart) β”‚ (Map or Bar) β”‚ β”‚ β”‚ West > North > South > Eastβ”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ Product Category Mix (Pie/Donut) β”‚ β”‚ Electronics 55% | Fashion 25% | Other 20%β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Tools: Power BI, Tableau, or Python (Plotly/Matplotlib)


Phase 6: Communicate Findings

Executive Summary

To: Leadership Team From: Data Analytics Team Re: Q1 2026 Sales Performance Analysis

Key Findings:

  1. βœ… Strong Q1 Performance

    • Total revenue: β‚Ή12.45M
    • 8% MoM growth trend
    • Beat Q1 2025 by 15%
  2. πŸ“Š Product Insights

    • Electronics drive 55% of revenue
    • Top 3 products account for 45% of sales
    • Opportunity: Expand mid-range product offerings
  3. πŸ—ΊοΈ Regional Performance

    • West region leads (36% of revenue)
    • East region lags with lowest AOV (β‚Ή1,500 vs β‚Ή2,143 company avg)
    • Opportunity: Targeted campaigns in East
  4. πŸ”„ Customer Retention

    • 42% retention rate (below 50% target)
    • Opportunity: Loyalty program, email campaigns
  5. πŸ“ˆ Trends

    • Clear weekly seasonality (weekend peaks)
    • Gradual upward trajectory through Q1
    • No major drop-offs or anomalies

Recommendations:

  1. Expand East region presence

    • Launch regional marketing campaigns
    • Offer region-specific discounts
    • Target AOV increase to β‚Ή2,000+
  2. Improve retention to 50%+

    • Implement loyalty rewards program
    • Personalized email sequences
    • Win-back campaigns for lapsed customers
  3. Optimize inventory for top products

    • Ensure Laptop Pro, Earbuds, Smart Watch always in stock
    • Negotiate better supplier rates (top 3 = high volume)
  4. Leverage weekend traffic

    • Schedule promotions for Fri-Sun
    • Increase ad spend on weekends
    • Flash sales during peak times

Next Steps:

  • Implement recommendations in Q2
  • Monitor weekly KPIs
  • Run A/B tests on retention tactics

Building Your Portfolio

What to Include

  1. README.md

    • Project overview
    • Business questions
    • Data sources
    • Key findings
  2. Jupyter Notebook

    • Complete analysis code
    • Visualizations
    • Commentary
  3. Dashboard Screenshot/Link

    • Power BI / Tableau dashboard
    • Interactive if possible
  4. Presentation Deck

    • Executive summary (5-10 slides)
    • Charts and insights
    • Recommendations

Portfolio Projects Ideas

  1. Sales Analysis (like this capstone)
  2. Customer Segmentation (RFM analysis)
  3. Churn Prediction (identify at-risk customers)
  4. A/B Test Analysis (landing page experiment)
  5. Social Media Sentiment Analysis (Twitter/Reddit)
  6. Web Analytics (Google Analytics deep dive)

Summary: The Complete Analytics Workflow

βœ… Define: Clear business questions and success metrics βœ… Collect: SQL queries, APIs, web scraping βœ… Clean: Handle nulls, duplicates, outliers, data types βœ… Analyze: Aggregations, trends, comparisons, correlations βœ… Visualize: Dashboards, charts, clear storytelling βœ… Communicate: Insights, recommendations, action items

You've completed the DataPath! πŸŽ‰

You now have the skills to:

  • Understand data types and analytics fundamentals
  • Master Excel, SQL, Python, and Pandas
  • Create compelling visualizations and dashboards
  • Apply statistics and run A/B tests
  • Execute end-to-end analytics projects

Next Steps:

  1. Build 2-3 portfolio projects
  2. Share on LinkedIn/GitHub
  3. Apply for data analyst roles
  4. Keep learning and growing!

Congratulations on finishing the course! πŸš€