#1 Data Analytics Program in India
₹2,499₹1,499Enroll Now
15 min read min read

Sales Data Analysis Project

Build a complete sales analysis project from scratch

Sales Data Analysis Project

Sales Data Analysis

Project Overview

In this project, you will analyze a company's sales data to find:

  • Which products sell the most?
  • Which months have highest sales?
  • Which cities generate most revenue?
  • What patterns exist in the data?

Skills you'll practice:

  • Data loading and cleaning
  • Exploratory data analysis
  • Data visualization
  • Drawing business insights

Step 1: Setup and Import Libraries

First, let's import everything we need:

code.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Make plots look nice
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print("Libraries imported successfully!")

Step 2: Create Sample Data

For this project, we'll create realistic sales data:

code.py
# Create sample sales data
np.random.seed(42)

# Generate 1000 sales records
n_records = 1000

data = {
    'Order_ID': range(1001, 1001 + n_records),
    'Product': np.random.choice(
        ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Charger', 'Cable'],
        n_records
    ),
    'Quantity': np.random.randint(1, 5, n_records),
    'Price': np.random.choice([999, 699, 499, 149, 29, 15], n_records),
    'City': np.random.choice(
        ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
        n_records
    ),
    'Month': np.random.choice(
        ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
         'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
        n_records
    )
}

df = pd.DataFrame(data)

# Calculate total sale amount
df['Total'] = df['Quantity'] * df['Price']

print("Data created successfully!")
print(f"Total records: {len(df)}")

Step 3: Explore the Data

Let's understand what we have:

code.py
# First look at the data
print("=== First 5 Rows ===")
print(df.head())
Order_IDProductQuantityPriceCityMonthTotal
1001Laptop2999ChicagoMar1998
1002Phone1699HoustonJul699
1003Cable315New YorkJan45
1004Headphones2149PhoenixNov298
1005Tablet1499Los AngelesMay499
code.py
# Data info
print("\n=== Data Info ===")
print(df.info())
ColumnTypeNon-Null
Order_IDint641000
Productobject1000
Quantityint641000
Priceint641000
Cityobject1000
Monthobject1000
Totalint641000
code.py
# Basic statistics
print("\n=== Statistics ===")
print(df.describe())
StatQuantityPriceTotal
count100010001000
mean2.5398995
std1.13561124
min11515
max49993996

Step 4: Data Cleaning

Check for issues:

code.py
# Check for missing values
print("=== Missing Values ===")
print(df.isnull().sum())
ColumnMissing
Order_ID0
Product0
Quantity0
Price0
City0
Month0
Total0
code.py
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")

# Check unique values
print("\n=== Unique Values ===")
print(f"Products: {df['Product'].nunique()}")
print(f"Cities: {df['City'].nunique()}")
print(f"Months: {df['Month'].nunique()}")

Our data is clean! No missing values or duplicates.


Step 5: Sales by Product

Question: Which products sell the most?

code.py
# Total sales by product
product_sales = df.groupby('Product')['Total'].sum().sort_values(ascending=False)

print("=== Sales by Product ===")
print(product_sales)
ProductTotal Sales ($)
Laptop312,687
Phone245,790
Tablet178,143
Headphones52,318
Charger10,324
Cable5,670
code.py
# Visualize
plt.figure(figsize=(10, 6))
product_sales.plot(kind='bar', color='steelblue', edgecolor='black')
plt.title('Total Sales by Product', fontsize=14, fontweight='bold')
plt.xlabel('Product')
plt.ylabel('Total Sales ($)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Sales by Product

Insight: Laptops generate the most revenue, followed by Phones and Tablets.


Step 6: Sales by City

Question: Which cities generate most revenue?

code.py
# Sales by city
city_sales = df.groupby('City')['Total'].sum().sort_values(ascending=False)

print("=== Sales by City ===")
print(city_sales)
CityTotal Sales ($)
New York198,450
Los Angeles185,230
Chicago167,890
Houston152,340
Phoenix100,022
code.py
# Pie chart for city distribution
plt.figure(figsize=(8, 8))
plt.pie(city_sales, labels=city_sales.index, autopct='%1.1f%%',
        colors=sns.color_palette('pastel'), startangle=90)
plt.title('Sales Distribution by City', fontsize=14, fontweight='bold')
plt.show()

Sales by City

Insight: New York and Los Angeles are the top markets.


Step 7: Monthly Sales Trend

Question: Which months have highest sales?

code.py
# Define month order
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Sales by month
monthly_sales = df.groupby('Month')['Total'].sum()
monthly_sales = monthly_sales.reindex(month_order)

print("=== Monthly Sales ===")
print(monthly_sales)
MonthTotal Sales ($)
Jan62,340
Feb58,120
Mar71,450
Apr65,890
May69,230
Jun72,100
Jul78,450
Aug75,320
Sep68,900
Oct71,230
Nov85,670
Dec95,232
code.py
# Line chart for monthly trend
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales.index, monthly_sales.values, marker='o',
         linewidth=2, markersize=8, color='coral')
plt.fill_between(monthly_sales.index, monthly_sales.values, alpha=0.3, color='coral')
plt.title('Monthly Sales Trend', fontsize=14, fontweight='bold')
plt.xlabel('Month')
plt.ylabel('Total Sales ($)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Monthly Sales Trend

Insight: Sales peak in November and December (holiday season).


Step 8: Product Performance by City

Question: Do different cities prefer different products?

code.py
# Create pivot table
pivot = df.pivot_table(values='Total', index='City',
                       columns='Product', aggfunc='sum')

print("=== Sales: City vs Product ===")
print(pivot)
CityCableChargerHeadphonesLaptopPhoneTablet
Chicago1,0202,03010,43058,32052,10035,990
Houston1,2001,8909,87062,45045,32031,610
Los Angeles1,3502,45011,23068,90055,80045,500
New York1,5002,10012,45072,34058,23040,830
Phoenix6001,8548,33850,67734,34024,213
code.py
# Heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(pivot, annot=True, fmt='.0f', cmap='YlOrRd')
plt.title('Sales Heatmap: City vs Product', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Sales Heatmap

Insight: Laptops sell best across all cities. New York leads in most categories.


Step 9: Order Size Analysis

Question: What's the typical order size?

code.py
# Order quantity distribution
print("=== Quantity Distribution ===")
print(df['Quantity'].value_counts().sort_index())
QuantityCount
1248
2256
3251
4245
code.py
# Histogram
plt.figure(figsize=(8, 5))
df['Quantity'].hist(bins=4, edgecolor='black', color='lightgreen')
plt.title('Order Quantity Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Quantity')
plt.ylabel('Number of Orders')
plt.tight_layout()
plt.show()

Insight: Order quantities are evenly distributed (1-4 items per order).


Step 10: Key Metrics Summary

Let's create a summary dashboard:

code.py
# Calculate key metrics
total_revenue = df['Total'].sum()
total_orders = len(df)
avg_order_value = df['Total'].mean()
top_product = df.groupby('Product')['Total'].sum().idxmax()
top_city = df.groupby('City')['Total'].sum().idxmax()
best_month = df.groupby('Month')['Total'].sum().idxmax()

print("=" * 50)
print("        SALES DASHBOARD SUMMARY")
print("=" * 50)
print(f"Total Revenue:      ${total_revenue:,.2f}")
print(f"Total Orders:       {total_orders:,}")
print(f"Avg Order Value:    ${avg_order_value:,.2f}")
print(f"Top Product:        {top_product}")
print(f"Top City:           {top_city}")
print(f"Best Month:         {best_month}")
print("=" * 50)
MetricValue
Total Revenue$804,932
Total Orders1,000
Avg Order Value$804.93
Top ProductLaptop
Top CityNew York
Best MonthDecember

Dashboard Summary


Step 11: Save Your Analysis

code.py
# Save cleaned data
df.to_csv('sales_analysis_results.csv', index=False)

# Save summary to text file
with open('sales_summary.txt', 'w') as f:
    f.write("SALES ANALYSIS SUMMARY\n")
    f.write("=" * 40 + "\n")
    f.write(f"Total Revenue: ${total_revenue:,.2f}\n")
    f.write(f"Total Orders: {total_orders}\n")
    f.write(f"Top Product: {top_product}\n")
    f.write(f"Top City: {top_city}\n")
    f.write(f"Best Month: {best_month}\n")

print("Files saved successfully!")

Business Recommendations

Based on our analysis:

FindingRecommendation
Laptops = highest revenueFocus marketing on laptop promotions
Dec & Nov = peak salesPlan inventory for holiday season
New York = top marketConsider expanding operations there
Low cable/charger salesBundle accessories with main products
Even order quantityNo need to push larger orders

Complete Code

Here's the full code in one place:

code.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Setup
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

# Create data
n_records = 1000
data = {
    'Order_ID': range(1001, 1001 + n_records),
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Headphones', 'Charger', 'Cable'], n_records),
    'Quantity': np.random.randint(1, 5, n_records),
    'Price': np.random.choice([999, 699, 499, 149, 29, 15], n_records),
    'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], n_records),
    'Month': np.random.choice(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], n_records)
}
df = pd.DataFrame(data)
df['Total'] = df['Quantity'] * df['Price']

# Analysis
print("Top 5 Products by Revenue:")
print(df.groupby('Product')['Total'].sum().sort_values(ascending=False))

print("\nTop Cities by Revenue:")
print(df.groupby('City')['Total'].sum().sort_values(ascending=False))

print("\nMonthly Sales:")
print(df.groupby('Month')['Total'].sum())

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Product sales
df.groupby('Product')['Total'].sum().sort_values().plot(kind='barh', ax=axes[0,0], color='steelblue')
axes[0,0].set_title('Sales by Product')

# City sales
df.groupby('City')['Total'].sum().plot(kind='pie', ax=axes[0,1], autopct='%1.1f%%')
axes[0,1].set_title('Sales by City')

# Monthly trend
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
monthly = df.groupby('Month')['Total'].sum().reindex(month_order)
axes[1,0].plot(monthly.index, monthly.values, marker='o', color='coral')
axes[1,0].set_title('Monthly Sales Trend')
axes[1,0].tick_params(axis='x', rotation=45)

# Heatmap
pivot = df.pivot_table(values='Total', index='City', columns='Product', aggfunc='sum')
sns.heatmap(pivot, ax=axes[1,1], cmap='YlOrRd', annot=True, fmt='.0f')
axes[1,1].set_title('City vs Product Sales')

plt.tight_layout()
plt.savefig('sales_dashboard.png', dpi=300)
plt.show()

print("\nProject Complete!")

What You Learned

  • Loading and exploring data with Pandas
  • Cleaning and validating data
  • Grouping and aggregating data
  • Creating various visualizations
  • Drawing business insights from data
  • Saving results and reports

Congratulations! You've completed your first data analysis project!

What's Next?

Try the Customer Churn Prediction project to learn machine learning.