Topic 32 of

Python Project — Zomato Restaurant Analysis

Theory is easy. Messy real-world data is hard. This project gives you a complete Zomato dataset and a step-by-step path from raw CSV to insights — just like a real analyst's day.

📚Intermediate
⏱️18 min
10 quizzes
🎯

Project Overview — What We'll Build

In this project, you'll analyze real restaurant data from Zomato to answer business questions:

Key Questions We'll Answer:

  1. Which cities have the highest-rated restaurants?
  2. What's the relationship between price range and rating?
  3. Which cuisines are most popular in each city?
  4. Do restaurants accepting online orders have higher ratings?
  5. What factors predict a restaurant's success?

Skills You'll Practice:

  • Loading and exploring messy CSV data
  • Handling missing values and data type issues
  • Cleaning text data (restaurant names, cuisines)
  • Grouping and aggregating by multiple dimensions
  • Creating visualizations to communicate insights
  • Drawing business conclusions from data

Tools:

  • Pandas for data manipulation
  • NumPy for numerical operations
  • Matplotlib and Seaborn for visualization

Dataset Information

Source: Kaggle — Zomato Bangalore Restaurants

Size: ~56,000 restaurants from Bangalore

Columns:

  • name — Restaurant name
  • online_order — Accepts online orders (Yes/No)
  • book_table — Table booking available (Yes/No)
  • rate — Average rating (e.g., "4.1/5", "3.8 /5")
  • votes — Number of votes/reviews
  • location — Area/neighborhood
  • rest_type — Restaurant type (Casual Dining, Cafe, etc.)
  • cuisines — Cuisines offered (comma-separated)
  • approx_cost(for two people) — Estimated cost for two
  • listed_in(type) — Meal type (Delivery, Dine-out, etc.)

Download: Get the dataset from Kaggle or use this direct CSV link (example mirror).

📥

Step 1 — Setup and Load Data

Install Required Libraries

$ terminalBash
pip install pandas numpy matplotlib seaborn

Load the Dataset

code.pyPython
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visualization settings
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Load data
df = pd.read_csv('zomato.csv')

# First look at the data
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")

Expected Output:

name online_order book_table rate votes \ 0 Jalsa Yes Yes 4.1/5 775 1 Spice Elephant Yes No 4.1/5 787 2 San Churro Yes No 3.8/5 918 3 Addhuri Udupi Bhojana No No 3.7/5 88 4 Grand Village No No 3.8/5 166 location rest_type cuisines approx_cost(for two people) \ 0 Banashankari Casual Dining North Indian, Mughlai, Chinese 800 1 Banashankari Casual Dining North Indian, Chinese, Biryani 800 2 Banashankari Cafe, Casual Dining Cafe, Mexican 800 3 Banashankari Quick Bites South Indian, North Indian 300 4 Basavanagudi Casual Dining North Indian, Rajasthani 600 listed_in(type) 0 Buffet 1 Buffet 2 Buffet 3 Delivery 4 Dine-out Dataset shape: (51717, 11)

Get Initial Insights

code.pyPython
# Data info
print(df.info())

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())

# Statistical summary
print("\n" + "="*50)
print(df.describe())
🧹

Step 2 — Clean the Data

Real-world data is messy. Let's clean it systematically.

Fix the Rating Column

The rate column has formats like "4.1/5", "NEW", "-", "3.8 /5". Let's standardize it:

code.pyPython
# View unique rating formats
print(df['rate'].value_counts().head(10))

# Clean ratings: extract numeric value
def clean_rating(rate):
    if pd.isna(rate):
        return np.nan
    if rate == 'NEW' or rate == '-':
        return np.nan
    # Extract numeric part (e.g., "4.1/5" → 4.1)
    try:
        return float(rate.split('/')[0].strip())
    except:
        return np.nan

df['rating'] = df['rate'].apply(clean_rating)

# Check the result
print(f"\nRatings cleaned. Sample values:")
print(df[['rate', 'rating']].head(10))

# Drop original rate column
df = df.drop(columns=['rate'])

Fix the Cost Column

code.pyPython
# Current format: "800" (string with commas for large values)
print(df['approx_cost(for two people)'].value_counts().head())

# Clean cost: remove commas, convert to numeric
df['cost_for_two'] = pd.to_numeric(
    df['approx_cost(for two people)'].str.replace(',', ''),
    errors='coerce'
)

# Drop original column
df = df.drop(columns=['approx_cost(for two people)'])

# Check for outliers
print(f"\nCost statistics:")
print(df['cost_for_two'].describe())

Handle Missing Values

code.pyPython
# Missing value summary
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing': missing,
    'Percentage': missing_pct
}).sort_values('Missing', ascending=False)

print(missing_df[missing_df['Missing'] > 0])

# Strategy:
# - rating: Drop rows (can't analyze restaurants without ratings)
# - cost_for_two: Fill with median by rest_type
# - cuisines: Fill with "Not Specified"

# Drop rows with missing ratings
df = df.dropna(subset=['rating'])

# Fill missing costs with median by restaurant type
df['cost_for_two'] = df.groupby('rest_type')['cost_for_two'].transform(
    lambda x: x.fillna(x.median())
)

# Fill missing cuisines
df['cuisines'] = df['cuisines'].fillna('Not Specified')

print(f"\nAfter cleaning: {df.shape[0]} rows remaining")

Create Additional Columns

code.pyPython
# Binary flags
df['accepts_online_orders'] = (df['online_order'] == 'Yes').astype(int)
df['table_booking'] = (df['book_table'] == 'Yes').astype(int)

# Price category
df['price_category'] = pd.cut(
    df['cost_for_two'],
    bins=[0, 300, 600, 1000, 10000],
    labels=['Budget', 'Mid-Range', 'Premium', 'Luxury']
)

# Rating category
df['rating_category'] = pd.cut(
    df['rating'],
    bins=[0, 2.5, 3.5, 4.0, 5.0],
    labels=['Poor', 'Average', 'Good', 'Excellent']
)

print(df[['name', 'rating', 'rating_category', 'cost_for_two', 'price_category']].head())

⚠️ CheckpointQuiz error: Missing or invalid options array

🔍

Step 3 — Exploratory Data Analysis

Now that data is clean, let's answer our business questions.

Q1: Which locations have the highest-rated restaurants?

code.pyPython
# Top 15 locations by average rating (min 100 restaurants)
location_stats = df.groupby('location').agg({
    'rating': 'mean',
    'name': 'count'
}).rename(columns={'name': 'restaurant_count'})

top_locations = location_stats[location_stats['restaurant_count'] >= 100].sort_values(
    'rating', ascending=False
).head(15)

print(top_locations)

# Visualize
plt.figure(figsize=(12, 6))
sns.barplot(data=top_locations.reset_index(), x='location', y='rating', palette='viridis')
plt.xticks(rotation=45, ha='right')
plt.title('Top 15 Locations by Average Restaurant Rating', fontsize=14, fontweight='bold')
plt.xlabel('Location')
plt.ylabel('Average Rating')
plt.axhline(df['rating'].mean(), color='red', linestyle='--', label=f'City Average: {df["rating"].mean():.2f}')
plt.legend()
plt.tight_layout()
plt.show()

Q2: Price vs Rating — Do expensive restaurants rate higher?

code.pyPython
# Average rating by price category
price_rating = df.groupby('price_category')['rating'].agg(['mean', 'median', 'count'])
print(price_rating)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
sns.boxplot(data=df, x='price_category', y='rating', palette='Set2', ax=axes[0])
axes[0].set_title('Rating Distribution by Price Category', fontweight='bold')
axes[0].set_xlabel('Price Category')
axes[0].set_ylabel('Rating')

# Scatter plot
axes[1].scatter(df['cost_for_two'], df['rating'], alpha=0.3, s=10)
axes[1].set_xlabel('Cost for Two (₹)')
axes[1].set_ylabel('Rating')
axes[1].set_title('Cost vs Rating — Scatter Plot', fontweight='bold')
axes[1].axhline(df['rating'].mean(), color='red', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

# Correlation
print(f"\nCorrelation between cost and rating: {df['cost_for_two'].corr(df['rating']):.3f}")

Q3: Most popular cuisines

code.pyPython
# Cuisines are comma-separated. Split and count.
from collections import Counter

all_cuisines = []
for cuisines_str in df['cuisines'].dropna():
    cuisines = [c.strip() for c in cuisines_str.split(',')]
    all_cuisines.extend(cuisines)

cuisine_counts = Counter(all_cuisines).most_common(15)

# Convert to DataFrame
cuisine_df = pd.DataFrame(cuisine_counts, columns=['Cuisine', 'Count'])

# Visualize
plt.figure(figsize=(12, 6))
sns.barplot(data=cuisine_df, x='Count', y='Cuisine', palette='magma')
plt.title('Top 15 Most Popular Cuisines in Bangalore', fontsize=14, fontweight='bold')
plt.xlabel('Number of Restaurants')
plt.ylabel('Cuisine')
plt.tight_layout()
plt.show()

Q4: Online orders vs ratings

code.pyPython
# Compare ratings: online vs no online
online_comparison = df.groupby('online_order')['rating'].agg(['mean', 'median', 'count'])
print(online_comparison)

# Statistical test (t-test)
from scipy import stats

online_yes = df[df['online_order'] == 'Yes']['rating']
online_no = df[df['online_order'] == 'No']['rating']

t_stat, p_value = stats.ttest_ind(online_yes, online_no)
print(f"\nT-test: t={t_stat:.3f}, p-value={p_value:.4f}")
if p_value < 0.05:
    print("Significant difference! Restaurants with online orders have different ratings.")

# Visualize
plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='online_order', y='rating', palette='Set1')
plt.title('Rating Distribution: Online Orders vs No Online Orders', fontsize=14, fontweight='bold')
plt.xlabel('Accepts Online Orders')
plt.ylabel('Rating')
plt.tight_layout()
plt.show()

Q5: Multi-factor analysis

code.pyPython
# Rating by price category and online orders
pivot = df.pivot_table(
    values='rating',
    index='price_category',
    columns='online_order',
    aggfunc='mean'
)

print(pivot)

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(pivot, annot=True, cmap='YlGnBu', fmt='.2f', linewidths=1)
plt.title('Average Rating: Price Category vs Online Orders', fontsize=14, fontweight='bold')
plt.xlabel('Accepts Online Orders')
plt.ylabel('Price Category')
plt.tight_layout()
plt.show()
💡

Step 4 — Key Insights and Conclusions

After completing the analysis, summarize findings for stakeholders.

Key Findings

1. Location Matters

  • Premium neighborhoods (Koramangala, Indiranagar) have higher average ratings (4.0+)
  • Emerging areas have more variability in quality
  • Recommendation: Target expansion in proven high-rating locations

2. Price ≠ Quality

  • Weak correlation between cost and rating (r ≈ 0.15)
  • Budget restaurants can achieve excellent ratings with good execution
  • Luxury doesn't guarantee satisfaction

3. Online Orders = Higher Ratings

  • Restaurants accepting online orders: 3.95 average
  • No online orders: 3.65 average
  • Statistically significant (p < 0.001)
  • Recommendation: Encourage online ordering adoption

4. North Indian Dominates

  • North Indian cuisine is most common (8,000+ restaurants)
  • Followed by Chinese, South Indian, Fast Food
  • Niche cuisines (Italian, Continental) are underserved — opportunity?

5. Table Booking Correlates with Higher Ratings

  • Table booking available: 4.05 average
  • No table booking: 3.75 average
  • Suggests restaurants investing in service get rewarded

Business Recommendations

For Zomato:

  1. Incentivize restaurants to enable online orders (data shows rating boost)
  2. Focus acquisition efforts on high-rating neighborhoods
  3. Help budget restaurants market their quality (price doesn't predict rating)

For Restaurant Owners:

  1. Enable online orders — it correlates with 0.3 higher rating
  2. Invest in table booking systems for dine-in restaurants
  3. Location strategy: operate in proven neighborhoods or differentiate in emerging areas
📝

Complete Code — All in One Place

Here's the entire analysis in one runnable script:

code.pyPython
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from collections import Counter

# Setup
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Load data
df = pd.read_csv('zomato.csv')

# Clean rating
def clean_rating(rate):
    if pd.isna(rate) or rate in ['NEW', '-']:
        return np.nan
    try:
        return float(rate.split('/')[0].strip())
    except:
        return np.nan

df['rating'] = df['rate'].apply(clean_rating)

# Clean cost
df['cost_for_two'] = pd.to_numeric(
    df['approx_cost(for two people)'].str.replace(',', ''),
    errors='coerce'
)

# Handle missing
df = df.dropna(subset=['rating'])
df['cost_for_two'] = df.groupby('rest_type')['cost_for_two'].transform(
    lambda x: x.fillna(x.median())
)
df['cuisines'] = df['cuisines'].fillna('Not Specified')

# Feature engineering
df['accepts_online_orders'] = (df['online_order'] == 'Yes').astype(int)
df['price_category'] = pd.cut(
    df['cost_for_two'],
    bins=[0, 300, 600, 1000, 10000],
    labels=['Budget', 'Mid-Range', 'Premium', 'Luxury']
)

# Analysis 1: Top locations
location_stats = df.groupby('location').agg({
    'rating': 'mean',
    'name': 'count'
}).rename(columns={'name': 'count'})
top_locations = location_stats[location_stats['count'] >= 100].sort_values(
    'rating', ascending=False
).head(15)

plt.figure(figsize=(12, 6))
sns.barplot(data=top_locations.reset_index(), x='location', y='rating', palette='viridis')
plt.xticks(rotation=45, ha='right')
plt.title('Top Locations by Average Rating')
plt.tight_layout()
plt.savefig('top_locations.png', dpi=300)
plt.show()

# Analysis 2: Price vs Rating
print(f"Cost-Rating Correlation: {df['cost_for_two'].corr(df['rating']):.3f}")

plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='price_category', y='rating', palette='Set2')
plt.title('Rating by Price Category')
plt.savefig('price_rating.png', dpi=300)
plt.show()

# Analysis 3: Online orders
online_yes = df[df['online_order'] == 'Yes']['rating']
online_no = df[df['online_order'] == 'No']['rating']
t_stat, p_value = stats.ttest_ind(online_yes, online_no)

print(f"\nOnline Orders Impact:")
print(f"  With online: {online_yes.mean():.2f}")
print(f"  Without online: {online_no.mean():.2f}")
print(f"  T-test p-value: {p_value:.4f}")

# Save cleaned data
df.to_csv('zomato_cleaned.csv', index=False)
print("\nCleaned data saved to: zomato_cleaned.csv")

⚠️ FinalQuiz error: Missing or invalid questions array

⚠️ SummarySection error: Missing or invalid items array

Received: {"hasItems":false,"isArray":false}


Extension Challenges

Ready to take this project further? Try these:

1. Cuisine Combination Analysis

  • Which cuisine combinations (e.g., "Chinese, North Indian") are most popular?
  • Do multi-cuisine restaurants rate higher or lower than specialists?
  • Create a network graph showing cuisine co-occurrence

2. Location Clustering

  • Group similar locations using restaurant features (avg rating, cost, cuisine mix)
  • Use K-means clustering to identify "restaurant neighborhood archetypes"
  • Visualize clusters on a map (if you add latitude/longitude data)

3. Predictive Modeling

  • Build a regression model to predict restaurant rating from features (cost, location, online orders, cuisines)
  • Which features matter most? (Use feature importance from Random Forest)
  • Can you predict success for a new restaurant concept?

4. Time-Series Analysis

  • If you can find historical Zomato data (ratings over time), analyze trends
  • Do restaurants decline in rating after initial hype?
  • Identify restaurants improving vs declining

5. Sentiment Analysis

  • Scrape restaurant reviews (check Zomato's terms of service)
  • Use NLP to analyze review sentiment
  • Does text sentiment correlate with numeric ratings?

Add these to your portfolio to stand out!