What is Dimensional Modeling?
Dimensional modeling is a design technique for structuring data warehouses, optimized for analytical queries rather than transactions. It organizes data into fact tables (quantitative measures) and dimension tables (descriptive attributes).
Why Dimensional Modeling?
OLTP databases (MySQL, PostgreSQL) use normalized schemas (3NF) to eliminate redundancy and speed up writes:
```
orders (order_id, customer_id, product_id, amount)
customers (customer_id, name, city_id)
cities (city_id, city_name, state_id)
states (state_id, state_name)
```
Problem for analytics: To find "revenue by state", you must join 4 tables:
```sql
SELECT s.state_name, SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN cities ci ON c.city_id = ci.city_id
JOIN states s ON ci.state_id = s.state_id
GROUP BY s.state_name;
```

This is slow for billions of rows.
Dimensional modeling denormalizes data to reduce joins:
```
fact_orders (order_id, customer_id, product_id, amount, city, state)
dim_customers (customer_id, name, city, state, tier)
```
Same query, simpler:
```sql
SELECT state, SUM(amount)
FROM fact_orders
GROUP BY state;
```

Trade-off: Storage increases (the state name is duplicated across millions of order rows), but queries are 10-100x faster.
Normalized schemas are like filing documents in folders by category (space-efficient but slow to find). Dimensional modeling is like keeping copies of frequently accessed documents on your desk (uses more space but instant access).
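The trade-off can be demonstrated end-to-end in a few lines. A minimal, illustrative sketch using Python's built-in sqlite3 with toy data (table and column names follow the examples above; the data itself is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized (3NF) layout: four tables, so revenue-by-state needs three joins.
cur.executescript("""
CREATE TABLE states (state_id INTEGER PRIMARY KEY, state_name TEXT);
CREATE TABLE cities (city_id INTEGER PRIMARY KEY, city_name TEXT, state_id INTEGER);
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city_id INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     product_id INTEGER, amount REAL);

INSERT INTO states VALUES (1, 'Maharashtra'), (2, 'Karnataka');
INSERT INTO cities VALUES (10, 'Mumbai', 1), (11, 'Bangalore', 2);
INSERT INTO customers VALUES (501, 'Amit', 10), (502, 'Priya', 11);
INSERT INTO orders VALUES (1001, 501, 201, 1200), (1002, 502, 203, 850),
                          (1003, 501, 205, 2400);

-- Denormalized fact table: city and state copied onto every order row.
CREATE TABLE fact_orders AS
SELECT o.order_id, o.customer_id, o.product_id, o.amount,
       ci.city_name AS city, s.state_name AS state
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN cities ci ON c.city_id = ci.city_id
JOIN states s ON ci.state_id = s.state_id;
""")

normalized = cur.execute("""
    SELECT s.state_name, SUM(o.amount) FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN cities ci ON c.city_id = ci.city_id
    JOIN states s ON ci.state_id = s.state_id
    GROUP BY s.state_name ORDER BY 1
""").fetchall()

denormalized = cur.execute(
    "SELECT state, SUM(amount) FROM fact_orders GROUP BY state ORDER BY 1"
).fetchall()

# Same totals either way; the denormalized query just skips three joins.
assert normalized == denormalized
print(denormalized)
```

On a few rows the difference is invisible; at billions of rows, skipping the join chain is what makes the denormalized form fast.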
Fact Tables vs Dimension Tables
Fact Tables — Measurable Events
Fact tables store quantitative, measurable events (transactions, clicks, orders).
Characteristics:
- Many rows (millions to billions)
- Numeric measures (amount, quantity, duration)
- Foreign keys to dimension tables
- Additive facts (can be summed: revenue, quantity)
Example — Flipkart's fact_orders:
| order_id | customer_id | product_id | date_id | city_id | amount | quantity |
|----------|-------------|------------|----------|---------|--------|----------|
| 1001 | 501 | 201 | 20260301 | 10 | 1200 | 2 |
| 1002 | 502 | 203 | 20260301 | 11 | 850 | 1 |
| 1003 | 501 | 205 | 20260302 | 10 | 2400 | 1 |
Measures: amount, quantity
Foreign keys: customer_id, product_id, date_id, city_id
Dimension Tables — Descriptive Attributes
Dimension tables store descriptive, categorical attributes about who, what, where, when.
Characteristics:
- Fewer rows (thousands to millions)
- Text/categorical columns (names, categories, dates)
- Primary key referenced by fact tables
- Slowly changing (updated infrequently)
Example — dim_customers:
| customer_id | name | email | city | state | tier |
|-------------|------|-------|------|-------|------|
| 501 | Amit Sharma | amit@example.com | Mumbai | Maharashtra | Gold |
| 502 | Priya Singh | priya@example.com | Bangalore | Karnataka | Silver |
Example — dim_products:
| product_id | product_name | category | brand | price |
|------------|--------------|----------|-------|-------|
| 201 | iPhone 15 | Electronics | Apple | 79999 |
| 203 | Nike Shoes | Fashion | Nike | 4500 |
| 205 | MacBook Pro | Electronics | Apple | 199999 |
Example — dim_date:
| date_id | date | day_of_week | month | quarter | year | is_holiday |
|----------|------------|-------------|-------|---------|------|------------|
| 20260301 | 2026-03-01 | Sunday | March | Q1 | 2026 | false |
| 20260302 | 2026-03-02 | Monday | March | Q1 | 2026 | false |
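A date dimension like the one above is usually generated programmatically rather than loaded from source systems. A minimal sketch (the is_holiday flag is omitted; a real pipeline would join a holiday calendar, which is an assumption outside this example):

```python
from datetime import date, timedelta

def build_dim_date(start: date, end: date) -> list[dict]:
    """Generate one dim_date row per calendar day between start and end."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date_id": int(d.strftime("%Y%m%d")),   # smart key, e.g. 20260301
            "date": d.isoformat(),
            "day_of_week": d.strftime("%A"),
            "month": d.strftime("%B"),
            "quarter": f"Q{(d.month - 1) // 3 + 1}",
            "year": d.year,
        })
        d += timedelta(days=1)
    return rows

rows = build_dim_date(date(2026, 3, 1), date(2026, 3, 2))
print(rows[0])
```

Because the table is tiny (a few decades is under 20K rows), it is typically generated once and rarely touched again.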
Kimball's Golden Rule: If it's a number you want to analyze (sum, average, count), it's a fact. If it's a label or category you want to group/filter by, it's a dimension.
Star Schema — Denormalized for Speed
Star schema is the most common dimensional model. It has:
- One central fact table (the "center of the star")
- Multiple dimension tables (the "points of the star")
- Denormalized dimensions (all attributes in one table, no sub-dimensions)
Structure
```
              dim_date
                 |
dim_customer -- fact_orders -- dim_product
                 |
              dim_city
```
Example — E-commerce Star Schema
Fact Table: fact_orders
| order_id | customer_id | product_id | date_id | city_id | amount | quantity |
|----------|-------------|------------|----------|---------|--------|----------|
| 1001 | 501 | 201 | 20260301 | 10 | 1200 | 2 |
| 1002 | 502 | 203 | 20260301 | 11 | 850 | 1 |
Dimension Tables:
dim_customers (denormalized — city, state in one table):
| customer_id | name | city | state | tier |
|-------------|------|------|-------|------|
| 501 | Amit Sharma | Mumbai | Maharashtra | Gold |
| 502 | Priya Singh | Bangalore | Karnataka | Silver |
dim_products (denormalized — brand, category in one table):
| product_id | product_name | category | brand | price |
|------------|--------------|----------|-------|-------|
| 201 | iPhone 15 | Electronics | Apple | 79999 |
| 203 | Nike Shoes | Fashion | Nike | 4500 |
dim_date:
| date_id | date | month | quarter | year |
|----------|------------|-------|---------|------|
| 20260301 | 2026-03-01 | March | Q1 | 2026 |
dim_cities:
| city_id | city | state | region |
|---------|------|-------|--------|
| 10 | Mumbai | Maharashtra | West |
| 11 | Bangalore | Karnataka | South |
Querying the Star Schema
```sql
-- Revenue by product category and city in Q1 2026
SELECT
    p.category,
    c.city,
    SUM(f.amount) AS total_revenue,
    COUNT(f.order_id) AS order_count
FROM fact_orders f
JOIN dim_products p ON f.product_id = p.product_id
JOIN dim_cities c ON f.city_id = c.city_id
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.quarter = 'Q1' AND d.year = 2026
GROUP BY p.category, c.city
ORDER BY total_revenue DESC;
```

Performance:
- Only 4 tables joined (vs 10+ in normalized schema)
- Dimension tables are small (cached in memory)
- Fact table is large but scanned once
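The star-schema query can be exercised end-to-end against the sample rows above. A minimal sketch using Python's built-in sqlite3 (in-memory database, toy data, illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_orders (order_id INT, customer_id INT, product_id INT,
                          date_id INT, city_id INT, amount INT, quantity INT);
CREATE TABLE dim_products (product_id INT PRIMARY KEY, product_name TEXT,
                           category TEXT, brand TEXT, price INT);
CREATE TABLE dim_cities (city_id INT PRIMARY KEY, city TEXT, state TEXT, region TEXT);
CREATE TABLE dim_date (date_id INT PRIMARY KEY, date TEXT, month TEXT,
                       quarter TEXT, year INT);

INSERT INTO fact_orders VALUES (1001, 501, 201, 20260301, 10, 1200, 2),
                               (1002, 502, 203, 20260301, 11, 850, 1);
INSERT INTO dim_products VALUES (201, 'iPhone 15', 'Electronics', 'Apple', 79999),
                                (203, 'Nike Shoes', 'Fashion', 'Nike', 4500);
INSERT INTO dim_cities VALUES (10, 'Mumbai', 'Maharashtra', 'West'),
                              (11, 'Bangalore', 'Karnataka', 'South');
INSERT INTO dim_date VALUES (20260301, '2026-03-01', 'March', 'Q1', 2026);
""")

# The query from this section: every join goes straight from the fact
# table to a flat dimension -- no sub-dimension hops.
result = conn.execute("""
    SELECT p.category, c.city, SUM(f.amount) AS total_revenue,
           COUNT(f.order_id) AS order_count
    FROM fact_orders f
    JOIN dim_products p ON f.product_id = p.product_id
    JOIN dim_cities c ON f.city_id = c.city_id
    JOIN dim_date d ON f.date_id = d.date_id
    WHERE d.quarter = 'Q1' AND d.year = 2026
    GROUP BY p.category, c.city
    ORDER BY total_revenue DESC
""").fetchall()
print(result)  # [('Electronics', 'Mumbai', 1200, 1), ('Fashion', 'Bangalore', 850, 1)]
```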
Benefits of Star Schema
- Fast queries (fewer joins)
- Simple to understand (business users can navigate easily)
- Predictable performance (query patterns are similar)
- BI tool friendly (Tableau, Power BI work best with star schemas)
Drawbacks
- Storage overhead (duplicated dimension data)
- Update complexity (changing "Nike" to "Nike Inc" means rewriting the value everywhere it was copied: every Nike row in dim_products, or millions of fact rows if the attribute was denormalized into the fact table)
Zomato's data warehouse uses a star schema:
- Fact table: fact_orders (order_id, restaurant_id, customer_id, amount, delivery_time)
- Dimensions: dim_restaurants (name, city, cuisine), dim_customers (name, city, tier), dim_date
Dashboards query this schema to show "revenue by cuisine by city" with minimal joins.
Snowflake Schema — Normalized for Storage
Snowflake schema is like a star schema, but dimensions are normalized into sub-dimensions (hierarchies).
Structure
```
fact_orders
 ├─ dim_products
 │   ├─ dim_brands
 │   └─ dim_categories
 ├─ dim_customers
 │   ├─ dim_cities
 │   │   └─ dim_states
 │   └─ dim_tiers
 └─ dim_date
```
Key difference: Dimension tables are split into normalized hierarchies.
Example — Snowflake Schema
Fact Table: fact_orders (same as star):
| order_id | customer_id | product_id | date_id | amount | quantity |
|----------|-------------|------------|----------|--------|----------|
| 1001 | 501 | 201 | 20260301 | 1200 | 2 |
Normalized Dimension Tables:
dim_products (normalized — references dim_brand, dim_category):
| product_id | product_name | brand_id | category_id | price |
|------------|--------------|----------|-------------|-------|
| 201 | iPhone 15 | 1 | 10 | 79999 |
| 203 | Nike Shoes | 5 | 20 | 4500 |
dim_brands (sub-dimension):
| brand_id | brand_name |
|----------|------------|
| 1 | Apple |
| 5 | Nike |
dim_categories (sub-dimension):
| category_id | category_name |
|-------------|---------------|
| 10 | Electronics |
| 20 | Fashion |
dim_customers (normalized — references dim_cities):
| customer_id | name | email | city_id | tier_id |
|-------------|------|-------|---------|---------|
| 501 | Amit Sharma | amit@example.com | 10 | 1 |
| 502 | Priya Singh | priya@example.com | 11 | 2 |
dim_cities (sub-dimension):
| city_id | city | state_id |
|---------|------|----------|
| 10 | Mumbai | 1 |
| 11 | Bangalore | 2 |
dim_states (sub-dimension):
| state_id | state | region |
|----------|-------|--------|
| 1 | Maharashtra | West |
| 2 | Karnataka | South |
Querying the Snowflake Schema
```sql
-- Same query as the star schema, but with MORE joins
SELECT
    cat.category_name,
    ci.city,
    SUM(f.amount) AS total_revenue
FROM fact_orders f
JOIN dim_products p ON f.product_id = p.product_id
JOIN dim_categories cat ON p.category_id = cat.category_id  -- extra join
JOIN dim_customers cu ON f.customer_id = cu.customer_id
JOIN dim_cities ci ON cu.city_id = ci.city_id               -- extra join
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.quarter = 'Q1' AND d.year = 2026
GROUP BY cat.category_name, ci.city;
```

More joins mean slower queries (but less storage duplication).
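The snowflake version of the query can be run the same way. A minimal sketch using Python's built-in sqlite3 with the sample rows from this section (toy data; a second fact row and an ORDER BY are added here only to make the output deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_orders (order_id INT, customer_id INT, product_id INT,
                          date_id INT, amount INT, quantity INT);
CREATE TABLE dim_products (product_id INT PRIMARY KEY, product_name TEXT,
                           brand_id INT, category_id INT, price INT);
CREATE TABLE dim_categories (category_id INT PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_customers (customer_id INT PRIMARY KEY, name TEXT,
                            city_id INT, tier_id INT);
CREATE TABLE dim_cities (city_id INT PRIMARY KEY, city TEXT, state_id INT);
CREATE TABLE dim_date (date_id INT PRIMARY KEY, date TEXT, quarter TEXT, year INT);

INSERT INTO fact_orders VALUES (1001, 501, 201, 20260301, 1200, 2),
                               (1002, 502, 203, 20260301, 850, 1);
INSERT INTO dim_products VALUES (201, 'iPhone 15', 1, 10, 79999),
                                (203, 'Nike Shoes', 5, 20, 4500);
INSERT INTO dim_categories VALUES (10, 'Electronics'), (20, 'Fashion');
INSERT INTO dim_customers VALUES (501, 'Amit Sharma', 10, 1),
                                 (502, 'Priya Singh', 11, 2);
INSERT INTO dim_cities VALUES (10, 'Mumbai', 1), (11, 'Bangalore', 2);
INSERT INTO dim_date VALUES (20260301, '2026-03-01', 'Q1', 2026);
""")

# Same answer as the star schema, but reaching category and city now
# requires hopping through sub-dimensions (two extra joins).
result = conn.execute("""
    SELECT cat.category_name, ci.city, SUM(f.amount) AS total_revenue
    FROM fact_orders f
    JOIN dim_products p ON f.product_id = p.product_id
    JOIN dim_categories cat ON p.category_id = cat.category_id  -- extra join
    JOIN dim_customers cu ON f.customer_id = cu.customer_id
    JOIN dim_cities ci ON cu.city_id = ci.city_id               -- extra join
    JOIN dim_date d ON f.date_id = d.date_id
    WHERE d.quarter = 'Q1' AND d.year = 2026
    GROUP BY cat.category_name, ci.city
    ORDER BY total_revenue DESC
""").fetchall()
print(result)  # [('Electronics', 'Mumbai', 1200), ('Fashion', 'Bangalore', 850)]
```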
Benefits of Snowflake Schema
- Less storage (no duplication of brand names, state names)
- Easier updates (changing "Nike" to "Nike Inc" updates one row in dim_brands)
Drawbacks
- Slower queries (more joins)
- Complex to navigate (business users struggle with 10+ table joins)
- BI tools struggle (many BI tools expect star schemas)
Modern Best Practice: Use star schema by default. Storage is cheap (cloud warehouses like BigQuery charge $0.02/GB/month). Query speed is more valuable than saving a few GB.
Star vs Snowflake — When to Use Each
| Aspect | Star Schema | Snowflake Schema |
|--------|-------------|------------------|
| Dimension structure | Denormalized (flat) | Normalized (hierarchies) |
| Number of joins | Fewer (3-5 typically) | More (10+ possible) |
| Query performance | Faster | Slower |
| Storage space | More (duplicated data) | Less (no duplication) |
| Update complexity | Higher (update many rows) | Lower (update one row) |
| BI tool compatibility | Excellent | Good |
| Business user friendliness | High (simple joins) | Low (complex joins) |
| Use case | Most data warehouses | Legacy systems, extreme storage constraints |
Decision Guide
Use Star Schema when:
- Query performance is critical
- Using cloud warehouses (storage is cheap)
- Business users query the warehouse directly
- Using BI tools (Tableau, Power BI, Looker)
- This is 95% of modern data warehouses
Use Snowflake Schema when:
- Storage is extremely expensive (legacy on-prem systems)
- Dimension updates are frequent and must be efficient
- Data governance requires strict normalization
- This is rare in modern cloud warehouses
Hybrid Approach — "Snowflake-ish Star"
Some warehouses use a hybrid:
- Denormalize high-cardinality dimensions (products, customers) for speed
- Normalize low-cardinality lookup tables (countries, currencies) to save space
Example — Flipkart's hybrid model:
- Denormalized (star): dim_products includes brand_name and category_name directly (queried frequently)
- Normalized (snowflake): dim_currencies split out from transactions (only 5 currencies, rarely queried)
Swiggy uses a pure star schema in BigQuery:
- fact_orders (200M rows daily)
- dim_restaurants (50K rows, city and cuisine denormalized)
- dim_customers (10M rows, city denormalized)
Storage cost: $100/month. Query speed: 10x faster than snowflake schema. Clear trade-off in favor of speed.
Slowly Changing Dimensions (SCD)
Problem: What happens when dimension attributes change?
Example: A customer moves from Mumbai to Bangalore. If we update dim_customers.city from "Mumbai" to "Bangalore", historical orders show incorrect city retroactively.
Solution: Slowly Changing Dimension (SCD) strategies.
SCD Type 1 — Overwrite (No History)
Strategy: Overwrite the old value. Lose history.
Before:

| customer_id | name | city |
|-------------|------|------|
| 501 | Amit | Mumbai |
After customer moves:

| customer_id | name | city |
|-------------|------|------|
| 501 | Amit | Bangalore |
Effect: All historical orders now show city = "Bangalore" (even orders placed when he lived in Mumbai).
Use case: Correcting data errors, when history doesn't matter.
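The history-loss effect is easy to demonstrate. A minimal sketch using Python's built-in sqlite3 (toy data, illustrative only): one UPDATE overwrites the attribute, and every historical order silently inherits the new city.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customers (customer_id INT PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE fact_orders (order_id INT, customer_id INT, amount INT, order_date TEXT);
INSERT INTO dim_customers VALUES (501, 'Amit', 'Mumbai');
INSERT INTO fact_orders VALUES (1001, 501, 1200, '2025-06-01');  -- placed while in Mumbai
""")

# SCD Type 1: overwrite in place, no history kept.
conn.execute("UPDATE dim_customers SET city = 'Bangalore' WHERE customer_id = 501")

row = conn.execute("""
    SELECT f.order_id, c.city
    FROM fact_orders f
    JOIN dim_customers c ON f.customer_id = c.customer_id
""").fetchone()
print(row)  # (1001, 'Bangalore') -- the 2025 order now reports the wrong city
```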
SCD Type 2 — Add New Row (Preserve History)
Strategy: Add a new row with a new surrogate key. Keep old row.
Before:

| customer_key | customer_id | name | city | valid_from | valid_to | is_current |
|--------------|-------------|------|------|------------|------------|------------|
| 1 | 501 | Amit | Mumbai | 2024-01-01 | 9999-12-31 | true |
After customer moves:

| customer_key | customer_id | name | city | valid_from | valid_to | is_current |
|--------------|-------------|------|------|------------|------------|------------|
| 1 | 501 | Amit | Mumbai | 2024-01-01 | 2026-03-22 | false |
| 2 | 501 | Amit | Bangalore | 2026-03-23 | 9999-12-31 | true |
Effect: Orders before 2026-03-22 link to customer_key = 1 (Mumbai). Orders after link to customer_key = 2 (Bangalore).
Use case: When historical accuracy matters (most warehouses use this).
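A minimal SCD Type 2 update can be sketched as close-then-insert. This sketch uses Python's built-in sqlite3 with an autoincrement surrogate key, and assumes a single changing attribute and no concurrent writers (a production merge would handle batches and late-arriving data):

```python
import sqlite3
from datetime import date, timedelta

END_OF_TIME = "9999-12-31"

def scd2_update(conn, customer_id, new_city, change_date):
    """Close the current row, then insert a new version (SCD Type 2)."""
    prev_day = (date.fromisoformat(change_date) - timedelta(days=1)).isoformat()
    # 1. Close out the current version.
    conn.execute("""
        UPDATE dim_customers SET valid_to = ?, is_current = 0
        WHERE customer_id = ? AND is_current = 1
    """, (prev_day, customer_id))
    # 2. Insert the new version, copying unchanged attributes from the closed row.
    conn.execute("""
        INSERT INTO dim_customers (customer_id, name, city, valid_from, valid_to, is_current)
        SELECT customer_id, name, ?, ?, ?, 1
        FROM dim_customers
        WHERE customer_id = ? AND valid_to = ?
    """, (new_city, change_date, END_OF_TIME, customer_id, prev_day))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
CREATE TABLE dim_customers (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id INT, name TEXT, city TEXT,
    valid_from TEXT, valid_to TEXT, is_current INT
);
INSERT INTO dim_customers (customer_id, name, city, valid_from, valid_to, is_current)
VALUES (501, 'Amit', 'Mumbai', '2024-01-01', '{END_OF_TIME}', 1);
""")

scd2_update(conn, 501, 'Bangalore', '2026-03-23')
rows = conn.execute("""
    SELECT customer_key, city, valid_from, valid_to, is_current
    FROM dim_customers ORDER BY customer_key
""").fetchall()
for r in rows:
    print(r)
```

The result matches the "after" table above: the Mumbai row is closed at 2026-03-22, and a new surrogate key carries the Bangalore version.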
SCD Type 3 — Add New Column (Limited History)
Strategy: Add a column for previous value.
| customer_id | name | current_city | previous_city |
|-------------|------|--------------|---------------|
| 501 | Amit | Bangalore | Mumbai |
Use case: Rare — only tracks one previous value.
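In SQL, SCD Type 3 is a single UPDATE that shifts the current value into the previous-value column before overwriting it. A minimal sketch using Python's built-in sqlite3 (toy data; standard SQL evaluates the SET expressions against the pre-update row values, so current_city lands in previous_city):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customers
    (customer_id INT, name TEXT, current_city TEXT, previous_city TEXT)""")
conn.execute("INSERT INTO dim_customers VALUES (501, 'Amit', 'Mumbai', NULL)")

# On a move: shift current -> previous, then overwrite current.
conn.execute("""
    UPDATE dim_customers
    SET previous_city = current_city, current_city = 'Bangalore'
    WHERE customer_id = 501
""")
row = conn.execute(
    "SELECT current_city, previous_city FROM dim_customers"
).fetchone()
print(row)  # ('Bangalore', 'Mumbai')
```

A second move would overwrite 'Mumbai', which is exactly the one-step history limit described above.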
Modern Best Practice: Use SCD Type 2 for dimensions where history matters (customers, products). Use SCD Type 1 for corrections (fixing typos) or dimensions where history doesn't matter (currency exchange rates updated daily).