What is Dimensional Modeling?
Dimensional modeling is a design technique for structuring data warehouses, optimized for analytical queries rather than transactions. It organizes data into fact tables (quantitative measures) and dimension tables (descriptive attributes).
Why Dimensional Modeling?
OLTP databases (MySQL, PostgreSQL) use normalized schemas (3NF) to eliminate redundancy and speed up writes:
```
orders (order_id, customer_id, product_id, amount)
customers (customer_id, name, city_id)
cities (city_id, city_name, state_id)
states (state_id, state_name)
```
Problem for analytics: To find "revenue by state", you must join 4 tables:
```sql
SELECT s.state_name, SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN cities ci ON c.city_id = ci.city_id
JOIN states s ON ci.state_id = s.state_id
GROUP BY s.state_name;
```

This is slow for billions of rows.
Dimensional modeling denormalizes data to reduce joins:
```
fact_orders (order_id, customer_id, product_id, amount, city, state)
dim_customers (customer_id, name, city, state, tier)
```
Same query, simpler:
```sql
SELECT state, SUM(amount)
FROM fact_orders
GROUP BY state;
```

Trade-off: Storage increases (the state name is duplicated across millions of order rows), but queries are 10-100x faster.
Normalized schemas are like filing documents in folders by category (space-efficient but slow to find). Dimensional modeling is like keeping copies of frequently accessed documents on your desk (uses more space but instant access).
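The trade-off can be demonstrated end-to-end in a few lines. A minimal, illustrative sketch using Python's built-in sqlite3 with toy data (table and column names follow the examples above; the data itself is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized (3NF) layout: four tables, so revenue-by-state needs three joins.
cur.executescript("""
CREATE TABLE states (state_id INTEGER PRIMARY KEY, state_name TEXT);
CREATE TABLE cities (city_id INTEGER PRIMARY KEY, city_name TEXT, state_id INTEGER);
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city_id INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     product_id INTEGER, amount REAL);

INSERT INTO states VALUES (1, 'Maharashtra'), (2, 'Karnataka');
INSERT INTO cities VALUES (10, 'Mumbai', 1), (11, 'Bangalore', 2);
INSERT INTO customers VALUES (501, 'Amit', 10), (502, 'Priya', 11);
INSERT INTO orders VALUES (1001, 501, 201, 1200), (1002, 502, 203, 850),
                          (1003, 501, 205, 2400);

-- Denormalized fact table: city and state copied onto every order row.
CREATE TABLE fact_orders AS
SELECT o.order_id, o.customer_id, o.product_id, o.amount,
       ci.city_name AS city, s.state_name AS state
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN cities ci ON c.city_id = ci.city_id
JOIN states s ON ci.state_id = s.state_id;
""")

normalized = cur.execute("""
    SELECT s.state_name, SUM(o.amount) FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN cities ci ON c.city_id = ci.city_id
    JOIN states s ON ci.state_id = s.state_id
    GROUP BY s.state_name ORDER BY 1
""").fetchall()

denormalized = cur.execute(
    "SELECT state, SUM(amount) FROM fact_orders GROUP BY state ORDER BY 1"
).fetchall()

# Same totals either way; the denormalized query just skips three joins.
assert normalized == denormalized
print(denormalized)
```

On a few rows the difference is invisible; at billions of rows, skipping the join chain is what makes the denormalized form fast.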
Fact Tables vs Dimension Tables
Fact Tables — Measurable Events
Fact tables store quantitative, measurable events (transactions, clicks, orders).
Characteristics:
- Many rows (millions to billions)
- Numeric measures (amount, quantity, duration)
- Foreign keys to dimension tables
- Additive facts (can be summed: revenue, quantity)
Example — Flipkart's fact_orders:
| order_id | customer_id | product_id | date_id | city_id | amount | quantity |
|----------|-------------|------------|----------|---------|--------|----------|
| 1001 | 501 | 201 | 20260301 | 10 | 1200 | 2 |
| 1002 | 502 | 203 | 20260301 | 11 | 850 | 1 |
| 1003 | 501 | 205 | 20260302 | 10 | 2400 | 1 |
Measures: amount, quantity
Foreign keys: customer_id, product_id, date_id, city_id
Dimension Tables — Descriptive Attributes
Dimension tables store descriptive, categorical attributes about who, what, where, when.
Characteristics:
- Fewer rows (thousands to millions)
- Text/categorical columns (names, categories, dates)
- Primary key referenced by fact tables
- Slowly changing (updated infrequently)
Example — dim_customers:
| customer_id | name | email | city | state | tier |
|-------------|------|-------|------|-------|------|
| 501 | Amit Sharma | amit@example.com | Mumbai | Maharashtra | Gold |
| 502 | Priya Singh | priya@example.com | Bangalore | Karnataka | Silver |
Example — dim_products:
| product_id | product_name | category | brand | price |
|------------|--------------|----------|-------|-------|
| 201 | iPhone 15 | Electronics | Apple | 79999 |
| 203 | Nike Shoes | Fashion | Nike | 4500 |
| 205 | MacBook Pro | Electronics | Apple | 199999 |
Example — dim_date:
| date_id | date | day_of_week | month | quarter | year | is_holiday |
|----------|------------|-------------|-------|---------|------|------------|
| 20260301 | 2026-03-01 | Sunday | March | Q1 | 2026 | false |
| 20260302 | 2026-03-02 | Monday | March | Q1 | 2026 | false |
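A date dimension like the one above is usually generated programmatically rather than loaded from source systems. A minimal sketch (the is_holiday flag is omitted; a real pipeline would join a holiday calendar, which is an assumption outside this example):

```python
from datetime import date, timedelta

def build_dim_date(start: date, end: date) -> list[dict]:
    """Generate one dim_date row per calendar day between start and end."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date_id": int(d.strftime("%Y%m%d")),   # smart key, e.g. 20260301
            "date": d.isoformat(),
            "day_of_week": d.strftime("%A"),
            "month": d.strftime("%B"),
            "quarter": f"Q{(d.month - 1) // 3 + 1}",
            "year": d.year,
        })
        d += timedelta(days=1)
    return rows

rows = build_dim_date(date(2026, 3, 1), date(2026, 3, 2))
print(rows[0])
```

Because the table is tiny (a few decades is under 20K rows), it is typically generated once and rarely touched again.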
Kimball's Golden Rule: If it's a number you want to analyze (sum, average, count), it's a fact. If it's a label or category you want to group/filter by, it's a dimension.
Star Schema — Denormalized for Speed
Star schema is the most common dimensional model. It has:
- One central fact table (the "center of the star")
- Multiple dimension tables (the "points of the star")
- Denormalized dimensions (all attributes in one table, no sub-dimensions)
Structure
```
              dim_date
                 |
dim_customer -- fact_orders -- dim_product
                 |
              dim_city
```
Example — E-commerce Star Schema
Fact Table: fact_orders
| order_id | customer_id | product_id | date_id | city_id | amount | quantity |
|----------|-------------|------------|----------|---------|--------|----------|
| 1001 | 501 | 201 | 20260301 | 10 | 1200 | 2 |
| 1002 | 502 | 203 | 20260301 | 11 | 850 | 1 |
Dimension Tables:
dim_customers (denormalized — city, state in one table):
| customer_id | name | city | state | tier |
|-------------|------|------|-------|------|
| 501 | Amit Sharma | Mumbai | Maharashtra | Gold |
| 502 | Priya Singh | Bangalore | Karnataka | Silver |
dim_products (denormalized — brand, category in one table):
| product_id | product_name | category | brand | price |
|------------|--------------|----------|-------|-------|
| 201 | iPhone 15 | Electronics | Apple | 79999 |
| 203 | Nike Shoes | Fashion | Nike | 4500 |
dim_date:
| date_id | date | month | quarter | year |
|----------|------------|-------|---------|------|
| 20260301 | 2026-03-01 | March | Q1 | 2026 |
dim_cities:
| city_id | city | state | region |
|---------|------|-------|--------|
| 10 | Mumbai | Maharashtra | West |
| 11 | Bangalore | Karnataka | South |
Querying the Star Schema
```sql
-- Revenue by product category and city in Q1 2026
SELECT
    p.category,
    c.city,
    SUM(f.amount) AS total_revenue,
    COUNT(f.order_id) AS order_count
FROM fact_orders f
JOIN dim_products p ON f.product_id = p.product_id
JOIN dim_cities c ON f.city_id = c.city_id
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.quarter = 'Q1' AND d.year = 2026
GROUP BY p.category, c.city
ORDER BY total_revenue DESC;
```

Performance:
- Only 4 tables joined (vs 10+ in normalized schema)
- Dimension tables are small (cached in memory)
- Fact table is large but scanned once
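The star-schema query can be exercised end-to-end against the sample rows above. A minimal sketch using Python's built-in sqlite3 (in-memory database, toy data, illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_orders (order_id INT, customer_id INT, product_id INT,
                          date_id INT, city_id INT, amount INT, quantity INT);
CREATE TABLE dim_products (product_id INT PRIMARY KEY, product_name TEXT,
                           category TEXT, brand TEXT, price INT);
CREATE TABLE dim_cities (city_id INT PRIMARY KEY, city TEXT, state TEXT, region TEXT);
CREATE TABLE dim_date (date_id INT PRIMARY KEY, date TEXT, month TEXT,
                       quarter TEXT, year INT);

INSERT INTO fact_orders VALUES (1001, 501, 201, 20260301, 10, 1200, 2),
                               (1002, 502, 203, 20260301, 11, 850, 1);
INSERT INTO dim_products VALUES (201, 'iPhone 15', 'Electronics', 'Apple', 79999),
                                (203, 'Nike Shoes', 'Fashion', 'Nike', 4500);
INSERT INTO dim_cities VALUES (10, 'Mumbai', 'Maharashtra', 'West'),
                              (11, 'Bangalore', 'Karnataka', 'South');
INSERT INTO dim_date VALUES (20260301, '2026-03-01', 'March', 'Q1', 2026);
""")

# The query from this section: every join goes straight from the fact
# table to a flat dimension -- no sub-dimension hops.
result = conn.execute("""
    SELECT p.category, c.city, SUM(f.amount) AS total_revenue,
           COUNT(f.order_id) AS order_count
    FROM fact_orders f
    JOIN dim_products p ON f.product_id = p.product_id
    JOIN dim_cities c ON f.city_id = c.city_id
    JOIN dim_date d ON f.date_id = d.date_id
    WHERE d.quarter = 'Q1' AND d.year = 2026
    GROUP BY p.category, c.city
    ORDER BY total_revenue DESC
""").fetchall()
print(result)  # [('Electronics', 'Mumbai', 1200, 1), ('Fashion', 'Bangalore', 850, 1)]
```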
Benefits of Star Schema
- Fast queries (fewer joins)
- Simple to understand (business users can navigate easily)
- Predictable performance (query patterns are similar)
- BI tool friendly (Tableau, Power BI work best with star schemas)
Drawbacks
- Storage overhead (duplicated dimension data)
- Update complexity (changing "Nike" to "Nike Inc" means rewriting the value everywhere it was copied: every Nike row in dim_products, or millions of fact rows if the attribute was denormalized into the fact table)
Zomato's data warehouse uses a star schema:
- Fact table: fact_orders (order_id, restaurant_id, customer_id, amount, delivery_time)
- Dimensions: dim_restaurants (name, city, cuisine), dim_customers (name, city, tier), dim_date
Dashboards query this schema to show "revenue by cuisine by city" with minimal joins.
Snowflake Schema — Normalized for Storage
Snowflake schema is like a star schema, but dimensions are normalized into sub-dimensions (hierarchies).
Structure
```
fact_orders
 ├─ dim_products
 │   ├─ dim_brands
 │   └─ dim_categories
 ├─ dim_customers
 │   ├─ dim_cities
 │   │   └─ dim_states
 │   └─ dim_tiers
 └─ dim_date
```
Key difference: Dimension tables are split into normalized hierarchies.
Example — Snowflake Schema
Fact Table: fact_orders (same as star):
| order_id | customer_id | product_id | date_id | amount | quantity |
|----------|-------------|------------|----------|--------|----------|
| 1001 | 501 | 201 | 20260301 | 1200 | 2 |
Normalized Dimension Tables:
dim_products (normalized — references dim_brand, dim_category):
| product_id | product_name | brand_id | category_id | price |
|------------|--------------|----------|-------------|-------|
| 201 | iPhone 15 | 1 | 10 | 79999 |
| 203 | Nike Shoes | 5 | 20 | 4500 |
dim_brands (sub-dimension):
| brand_id | brand_name |
|----------|------------|
| 1 | Apple |
| 5 | Nike |
dim_categories (sub-dimension):
| category_id | category_name |
|-------------|---------------|
| 10 | Electronics |
| 20 | Fashion |
dim_customers (normalized — references dim_cities):
| customer_id | name | email | city_id | tier_id |
|-------------|------|-------|---------|---------|
| 501 | Amit Sharma | amit@example.com | 10 | 1 |
| 502 | Priya Singh | priya@example.com | 11 | 2 |
dim_cities (sub-dimension):
| city_id | city | state_id |
|---------|------|----------|
| 10 | Mumbai | 1 |
| 11 | Bangalore | 2 |
dim_states (sub-dimension):
| state_id | state | region |
|----------|-------|--------|
| 1 | Maharashtra | West |
| 2 | Karnataka | South |
Querying the Snowflake Schema
```sql
-- Same query as the star schema, but with MORE joins
SELECT
    cat.category_name,
    ci.city,
    SUM(f.amount) AS total_revenue
FROM fact_orders f
JOIN dim_products p ON f.product_id = p.product_id
JOIN dim_categories cat ON p.category_id = cat.category_id  -- extra join
JOIN dim_customers cu ON f.customer_id = cu.customer_id
JOIN dim_cities ci ON cu.city_id = ci.city_id               -- extra join
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.quarter = 'Q1' AND d.year = 2026
GROUP BY cat.category_name, ci.city;
```

More joins mean slower queries (but less storage duplication).
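The snowflake version of the query can be run the same way. A minimal sketch using Python's built-in sqlite3 with the sample rows from this section (toy data; a second fact row and an ORDER BY are added here only to make the output deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_orders (order_id INT, customer_id INT, product_id INT,
                          date_id INT, amount INT, quantity INT);
CREATE TABLE dim_products (product_id INT PRIMARY KEY, product_name TEXT,
                           brand_id INT, category_id INT, price INT);
CREATE TABLE dim_categories (category_id INT PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_customers (customer_id INT PRIMARY KEY, name TEXT,
                            city_id INT, tier_id INT);
CREATE TABLE dim_cities (city_id INT PRIMARY KEY, city TEXT, state_id INT);
CREATE TABLE dim_date (date_id INT PRIMARY KEY, date TEXT, quarter TEXT, year INT);

INSERT INTO fact_orders VALUES (1001, 501, 201, 20260301, 1200, 2),
                               (1002, 502, 203, 20260301, 850, 1);
INSERT INTO dim_products VALUES (201, 'iPhone 15', 1, 10, 79999),
                                (203, 'Nike Shoes', 5, 20, 4500);
INSERT INTO dim_categories VALUES (10, 'Electronics'), (20, 'Fashion');
INSERT INTO dim_customers VALUES (501, 'Amit Sharma', 10, 1),
                                 (502, 'Priya Singh', 11, 2);
INSERT INTO dim_cities VALUES (10, 'Mumbai', 1), (11, 'Bangalore', 2);
INSERT INTO dim_date VALUES (20260301, '2026-03-01', 'Q1', 2026);
""")

# Same answer as the star schema, but reaching category and city now
# requires hopping through sub-dimensions (two extra joins).
result = conn.execute("""
    SELECT cat.category_name, ci.city, SUM(f.amount) AS total_revenue
    FROM fact_orders f
    JOIN dim_products p ON f.product_id = p.product_id
    JOIN dim_categories cat ON p.category_id = cat.category_id  -- extra join
    JOIN dim_customers cu ON f.customer_id = cu.customer_id
    JOIN dim_cities ci ON cu.city_id = ci.city_id               -- extra join
    JOIN dim_date d ON f.date_id = d.date_id
    WHERE d.quarter = 'Q1' AND d.year = 2026
    GROUP BY cat.category_name, ci.city
    ORDER BY total_revenue DESC
""").fetchall()
print(result)  # [('Electronics', 'Mumbai', 1200), ('Fashion', 'Bangalore', 850)]
```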
Benefits of Snowflake Schema
- Less storage (no duplication of brand names, state names)
- Easier updates (changing "Nike" to "Nike Inc" updates one row in dim_brands)
Drawbacks
- Slower queries (more joins)
- Complex to navigate (business users struggle with 10+ table joins)
- BI tools struggle (many BI tools expect star schemas)
Modern Best Practice: Use star schema by default. Storage is cheap (cloud warehouses like BigQuery charge $0.02/GB/month). Query speed is more valuable than saving a few GB.
Star vs Snowflake — When to Use Each
| Aspect | Star Schema | Snowflake Schema |
|--------|-------------|------------------|
| Dimension structure | Denormalized (flat) | Normalized (hierarchies) |
| Number of joins | Fewer (3-5 typically) | More (10+ possible) |
| Query performance | Faster | Slower |
| Storage space | More (duplicated data) | Less (no duplication) |
| Update complexity | Higher (update many rows) | Lower (update one row) |
| BI tool compatibility | Excellent | Good |
| Business user friendliness | High (simple joins) | Low (complex joins) |
| Use case | Most data warehouses | Legacy systems, extreme storage constraints |
Decision Guide
Use Star Schema when:
- Query performance is critical
- Using cloud warehouses (storage is cheap)
- Business users query the warehouse directly
- Using BI tools (Tableau, Power BI, Looker)
- This is 95% of modern data warehouses
Use Snowflake Schema when:
- Storage is extremely expensive (legacy on-prem systems)
- Dimension updates are frequent and must be efficient
- Data governance requires strict normalization
- This is rare in modern cloud warehouses
Hybrid Approach — "Snowflake-ish Star"
Some warehouses use a hybrid:
- Denormalize high-cardinality dimensions (products, customers) for speed
- Normalize low-cardinality lookup tables (countries, currencies) to save space
Example — Flipkart's hybrid model:
- Denormalized (star): dim_products includes brand_name and category_name directly (queried frequently)
- Normalized (snowflake): dim_currencies split out from transactions (only 5 currencies, rarely queried)
Swiggy uses a pure star schema in BigQuery:
- fact_orders (200M rows daily)
- dim_restaurants (50K rows, city and cuisine denormalized)
- dim_customers (10M rows, city denormalized)
Storage cost: $100/month. Query speed: 10x faster than snowflake schema. Clear trade-off in favor of speed.
Slowly Changing Dimensions (SCD)
Problem: What happens when dimension attributes change?
Example: A customer moves from Mumbai to Bangalore. If we update dim_customers.city from "Mumbai" to "Bangalore", historical orders show incorrect city retroactively.
Solution: Slowly Changing Dimension (SCD) strategies.
SCD Type 1 — Overwrite (No History)
Strategy: Overwrite the old value. Lose history.
Before:

| customer_id | name | city |
|-------------|------|------|
| 501 | Amit | Mumbai |
After customer moves:

| customer_id | name | city |
|-------------|------|------|
| 501 | Amit | Bangalore |
Effect: All historical orders now show city = "Bangalore" (even orders placed when he lived in Mumbai).
Use case: Correcting data errors, when history doesn't matter.
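The history-loss effect is easy to demonstrate. A minimal sketch using Python's built-in sqlite3 (toy data, illustrative only): one UPDATE overwrites the attribute, and every historical order silently inherits the new city.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customers (customer_id INT PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE fact_orders (order_id INT, customer_id INT, amount INT, order_date TEXT);
INSERT INTO dim_customers VALUES (501, 'Amit', 'Mumbai');
INSERT INTO fact_orders VALUES (1001, 501, 1200, '2025-06-01');  -- placed while in Mumbai
""")

# SCD Type 1: overwrite in place, no history kept.
conn.execute("UPDATE dim_customers SET city = 'Bangalore' WHERE customer_id = 501")

row = conn.execute("""
    SELECT f.order_id, c.city
    FROM fact_orders f
    JOIN dim_customers c ON f.customer_id = c.customer_id
""").fetchone()
print(row)  # (1001, 'Bangalore') -- the 2025 order now reports the wrong city
```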
SCD Type 2 — Add New Row (Preserve History)
Strategy: Add a new row with a new surrogate key. Keep old row.
Before:

| customer_key | customer_id | name | city | valid_from | valid_to | is_current |
|--------------|-------------|------|------|------------|------------|------------|
| 1 | 501 | Amit | Mumbai | 2024-01-01 | 9999-12-31 | true |
After customer moves:

| customer_key | customer_id | name | city | valid_from | valid_to | is_current |
|--------------|-------------|------|------|------------|------------|------------|
| 1 | 501 | Amit | Mumbai | 2024-01-01 | 2026-03-22 | false |
| 2 | 501 | Amit | Bangalore | 2026-03-23 | 9999-12-31 | true |
Effect: Orders before 2026-03-22 link to customer_key = 1 (Mumbai). Orders after link to customer_key = 2 (Bangalore).
Use case: When historical accuracy matters (most warehouses use this).
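A minimal SCD Type 2 update can be sketched as close-then-insert. This sketch uses Python's built-in sqlite3 with an autoincrement surrogate key, and assumes a single changing attribute and no concurrent writers (a production merge would handle batches and late-arriving data):

```python
import sqlite3
from datetime import date, timedelta

END_OF_TIME = "9999-12-31"

def scd2_update(conn, customer_id, new_city, change_date):
    """Close the current row, then insert a new version (SCD Type 2)."""
    prev_day = (date.fromisoformat(change_date) - timedelta(days=1)).isoformat()
    # 1. Close out the current version.
    conn.execute("""
        UPDATE dim_customers SET valid_to = ?, is_current = 0
        WHERE customer_id = ? AND is_current = 1
    """, (prev_day, customer_id))
    # 2. Insert the new version, copying unchanged attributes from the closed row.
    conn.execute("""
        INSERT INTO dim_customers (customer_id, name, city, valid_from, valid_to, is_current)
        SELECT customer_id, name, ?, ?, ?, 1
        FROM dim_customers
        WHERE customer_id = ? AND valid_to = ?
    """, (new_city, change_date, END_OF_TIME, customer_id, prev_day))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
CREATE TABLE dim_customers (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id INT, name TEXT, city TEXT,
    valid_from TEXT, valid_to TEXT, is_current INT
);
INSERT INTO dim_customers (customer_id, name, city, valid_from, valid_to, is_current)
VALUES (501, 'Amit', 'Mumbai', '2024-01-01', '{END_OF_TIME}', 1);
""")

scd2_update(conn, 501, 'Bangalore', '2026-03-23')
rows = conn.execute("""
    SELECT customer_key, city, valid_from, valid_to, is_current
    FROM dim_customers ORDER BY customer_key
""").fetchall()
for r in rows:
    print(r)
```

The result matches the "after" table above: the Mumbai row is closed at 2026-03-22, and a new surrogate key carries the Bangalore version.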
SCD Type 3 — Add New Column (Limited History)
Strategy: Add a column for previous value.
| customer_id | name | current_city | previous_city |
|-------------|------|--------------|---------------|
| 501 | Amit | Bangalore | Mumbai |
Use case: Rare — only tracks one previous value.
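In SQL, SCD Type 3 is a single UPDATE that shifts the current value into the previous-value column before overwriting it. A minimal sketch using Python's built-in sqlite3 (toy data; standard SQL evaluates the SET expressions against the pre-update row values, so current_city lands in previous_city):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customers
    (customer_id INT, name TEXT, current_city TEXT, previous_city TEXT)""")
conn.execute("INSERT INTO dim_customers VALUES (501, 'Amit', 'Mumbai', NULL)")

# On a move: shift current -> previous, then overwrite current.
conn.execute("""
    UPDATE dim_customers
    SET previous_city = current_city, current_city = 'Bangalore'
    WHERE customer_id = 501
""")
row = conn.execute(
    "SELECT current_city, previous_city FROM dim_customers"
).fetchone()
print(row)  # ('Bangalore', 'Mumbai')
```

A second move would overwrite 'Mumbai', which is exactly the one-step history limit described above.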
Modern Best Practice: Use SCD Type 2 for dimensions where history matters (customers, products). Use SCD Type 1 for corrections (fixing typos) or dimensions where history doesn't matter (currency exchange rates updated daily).