
What is a Data Warehouse? Complete Guide for Analysts

A data warehouse is not just a bigger database. It's a system designed specifically for analysis: optimized to answer questions like "What were our sales trends over the last 5 years?" across billions of rows in seconds.

Beginner · 10 min · 7 quizzes

What is a Data Warehouse?

A data warehouse is a centralized repository designed for analytical queries on large volumes of historical data. Unlike operational databases (which handle day-to-day transactions), warehouses are optimized for reading and aggregating massive datasets.

Key Characteristics

| Property | Data Warehouse | Operational Database |
|----------|----------------|----------------------|
| Purpose | Analysis & reporting | Transactions (CRUD operations) |
| Optimized for | Large reads, aggregations | Fast writes, updates |
| Data scope | Historical, consolidated | Current, transactional |
| Query pattern | Complex, analytical (OLAP) | Simple, transactional (OLTP) |
| Update frequency | Batch (hourly/daily) | Real-time |
| Schema | Denormalized (star/snowflake) | Normalized (3NF) |

Why Data Warehouses Exist

Imagine Flipkart's operational MySQL database handling 10,000 orders per minute. Running a query like "What's our year-over-year revenue growth by category?" would:

  • Lock tables and slow down customer checkouts
  • Take hours to scan billions of rows
  • Compete with real-time transactions

Instead, Flipkart copies data to a data warehouse (like BigQuery or Redshift) where analysts can run massive queries without affecting the production system.

Think of it this way...

An operational database is like a busy restaurant kitchen: fast, real-time, handling hundreds of orders. A data warehouse is like the restaurant's financial office: you don't cook orders there, but you analyze sales trends, ingredient costs, and customer behavior over months or years.


OLAP vs OLTP: The Core Difference

OLTP (Online Transaction Processing)

Purpose: Handle real-time transactions (INSERT, UPDATE, DELETE)

Examples:

  • Swiggy's order placement system
  • PhonePe's payment processing
  • Amazon's inventory management

Characteristics:

  • Hundreds/thousands of small, fast queries per second
  • Writes are frequent
  • Data is normalized (multiple tables with foreign keys)
  • Row-oriented storage
```sql
-- OLTP queries (fast, single-record operations)
INSERT INTO orders (customer_id, restaurant_id, amount, timestamp)
VALUES (12345, 678, 450, NOW());

UPDATE inventory SET stock = stock - 1 WHERE product_id = 9876;
```

OLAP (Online Analytical Processing)

Purpose: Run complex analytical queries on large datasets

Examples:

  • "What's the average delivery time by city over the last year?"
  • "Which product categories drove the most revenue last quarter?"
  • "What's our monthly user retention cohort analysis?"

Characteristics:

  • Few but complex queries (scanning millions of rows)
  • Mostly reads (no live updates)
  • Denormalized schemas (pre-joined data)
  • Column-oriented storage (better for aggregations)
```sql
-- OLAP query (complex, scans millions of rows)
SELECT
  DATE_TRUNC('month', order_date) AS month,
  city,
  AVG(delivery_time_minutes) AS avg_delivery_time,
  COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY month, city
ORDER BY month, avg_delivery_time DESC;
```

Key Insight: OLTP databases are optimized for writing data quickly. OLAP warehouses are optimized for reading and aggregating data at massive scale. This is why analysts work in warehouses, not production databases.

โš ๏ธ CheckpointQuiz error: Missing or invalid options array

๐Ÿ—๏ธ

Data Warehouse Architecture

A typical data warehouse architecture has three layers:

1. Source Layer (Operational Systems)

Data originates from multiple sources:

  • Transactional databases: Orders, customers, inventory
  • APIs: Payment gateways, logistics providers
  • Log files: Web server logs, app events
  • Third-party data: Marketing platforms, CRM tools

Example: Flipkart's sources:

  • MySQL: Orders, products, users
  • MongoDB: Customer reviews, search logs
  • Google Analytics: Web traffic
  • Facebook Ads API: Ad spend and conversions

2. ETL/ELT Layer (Data Pipeline)

ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) processes move data from sources to the warehouse:

Source DB → Extract → Transform (clean, join) → Load → Warehouse

Example: Daily ETL for Swiggy:

  1. Extract: Pull yesterday's orders from the MySQL database
  2. Transform: Join orders with restaurant data, calculate delivery times, flag outliers
  3. Load: Insert transformed data into BigQuery

Tools: Apache Airflow, dbt, Fivetran, Airbyte
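The transform step above can be sketched as a single SQL statement. This is a sketch using BigQuery-style syntax; the `raw.orders` and `raw.restaurants` tables and their columns are hypothetical names for illustration:

```sql
-- Hypothetical daily transform: enrich yesterday's raw orders
-- with restaurant data, derive delivery time, and flag outliers.
CREATE OR REPLACE TABLE staging.orders_enriched AS
SELECT
  o.order_id,
  r.restaurant_name,
  r.city,
  o.amount,
  -- Derived metric: minutes between placement and delivery
  TIMESTAMP_DIFF(o.delivered_at, o.placed_at, MINUTE) AS delivery_time_minutes,
  -- Flag deliveries over 2 hours as outliers for later review
  TIMESTAMP_DIFF(o.delivered_at, o.placed_at, MINUTE) > 120 AS is_outlier
FROM raw.orders o
JOIN raw.restaurants r ON o.restaurant_id = r.restaurant_id
WHERE DATE(o.placed_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```

In practice this statement would be scheduled by an orchestrator such as Airflow or expressed as a dbt model; the logic stays the same.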

3. Warehouse Layer (Storage + Compute)

The actual warehouse stores data in a schema optimized for analytics:

Popular Warehouses:

  • Google BigQuery (serverless, pay-per-query)
  • Amazon Redshift (Postgres-compatible, scalable)
  • Snowflake (cloud-native, auto-scaling)
  • Azure Synapse Analytics

Schema Design: Data is organized in a star schema or snowflake schema (denormalized for fast reads).

Think of it this way...

Flipkart's operational MySQL database handles 10,000 writes/sec during sales. Every night, an ETL pipeline extracts orders, joins them with product and user data, and loads into BigQuery. Analysts query BigQuery the next morning โ€” no impact on customer checkouts.


Star Schema: The Warehouse Data Model

Warehouses use denormalized schemas to speed up queries. The most common is the star schema:

Structure

  • Fact table (center): Measurable events (orders, pageviews, payments)
  • Dimension tables (points of the star): Descriptive attributes (customers, products, dates)

Example: E-commerce Warehouse

Fact Table: fact_orders

| order_id | customer_id | product_id | date_id | amount | quantity |
|----------|-------------|------------|----------|--------|----------|
| 1001 | 501 | 201 | 20260301 | 1200 | 2 |
| 1002 | 502 | 203 | 20260301 | 850 | 1 |

Dimension Tables:

dim_customers

| customer_id | name | city | tier |
|-------------|------|------|------|
| 501 | Amit Sharma | Mumbai | Gold |
| 502 | Priya Singh | Bangalore | Silver |

dim_products

| product_id | product_name | category | brand |
|------------|--------------|----------|-------|
| 201 | iPhone 15 | Electronics | Apple |
| 203 | Nike Shoes | Fashion | Nike |

dim_date

| date_id | date | month | quarter | year |
|----------|------------|-------|---------|------|
| 20260301 | 2026-03-01 | March | Q1 | 2026 |
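The tables above could be declared with DDL along these lines. This is a sketch: exact type names and constraint support vary by warehouse (for example, BigQuery uses `STRING` instead of `VARCHAR` and does not enforce foreign keys):

```sql
-- Fact table: one row per measurable event (here, an order)
CREATE TABLE fact_orders (
  order_id    BIGINT,
  customer_id BIGINT,   -- joins to dim_customers
  product_id  BIGINT,   -- joins to dim_products
  date_id     INT,      -- joins to dim_date (YYYYMMDD surrogate key)
  amount      NUMERIC,
  quantity    INT
);

-- Dimension table: small, descriptive lookup
CREATE TABLE dim_products (
  product_id   BIGINT,
  product_name VARCHAR(200),
  category     VARCHAR(100),
  brand        VARCHAR(100)
);
```

Note that the fact table holds only keys and measures; all descriptive text lives in the small dimension tables.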

Querying the Star Schema

```sql
-- Revenue by product category in Q1 2026
SELECT
  p.category,
  SUM(f.amount) AS total_revenue,
  COUNT(f.order_id) AS order_count
FROM fact_orders f
JOIN dim_products p ON f.product_id = p.product_id
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.quarter = 'Q1' AND d.year = 2026
GROUP BY p.category
ORDER BY total_revenue DESC;
```

Why This is Fast:

  • Fact table is pre-joined and denormalized (no need to stitch together 20 normalized tables)
  • Dimension tables are small and cached
  • Queries scan the large fact table once, then join small dimensions

Star Schema vs Normalized Schema: In a normalized OLTP database, this query might join 8+ tables (orders → order_items → products → categories → brands → customers → addresses → dates). Star schema flattens this into 3 pre-joined tables, making queries 10-100x faster.


Why Analysts Need Data Warehouses

1. Single Source of Truth

Instead of querying 15 different databases and Excel files, analysts query one centralized warehouse.

Example: Zomato's warehouse consolidates:

  • Orders from MySQL
  • Delivery partner GPS data from MongoDB
  • Restaurant reviews from Elasticsearch
  • Ad spend from Facebook Ads API

An analyst can now answer "What's the correlation between ad spend and new customer orders?" with a single query.
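A sketch of that single query, assuming the warehouse exposes consolidated `ad_spend` and `orders` tables (both hypothetical names), using the standard `CORR` aggregate available in BigQuery, Redshift, and Snowflake:

```sql
-- Daily ad spend vs. first-time customer orders (hypothetical tables)
WITH daily AS (
  SELECT
    a.spend_date,
    a.total_spend,
    COUNT(o.order_id) AS new_customer_orders
  FROM ad_spend a
  LEFT JOIN orders o
    ON DATE(o.order_date) = a.spend_date
   AND o.is_first_order = TRUE
  GROUP BY a.spend_date, a.total_spend
)
-- Pearson correlation between daily spend and new-customer order count
SELECT CORR(total_spend, new_customer_orders) AS spend_order_correlation
FROM daily;
```

Before the warehouse existed, answering this would have meant exporting from two systems and joining the data by hand.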

2. Historical Data Retention

Operational databases often archive or purge old data to stay fast. Warehouses keep years of history.

Example: Flipkart's MySQL database might keep only 3 months of orders. The warehouse stores 7 years of order history โ€” enabling year-over-year trend analysis.
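With multi-year history available, year-over-year analysis becomes a single window-function query. A sketch, assuming an `orders` table holding several years of data:

```sql
-- Year-over-year revenue growth, only possible with multi-year history
WITH yearly AS (
  SELECT
    EXTRACT(YEAR FROM order_date) AS order_year,
    SUM(amount) AS revenue
  FROM orders
  GROUP BY order_year
)
SELECT
  order_year,
  revenue,
  -- Percent change vs. the previous year's revenue
  ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY order_year))
        / LAG(revenue) OVER (ORDER BY order_year), 1) AS yoy_growth_pct
FROM yearly
ORDER BY order_year;
```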

3. Performance for Analytics

Warehouses use columnar storage (not row-based like MySQL), which can be 10-100x faster for aggregations because only the queried columns are read from disk.

Example (illustrative): Calculating AVG(delivery_time) across 100M rows:

  • MySQL (row-based): ~45 seconds
  • BigQuery (columnar): ~2 seconds

4. No Impact on Production

Analysts can run heavy queries without slowing down customer-facing systems.

5. Data Quality & Governance

Warehouses enforce schemas, data types, and validation rules. Bad data is caught before analysts see it.
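A typical post-load validation might look like this. The checks are illustrative, with column names taken from the fact_orders example above:

```sql
-- Post-load data quality checks: catch bad rows before analysts do
SELECT
  COUNT(*) AS total_rows,
  SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS missing_amounts,
  SUM(CASE WHEN amount < 0 THEN 1 ELSE 0 END) AS negative_amounts,
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS orphan_orders
FROM fact_orders;
```

Pipelines often run checks like these automatically after each load and halt (or alert) when any count is nonzero.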

โš ๏ธ FinalQuiz error: Missing or invalid questions array

โš ๏ธ SummarySection error: Missing or invalid items array

Received: {"hasItems":false,"isArray":false}