What is a Data Warehouse?
A data warehouse is a centralized repository designed for analytical queries on large volumes of historical data. Unlike operational databases (which handle day-to-day transactions), warehouses are optimized for reading and aggregating massive datasets.
Key Characteristics
| Property | Data Warehouse | Operational Database | |----------|----------------|---------------------| | Purpose | Analysis & reporting | Transactions (CRUD operations) | | Optimized for | Large reads, aggregations | Fast writes, updates | | Data scope | Historical, consolidated | Current, transactional | | Query pattern | Complex, analytical (OLAP) | Simple, transactional (OLTP) | | Update frequency | Batch (hourly/daily) | Real-time | | Schema | Denormalized (star/snowflake) | Normalized (3NF) |
Why Data Warehouses Exist
Imagine Flipkart's operational MySQL database handling 10,000 orders per minute. Running a query like "What's our year-over-year revenue growth by category?" would:
- Lock tables and slow down customer checkouts
- Take hours to scan billions of rows
- Compete with real-time transactions
Instead, Flipkart copies data to a data warehouse (like BigQuery or Redshift) where analysts can run massive queries without affecting the production system.
An operational database is like a busy restaurant kitchen โ fast, real-time, handling hundreds of orders. A data warehouse is like the restaurant's financial office โ you don't cook orders there, but you analyze sales trends, ingredient costs, and customer behavior over months or years.
OLAP vs OLTP โ The Core Difference
OLTP (Online Transaction Processing)
Purpose: Handle real-time transactions (INSERT, UPDATE, DELETE)
Examples:
- Swiggy's order placement system
- PhonePe's payment processing
- Amazon's inventory management
Characteristics:
- Hundreds/thousands of small, fast queries per second
- Writes are frequent
- Data is normalized (multiple tables with foreign keys)
- Row-oriented storage
-- OLTP query (fast, single record)
INSERT INTO orders (customer_id, restaurant_id, amount, timestamp)
VALUES (12345, 678, 450, NOW());
UPDATE inventory SET stock = stock - 1 WHERE product_id = 9876;OLAP (Online Analytical Processing)
Purpose: Run complex analytical queries on large datasets
Examples:
- "What's the average delivery time by city over the last year?"
- "Which product categories drove the most revenue last quarter?"
- "What's our monthly user retention cohort analysis?"
Characteristics:
- Few but complex queries (scanning millions of rows)
- Mostly reads (no live updates)
- Denormalized schemas (pre-joined data)
- Column-oriented storage (better for aggregations)
-- OLAP query (complex, millions of rows)
SELECT
DATE_TRUNC('month', order_date) AS month,
city,
AVG(delivery_time_minutes) AS avg_delivery_time,
COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY month, city
ORDER BY month, avg_delivery_time DESC;Key Insight: OLTP databases are optimized for writing data quickly. OLAP warehouses are optimized for reading and aggregating data at massive scale. This is why analysts work in warehouses, not production databases.
โ ๏ธ CheckpointQuiz error: Missing or invalid options array
Data Warehouse Architecture
A typical data warehouse architecture has three layers:
1. Source Layer (Operational Systems)
Data originates from multiple sources:
- Transactional databases: Orders, customers, inventory
- APIs: Payment gateways, logistics providers
- Log files: Web server logs, app events
- Third-party data: Marketing platforms, CRM tools
Example โ Flipkart's sources:
- MySQL: Orders, products, users
- MongoDB: Customer reviews, search logs
- Google Analytics: Web traffic
- Facebook Ads API: Ad spend and conversions
2. ETL/ELT Layer (Data Pipeline)
ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) processes move data from sources to the warehouse:
Source DB โ Extract โ Transform (clean, join) โ Load โ Warehouse
Example โ Daily ETL for Swiggy:
- Extract: Pull yesterday's orders from the MySQL database
- Transform: Join orders with restaurant data, calculate delivery times, flag outliers
- Load: Insert transformed data into BigQuery
Tools: Apache Airflow, dbt, Fivetran, Airbyte
3. Warehouse Layer (Storage + Compute)
The actual warehouse stores data in a schema optimized for analytics:
Popular Warehouses:
- Google BigQuery (serverless, pay-per-query)
- Amazon Redshift (Postgres-compatible, scalable)
- Snowflake (cloud-native, auto-scaling)
- Azure Synapse Analytics
Schema Design: Data is organized in a star schema or snowflake schema (denormalized for fast reads).
Flipkart's operational MySQL database handles 10,000 writes/sec during sales. Every night, an ETL pipeline extracts orders, joins them with product and user data, and loads into BigQuery. Analysts query BigQuery the next morning โ no impact on customer checkouts.
Star Schema โ The Warehouse Data Model
Warehouses use denormalized schemas to speed up queries. The most common is the star schema:
Structure
- Fact table (center): Measurable events (orders, pageviews, payments)
- Dimension tables (points of the star): Descriptive attributes (customers, products, dates)
Example โ E-commerce Warehouse
Fact Table: fact_orders
| order_id | customer_id | product_id | date_id | amount | quantity | |----------|-------------|------------|---------|--------|----------| | 1001 | 501 | 201 | 20260301 | 1200 | 2 | | 1002 | 502 | 203 | 20260301 | 850 | 1 |
Dimension Tables:
dim_customers
| customer_id | name | city | tier | |-------------|------|------|------| | 501 | Amit Sharma | Mumbai | Gold | | 502 | Priya Singh | Bangalore | Silver |
dim_products
| product_id | product_name | category | brand | |------------|--------------|----------|-------| | 201 | iPhone 15 | Electronics | Apple | | 203 | Nike Shoes | Fashion | Nike |
dim_date
| date_id | date | month | quarter | year | |---------|------|-------|---------|------| | 20260301 | 2026-03-01 | March | Q1 | 2026 |
Querying the Star Schema
-- Revenue by product category in Q1 2026
SELECT
p.category,
SUM(f.amount) AS total_revenue,
COUNT(f.order_id) AS order_count
FROM fact_orders f
JOIN dim_products p ON f.product_id = p.product_id
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.quarter = 'Q1' AND d.year = 2026
GROUP BY p.category
ORDER BY total_revenue DESC;Why This is Fast:
- Fact table is pre-aggregated (no complex joins across 20 tables)
- Dimension tables are small and cached
- Queries scan the large fact table once, then join small dimensions
Star Schema vs Normalized Schema: In a normalized OLTP database, this query might join 8+ tables (orders โ order_items โ products โ categories โ brands โ customers โ addresses โ dates). Star schema flattens this into 3 pre-joined tables, making queries 10-100x faster.
Why Analysts Need Data Warehouses
1. Single Source of Truth
Instead of querying 15 different databases and Excel files, analysts query one centralized warehouse.
Example: Zomato's warehouse consolidates:
- Orders from MySQL
- Delivery partner GPS data from MongoDB
- Restaurant reviews from Elasticsearch
- Ad spend from Facebook Ads API
An analyst can now answer "What's the correlation between ad spend and new customer orders?" with a single query.
2. Historical Data Retention
Operational databases delete old data to save space. Warehouses keep years of history.
Example: Flipkart's MySQL database might keep only 3 months of orders. The warehouse stores 7 years of order history โ enabling year-over-year trend analysis.
3. Performance for Analytics
Warehouses use columnar storage (not row-based like MySQL), which is 10-100x faster for aggregations.
Example: Calculating AVG(delivery_time) across 100M rows:
- MySQL (row-based): 45 seconds
- BigQuery (columnar): 2 seconds
4. No Impact on Production
Analysts can run heavy queries without slowing down customer-facing systems.
5. Data Quality & Governance
Warehouses enforce schemas, data types, and validation rules. Bad data is caught before analysts see it.
โ ๏ธ FinalQuiz error: Missing or invalid questions array
โ ๏ธ SummarySection error: Missing or invalid items array
Received: {"hasItems":false,"isArray":false}