
What is a Data Warehouse? Complete Guide for Analysts

A data warehouse is not just a bigger database. It's a system designed specifically for analysis: optimized to answer questions like "What were our sales trends over the last 5 years?" across billions of rows in seconds.

Beginner · 10 min · 7 quizzes

What is a Data Warehouse?

A data warehouse is a centralized repository designed for analytical queries on large volumes of historical data. Unlike operational databases (which handle day-to-day transactions), warehouses are optimized for reading and aggregating massive datasets.

Key Characteristics

| Property | Data Warehouse | Operational Database |
|----------|----------------|----------------------|
| Purpose | Analysis & reporting | Transactions (CRUD operations) |
| Optimized for | Large reads, aggregations | Fast writes, updates |
| Data scope | Historical, consolidated | Current, transactional |
| Query pattern | Complex, analytical (OLAP) | Simple, transactional (OLTP) |
| Update frequency | Batch (hourly/daily) | Real-time |
| Schema | Denormalized (star/snowflake) | Normalized (3NF) |

Why Data Warehouses Exist

Imagine Flipkart's operational MySQL database handling 10,000 orders per minute. Running a query like "What's our year-over-year revenue growth by category?" would:

  • Lock tables and slow down customer checkouts
  • Take hours to scan billions of rows
  • Compete with real-time transactions

Instead, Flipkart copies data to a data warehouse (like BigQuery or Redshift) where analysts can run massive queries without affecting the production system.

Think of it this way...

An operational database is like a busy restaurant kitchen: fast, real-time, handling hundreds of orders. A data warehouse is like the restaurant's financial office: you don't cook orders there, but you analyze sales trends, ingredient costs, and customer behavior over months or years.


OLAP vs OLTP: The Core Difference

OLTP (Online Transaction Processing)

Purpose: Handle real-time transactions (INSERT, UPDATE, DELETE)

Examples:

  • Swiggy's order placement system
  • PhonePe's payment processing
  • Amazon's inventory management

Characteristics:

  • Hundreds/thousands of small, fast queries per second
  • Writes are frequent
  • Data is normalized (multiple tables with foreign keys)
  • Row-oriented storage
```sql
-- OLTP queries (fast, single-record operations)
INSERT INTO orders (customer_id, restaurant_id, amount, timestamp)
VALUES (12345, 678, 450, NOW());

UPDATE inventory SET stock = stock - 1 WHERE product_id = 9876;
```

OLAP (Online Analytical Processing)

Purpose: Run complex analytical queries on large datasets

Examples:

  • "What's the average delivery time by city over the last year?"
  • "Which product categories drove the most revenue last quarter?"
  • "What's our monthly user retention cohort analysis?"

Characteristics:

  • Few but complex queries (scanning millions of rows)
  • Mostly reads (no live updates)
  • Denormalized schemas (pre-joined data)
  • Column-oriented storage (better for aggregations)
```sql
-- OLAP query (complex, scans millions of rows)
SELECT
  DATE_TRUNC('month', order_date) AS month,
  city,
  AVG(delivery_time_minutes) AS avg_delivery_time,
  COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY month, city
ORDER BY month, avg_delivery_time DESC;
```

Key Insight: OLTP databases are optimized for writing data quickly. OLAP warehouses are optimized for reading and aggregating data at massive scale. This is why analysts work in warehouses, not production databases.

โš ๏ธ CheckpointQuiz error: Missing or invalid options array

๐Ÿ—๏ธ

Data Warehouse Architecture

A typical data warehouse architecture has three layers:

1. Source Layer (Operational Systems)

Data originates from multiple sources:

  • Transactional databases: Orders, customers, inventory
  • APIs: Payment gateways, logistics providers
  • Log files: Web server logs, app events
  • Third-party data: Marketing platforms, CRM tools

Example: Flipkart's sources:

  • MySQL: Orders, products, users
  • MongoDB: Customer reviews, search logs
  • Google Analytics: Web traffic
  • Facebook Ads API: Ad spend and conversions

2. ETL/ELT Layer (Data Pipeline)

ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) processes move data from sources to the warehouse:

Source DB → Extract → Transform (clean, join) → Load → Warehouse

Example: Daily ETL for Swiggy:

  1. Extract: Pull yesterday's orders from the MySQL database
  2. Transform: Join orders with restaurant data, calculate delivery times, flag outliers
  3. Load: Insert transformed data into BigQuery

Tools: Apache Airflow, dbt, Fivetran, Airbyte
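The transform step above can be sketched as a single SQL statement. This is a sketch using BigQuery-style syntax; the `raw.orders` and `raw.restaurants` tables and their columns are hypothetical names for illustration:

```sql
-- Hypothetical daily transform: enrich yesterday's raw orders
-- with restaurant data, derive delivery time, and flag outliers.
CREATE OR REPLACE TABLE staging.orders_enriched AS
SELECT
  o.order_id,
  r.restaurant_name,
  r.city,
  o.amount,
  -- Derived metric: minutes between placement and delivery
  TIMESTAMP_DIFF(o.delivered_at, o.placed_at, MINUTE) AS delivery_time_minutes,
  -- Flag deliveries over 2 hours as outliers for later review
  TIMESTAMP_DIFF(o.delivered_at, o.placed_at, MINUTE) > 120 AS is_outlier
FROM raw.orders o
JOIN raw.restaurants r ON o.restaurant_id = r.restaurant_id
WHERE DATE(o.placed_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```

In practice this statement would be scheduled by an orchestrator such as Airflow or expressed as a dbt model; the logic stays the same.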

3. Warehouse Layer (Storage + Compute)

The actual warehouse stores data in a schema optimized for analytics:

Popular Warehouses:

  • Google BigQuery (serverless, pay-per-query)
  • Amazon Redshift (Postgres-compatible, scalable)
  • Snowflake (cloud-native, auto-scaling)
  • Azure Synapse Analytics

Schema Design: Data is organized in a star schema or snowflake schema (denormalized for fast reads).

Think of it this way...

Flipkart's operational MySQL database handles 10,000 writes/sec during sales. Every night, an ETL pipeline extracts orders, joins them with product and user data, and loads into BigQuery. Analysts query BigQuery the next morning โ€” no impact on customer checkouts.


Star Schema: The Warehouse Data Model

Warehouses use denormalized schemas to speed up queries. The most common is the star schema:

Structure

  • Fact table (center): Measurable events (orders, pageviews, payments)
  • Dimension tables (points of the star): Descriptive attributes (customers, products, dates)

Example: E-commerce Warehouse

Fact Table: fact_orders

| order_id | customer_id | product_id | date_id | amount | quantity |
|----------|-------------|------------|----------|--------|----------|
| 1001 | 501 | 201 | 20260301 | 1200 | 2 |
| 1002 | 502 | 203 | 20260301 | 850 | 1 |

Dimension Tables:

dim_customers

| customer_id | name | city | tier |
|-------------|------|------|------|
| 501 | Amit Sharma | Mumbai | Gold |
| 502 | Priya Singh | Bangalore | Silver |

dim_products

| product_id | product_name | category | brand |
|------------|--------------|----------|-------|
| 201 | iPhone 15 | Electronics | Apple |
| 203 | Nike Shoes | Fashion | Nike |

dim_date

| date_id | date | month | quarter | year |
|----------|------------|-------|---------|------|
| 20260301 | 2026-03-01 | March | Q1 | 2026 |
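The tables above could be declared with DDL along these lines. This is a sketch: exact type names and constraint support vary by warehouse (for example, BigQuery uses `STRING` instead of `VARCHAR` and does not enforce foreign keys):

```sql
-- Fact table: one row per measurable event (here, an order)
CREATE TABLE fact_orders (
  order_id    BIGINT,
  customer_id BIGINT,   -- joins to dim_customers
  product_id  BIGINT,   -- joins to dim_products
  date_id     INT,      -- joins to dim_date (YYYYMMDD surrogate key)
  amount      NUMERIC,
  quantity    INT
);

-- Dimension table: small, descriptive lookup
CREATE TABLE dim_products (
  product_id   BIGINT,
  product_name VARCHAR(200),
  category     VARCHAR(100),
  brand        VARCHAR(100)
);
```

Note that the fact table holds only keys and measures; all descriptive text lives in the small dimension tables.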

Querying the Star Schema

```sql
-- Revenue by product category in Q1 2026
SELECT
  p.category,
  SUM(f.amount) AS total_revenue,
  COUNT(f.order_id) AS order_count
FROM fact_orders f
JOIN dim_products p ON f.product_id = p.product_id
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.quarter = 'Q1' AND d.year = 2026
GROUP BY p.category
ORDER BY total_revenue DESC;
```

Why This is Fast:

  • Fact table is pre-joined and denormalized (no need to stitch together 20 normalized tables)
  • Dimension tables are small and cached
  • Queries scan the large fact table once, then join small dimensions

Star Schema vs Normalized Schema: In a normalized OLTP database, this query might join 8+ tables (orders → order_items → products → categories → brands → customers → addresses → dates). Star schema flattens this into 3 pre-joined tables, making queries 10-100x faster.


Why Analysts Need Data Warehouses

1. Single Source of Truth

Instead of querying 15 different databases and Excel files, analysts query one centralized warehouse.

Example: Zomato's warehouse consolidates:

  • Orders from MySQL
  • Delivery partner GPS data from MongoDB
  • Restaurant reviews from Elasticsearch
  • Ad spend from Facebook Ads API

An analyst can now answer "What's the correlation between ad spend and new customer orders?" with a single query.
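A sketch of that single query, assuming the warehouse exposes consolidated `ad_spend` and `orders` tables (both hypothetical names), using the standard `CORR` aggregate available in BigQuery, Redshift, and Snowflake:

```sql
-- Daily ad spend vs. first-time customer orders (hypothetical tables)
WITH daily AS (
  SELECT
    a.spend_date,
    a.total_spend,
    COUNT(o.order_id) AS new_customer_orders
  FROM ad_spend a
  LEFT JOIN orders o
    ON DATE(o.order_date) = a.spend_date
   AND o.is_first_order = TRUE
  GROUP BY a.spend_date, a.total_spend
)
-- Pearson correlation between daily spend and new-customer order count
SELECT CORR(total_spend, new_customer_orders) AS spend_order_correlation
FROM daily;
```

Before the warehouse existed, answering this would have meant exporting from two systems and joining the data by hand.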

2. Historical Data Retention

Operational databases often archive or purge old data to stay fast. Warehouses keep years of history.

Example: Flipkart's MySQL database might keep only 3 months of orders. The warehouse stores 7 years of order history โ€” enabling year-over-year trend analysis.
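With multi-year history available, year-over-year analysis becomes a single window-function query. A sketch, assuming an `orders` table holding several years of data:

```sql
-- Year-over-year revenue growth, only possible with multi-year history
WITH yearly AS (
  SELECT
    EXTRACT(YEAR FROM order_date) AS order_year,
    SUM(amount) AS revenue
  FROM orders
  GROUP BY order_year
)
SELECT
  order_year,
  revenue,
  -- Percent change vs. the previous year's revenue
  ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY order_year))
        / LAG(revenue) OVER (ORDER BY order_year), 1) AS yoy_growth_pct
FROM yearly
ORDER BY order_year;
```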

3. Performance for Analytics

Warehouses use columnar storage (not row-based like MySQL), which can be 10-100x faster for aggregations because only the queried columns are read from disk.

Example (illustrative): Calculating AVG(delivery_time) across 100M rows:

  • MySQL (row-based): ~45 seconds
  • BigQuery (columnar): ~2 seconds

4. No Impact on Production

Analysts can run heavy queries without slowing down customer-facing systems.

5. Data Quality & Governance

Warehouses enforce schemas, data types, and validation rules. Bad data is caught before analysts see it.
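A typical post-load validation might look like this. The checks are illustrative, with column names taken from the fact_orders example above:

```sql
-- Post-load data quality checks: catch bad rows before analysts do
SELECT
  COUNT(*) AS total_rows,
  SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS missing_amounts,
  SUM(CASE WHEN amount < 0 THEN 1 ELSE 0 END) AS negative_amounts,
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS orphan_orders
FROM fact_orders;
```

Pipelines often run checks like these automatically after each load and halt (or alert) when any count is nonzero.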

โš ๏ธ FinalQuiz error: Missing or invalid questions array

โš ๏ธ SummarySection error: Missing or invalid items array

Received: {"hasItems":false,"isArray":false}