What is Google BigQuery?
Google BigQuery is a serverless, highly scalable data warehouse built for analytics. Unlike traditional databases (MySQL, PostgreSQL), you don't manage servers, configure clusters, or tune indexes. You just load data and query it.
Key Features
| Feature | What it Means | Benefit |
|---------|---------------|---------|
| Serverless | No infrastructure to manage | Start querying in seconds, no DevOps |
| Columnar storage | Data stored by column, not row | Aggregations (SUM, AVG) are 10-100x faster |
| Massively parallel | Queries run on thousands of nodes | Scan terabytes in seconds |
| Pay-per-query | $5 per TB scanned (on-demand) | Pay only for the data your queries scan |
| Separation of storage/compute | Storage and queries billed separately | Store petabytes cheaply, query on demand |
How it Differs from Traditional Databases
MySQL/PostgreSQL (OLTP):
- Row-oriented (fast for single-row reads/writes)
- You manage servers, memory, indexes
- Best for: Transactional workloads (INSERT, UPDATE, DELETE)
BigQuery (OLAP):
- Column-oriented (fast for aggregations)
- Fully managed (no servers)
- Best for: Analytical queries (SELECT with GROUP BY across millions of rows)
MySQL is like renting a car: you manage fuel, maintenance, and parking. BigQuery is like Uber: you just say where you want to go, and Google handles everything. You pay per ride (per query), not monthly rent.
BigQuery Basics: Datasets, Tables, and Queries
Structure
BigQuery organizes data hierarchically:
```
Project (billing account)
└── Dataset (like a database schema)
    └── Table (actual data)
```
Example: Swiggy's BigQuery setup:
```
swiggy-prod (project)
├── raw_data (dataset)
│   ├── orders
│   ├── customers
│   └── restaurants
└── analytics (dataset)
    ├── fact_orders
    └── dim_customers
```
Creating a Dataset
```sql
-- In BigQuery Console
CREATE SCHEMA `swiggy-prod.analytics`
OPTIONS (
  location = 'asia-south1',  -- Mumbai region
  description = 'Cleaned, transformed tables for dashboards'
);
```

Creating a Table
```sql
CREATE TABLE `swiggy-prod.analytics.fact_orders` (
  order_id STRING NOT NULL,
  customer_id STRING NOT NULL,
  restaurant_id STRING NOT NULL,
  order_date DATE NOT NULL,
  delivery_time_minutes INT64,
  amount FLOAT64,
  city STRING
)
PARTITION BY order_date
CLUSTER BY city;
```

Querying a Table
```sql
-- Total orders and revenue by city in March 2026
SELECT
  city,
  COUNT(*) AS order_count,
  SUM(amount) AS total_revenue,
  AVG(delivery_time_minutes) AS avg_delivery_time
FROM `swiggy-prod.analytics.fact_orders`
WHERE order_date BETWEEN '2026-03-01' AND '2026-03-31'
GROUP BY city
ORDER BY total_revenue DESC;
```

Result:
| city | order_count | total_revenue (₹) | avg_delivery_time (min) |
|------|-------------|-------------------|-------------------------|
| Mumbai | 1,245,000 | 6,22,50,000 | 27.3 |
| Bangalore | 1,098,000 | 5,49,00,000 | 24.8 |
| Delhi | 987,000 | 4,93,50,000 | 29.1 |
BigQuery Syntax: Table names are fully qualified: `project.dataset.table`. Backticks are required if names contain hyphens (e.g., `swiggy-prod`).
BigQuery Pricing: Pay Only for What You Query
BigQuery has two pricing models:
1. On-Demand Pricing (Most Common)
You pay per query based on data scanned:
- $5 per TB scanned (first 1 TB per month is free)
- Storage: $0.02 per GB per month (active), $0.01 per GB (long-term, 90+ days)
Example costs:
| Query | Data Scanned | Cost |
|-------|--------------|------|
| SELECT COUNT(*) FROM orders (500 GB table) | 500 GB | $2.50 |
| SELECT city, SUM(amount) FROM orders GROUP BY city (500 GB table) | 500 GB | $2.50 |
| SELECT * FROM orders WHERE order_date = '2026-03-22' (partitioned, 2 GB) | 2 GB | $0.01 |
| SELECT order_id, amount FROM orders (only 2 columns, 50 GB) | 50 GB | $0.25 |
Key insight: Columnar storage means you only pay for columns you query. Selecting 2 columns from a 50-column table is 25x cheaper than SELECT *.
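To make the arithmetic concrete, here is a minimal Python sketch of the on-demand cost model, assuming the $5/TB rate and the decimal units used in the table above (a simplified estimate, not the official billing logic):

```python
# Simplified on-demand cost model: $5 per TB scanned (decimal units),
# matching the example table above. Not the official billing API.
PRICE_PER_TB = 5.0
GB = 1e9  # bytes, decimal

def query_cost(bytes_scanned: float) -> float:
    """USD cost of a single on-demand query."""
    return bytes_scanned / 1e12 * PRICE_PER_TB

print(query_cost(500 * GB))  # full 500 GB scan -> 2.5
print(query_cost(50 * GB))   # 2 of 50 columns, 50 GB -> 0.25
```

The second call shows the columnar effect: reading 2 columns that hold 50 GB of a 500 GB table costs a tenth of scanning everything.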
2. Flat-Rate Pricing (For Heavy Users)
Pay a fixed monthly fee for guaranteed query capacity:
- $2,000/month for 100 slots (processing units)
- Good for companies running 24/7 queries (e.g., real-time dashboards)
Example: Zepto (10-minute delivery) runs hundreds of queries per hour. They use flat-rate pricing to cap costs at $2,000/month instead of paying per query.
Cost Optimization Tips
- Use partitioning (scan only relevant partitions)
- Use clustering (prune data within partitions)
- Select only needed columns (avoid `SELECT *`)
- Preview data with the free table preview (note that `LIMIT` alone does not reduce bytes scanned)
- Use approximate aggregations (`APPROX_COUNT_DISTINCT` vs `COUNT(DISTINCT)`)
Free Tier: The first 1 TB scanned per month is free. For small projects, you might never pay. Storage is also free up to 10 GB.
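The free tier folds into the same arithmetic. A sketch of a monthly bill estimate, assuming the $5/TB rate after the first free 1 TB each month:

```python
# Monthly on-demand estimate with the 1 TB/month free tier applied first.
# Assumes the $5/TB rate from this section; a sketch, not the billing API.
PRICE_PER_TB = 5.0
FREE_TB_PER_MONTH = 1.0

def monthly_cost(tb_per_query: list) -> float:
    """USD bill for a month's worth of queries, free tier deducted."""
    billable_tb = max(0.0, sum(tb_per_query) - FREE_TB_PER_MONTH)
    return billable_tb * PRICE_PER_TB

# 30 daily dashboard refreshes scanning 0.05 TB each = 1.5 TB total
print(monthly_cost([0.05] * 30))  # roughly $2.50 (0.5 TB over the free tier)
print(monthly_cost([0.2]))        # one small query: fully covered, $0
```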
Partitioning and Clustering: Optimizing Query Performance
BigQuery offers two key optimization features:
Partitioning: Split Tables by Date/Range
Partitioning divides a table into segments based on a column (usually a date). Queries that filter on the partition column scan only relevant partitions, not the entire table.
Example: Partitioning by order_date:
```sql
CREATE TABLE `swiggy-prod.analytics.fact_orders`
PARTITION BY order_date
AS
SELECT * FROM `swiggy-prod.raw.orders`;
```

Without partitioning:
```sql
-- Scans the entire 2 TB table
SELECT COUNT(*) FROM fact_orders
WHERE order_date = '2026-03-22';
-- Cost: $10
```

With partitioning:
```sql
-- Scans only the 2026-03-22 partition (2 GB)
SELECT COUNT(*) FROM fact_orders
WHERE order_date = '2026-03-22';
-- Cost: $0.01
```

Partition types:
- Daily: One partition per day (most common)
- Monthly: One partition per month
- Integer range: Partition by customer_id ranges
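The pruning behaviour can be sketched in plain Python: think of a partitioned table as rows grouped by date, where a filter on the partition column only has to read one group. A toy model, not BigQuery's actual storage internals:

```python
# Toy model of partition pruning: rows are grouped by order_date, and a
# query filtering on that column reads only the matching group.
from collections import defaultdict

partitions = defaultdict(list)  # order_date -> list of rows

def insert(row):
    partitions[row["order_date"]].append(row)

def count_for_date(order_date, pruning=True):
    """Return (matching_rows, rows_scanned)."""
    if pruning:
        candidates = partitions.get(order_date, [])
    else:  # unpartitioned table: every row must be scanned
        candidates = [r for part in partitions.values() for r in part]
    matching = sum(1 for r in candidates if r["order_date"] == order_date)
    return matching, len(candidates)

insert({"order_date": "2026-03-22", "order_id": "a1"})
insert({"order_date": "2026-03-22", "order_id": "a2"})
insert({"order_date": "2026-03-21", "order_id": "b1"})

print(count_for_date("2026-03-22", pruning=True))   # (2, 2): one partition read
print(count_for_date("2026-03-22", pruning=False))  # (2, 3): whole table read
```

Same answer either way; pruning only changes how much data is touched, which is exactly what BigQuery bills for.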
Clustering: Organize Data Within Partitions
Clustering sorts data within each partition by one or more columns. Queries filtering on clustered columns skip irrelevant blocks.
Example: Clustering by city:
```sql
CREATE TABLE `swiggy-prod.analytics.fact_orders`
PARTITION BY order_date
CLUSTER BY city
AS
SELECT * FROM `swiggy-prod.raw.orders`;
```

Query:
```sql
-- Partition pruning narrows to one day (2 GB);
-- clustering then prunes to Mumbai blocks (200 MB)
SELECT COUNT(*) FROM fact_orders
WHERE order_date = '2026-03-22'
  AND city = 'Mumbai';
-- Cost: $0.001 (only 200 MB scanned)
```

When to use clustering:
- High-cardinality columns (city, product_id, customer_id)
- Frequently filtered columns
- Up to 4 clustering columns
Flipkart's orders table (5 billion rows, 2 TB) is:
- Partitioned by order_date (daily)
- Clustered by category, city
A query for "Electronics orders in Mumbai on 2026-03-22" scans only:
- One partition (one day)
- Electronics + Mumbai blocks within that partition
Without optimization: 2 TB scanned, $10. With optimization: 50 MB scanned, about $0.0003.
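Clustering's block-skipping can also be sketched in Python: rows are sorted by the cluster column and cut into blocks, each block remembers its min/max value, and a query skips blocks whose range cannot contain the filter value. A toy model with made-up data, not BigQuery's actual storage format:

```python
# Toy model of clustering: sort rows by the cluster key, cut into blocks,
# and skip blocks whose min/max range can't contain the filter value.
from itertools import islice

def make_blocks(rows, key, block_size=2):
    """Sort by the cluster key and cut into fixed-size blocks with
    min/max metadata, loosely like column-block statistics."""
    rows = sorted(rows, key=lambda r: r[key])
    it = iter(rows)
    blocks = []
    while chunk := list(islice(it, block_size)):
        blocks.append({"min": chunk[0][key], "max": chunk[-1][key], "rows": chunk})
    return blocks

def count_city(blocks, city):
    """Return (matching_rows, rows_scanned)."""
    matching = scanned = 0
    for block in blocks:
        if block["min"] <= city <= block["max"]:  # block may contain the city
            scanned += len(block["rows"])
            matching += sum(r["city"] == city for r in block["rows"])
    return matching, scanned

rows = [{"city": c} for c in
        ["Delhi", "Mumbai", "Pune", "Chennai", "Mumbai", "Kolkata"]]
blocks = make_blocks(rows, "city")
print(count_city(blocks, "Mumbai"))  # (2, 4): 2 of 3 blocks scanned, not all 6 rows
```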
BigQuery Public Datasets: Free Data to Practice
BigQuery provides free public datasets (Google doesn't charge for the storage; you only pay for queries):
Popular Public Datasets
- Google Analytics Sample (`bigquery-public-data.ga4_obfuscated_sample_ecommerce`): real e-commerce website traffic and conversions
- COVID-19 Open Data (`bigquery-public-data.covid19_open_data.covid19_open_data`): global COVID-19 cases, deaths, vaccinations
- GitHub Archive (`githubarchive.day.20260322`): every public GitHub event (commits, issues, PRs)
- New York Taxi Trips (`bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2023`): 1.4 billion taxi rides with pickup/dropoff times, fares, distances
- Stack Overflow (`bigquery-public-data.stackoverflow.posts_questions`): 50+ million Stack Overflow questions and answers
Example: Analyzing GitHub Activity
```sql
-- Top 10 most active GitHub repositories on March 22, 2026
SELECT
  repo.name AS repo_name,
  COUNT(*) AS event_count
FROM `githubarchive.day.20260322`
GROUP BY repo_name
ORDER BY event_count DESC
LIMIT 10;
```

Result:
| repo_name | event_count |
|-----------|-------------|
| microsoft/vscode | 12,450 |
| pytorch/pytorch | 8,932 |
| facebook/react | 7,821 |
Example: COVID-19 Analysis (India)
```sql
-- Daily new COVID-19 cases in India (last 30 days)
SELECT
  date,
  new_confirmed AS new_cases,
  new_deceased AS new_deaths
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_name = 'India'
  AND date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
ORDER BY date;
```

Practice for Free: Query public datasets to learn BigQuery without using your own data. The first 1 TB/month is free, and most public dataset queries are under 10 GB.
Advanced BigQuery Features
1. Nested and Repeated Fields (Arrays/Structs)
BigQuery supports complex data types (unlike most traditional SQL databases):
```sql
-- Table with nested fields
CREATE TABLE orders (
  order_id STRING,
  customer STRUCT<
    name STRING,
    email STRING,
    address STRUCT<city STRING, state STRING>
  >,
  items ARRAY<STRUCT<
    product_id STRING,
    quantity INT64,
    price FLOAT64
  >>
);

-- Query nested fields
SELECT
  order_id,
  customer.name AS customer_name,
  customer.address.city AS city,
  -- Unnest array to get individual items
  item.product_id,
  item.quantity
FROM orders
CROSS JOIN UNNEST(items) AS item
WHERE customer.address.city = 'Mumbai';
```

Use case: Storing JSON events (e.g., Google Analytics hits) without flattening.
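For intuition, here is what `CROSS JOIN UNNEST(items)` does, sketched in Python over dictionaries shaped like the table above (the sample values are made up):

```python
# Python analogue of CROSS JOIN UNNEST(items): each order row is
# repeated once per element of its items array.
orders = [
    {
        "order_id": "o1",
        "customer": {"name": "Asha", "address": {"city": "Mumbai"}},
        "items": [
            {"product_id": "p1", "quantity": 2},
            {"product_id": "p2", "quantity": 1},
        ],
    },
    {
        "order_id": "o2",
        "customer": {"name": "Ravi", "address": {"city": "Delhi"}},
        "items": [{"product_id": "p3", "quantity": 5}],
    },
]

def unnest_items(rows):
    for row in rows:
        for item in row["items"]:  # CROSS JOIN UNNEST(items) AS item
            yield {
                "order_id": row["order_id"],
                "city": row["customer"]["address"]["city"],
                "product_id": item["product_id"],
                "quantity": item["quantity"],
            }

mumbai = [r for r in unnest_items(orders) if r["city"] == "Mumbai"]
print(mumbai)  # two flattened rows, one per item of order o1
```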
2. Scheduled Queries
Automate queries to run daily/hourly:
```sql
-- This query runs daily at 2 AM
CREATE OR REPLACE TABLE analytics.daily_summary AS
SELECT
  order_date,
  COUNT(*) AS order_count,
  SUM(amount) AS revenue
FROM raw.orders
WHERE order_date = CURRENT_DATE() - 1
GROUP BY order_date;
```

Set schedule: In the BigQuery Console, choose Schedule Query, then Daily at 02:00.
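If you drive the same summary from a client-side script instead (e.g., for backfills), the "yesterday" filter can be computed like this. The table names follow the example above; `summary_query` is a hypothetical helper, not a BigQuery API:

```python
# Build the daily-summary SQL for a given run date, mirroring the
# CURRENT_DATE() - 1 filter in the scheduled query above.
from datetime import date, timedelta

def summary_query(run_date: date) -> str:
    target = run_date - timedelta(days=1)  # "yesterday" relative to the run
    return (
        "SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS revenue "
        "FROM raw.orders "
        f"WHERE order_date = '{target.isoformat()}' "
        "GROUP BY order_date"
    )

print(summary_query(date(2026, 3, 22)))  # filters on '2026-03-21'
```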
3. Machine Learning with BigQuery ML
Train ML models using SQL:
```sql
-- Train a model to predict customer churn
CREATE OR REPLACE MODEL analytics.churn_model
OPTIONS(model_type='logistic_reg', input_label_cols=['churned']) AS
SELECT
  days_since_last_order,
  total_orders,
  avg_order_value,
  churned
FROM analytics.customer_features;

-- Predict churn for current customers
SELECT
  customer_id,
  predicted_churned,
  predicted_churned_probs[OFFSET(1)].prob AS churn_probability
FROM ML.PREDICT(MODEL analytics.churn_model,
  (SELECT * FROM analytics.current_customers))
ORDER BY churn_probability DESC;
```

Use cases: churn prediction, recommendation systems, demand forecasting.
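Conceptually, a `logistic_reg` model scores each row with a logistic (sigmoid) function over weighted features. A stdlib Python sketch with made-up weights (BigQuery ML learns the real ones from the training query):

```python
# Logistic-regression scoring, conceptually what ML.PREDICT computes.
# Weights and bias are invented for illustration only.
import math

WEIGHTS = {"days_since_last_order": 0.05,
           "total_orders": -0.10,
           "avg_order_value": -0.002}
BIAS = -1.0

def churn_probability(features: dict) -> float:
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))  # sigmoid -> probability in (0, 1)

p = churn_probability({"days_since_last_order": 60,
                       "total_orders": 3,
                       "avg_order_value": 250})
print(round(p, 3))  # dormant, low-frequency customer -> high churn score
```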
4. Streaming Inserts
Insert rows in real-time (not batches):
```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "swiggy-prod.analytics.realtime_orders"

rows_to_insert = [
    {"order_id": "12345", "amount": 450, "timestamp": "2026-03-22T14:32:10"},
    {"order_id": "12346", "amount": 890, "timestamp": "2026-03-22T14:32:15"},
]

# insert_rows_json returns a list of per-row errors (empty on success)
errors = client.insert_rows_json(table_id, rows_to_insert)
if errors:
    print(f"Errors: {errors}")
```

Use case: real-time dashboards (e.g., live order tracking).
Streaming cost: $0.01 per 200 MB inserted. For high-volume streaming, use BigQuery Storage Write API (cheaper).
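A rough estimator for the streaming rate quoted above (a sketch; it ignores BigQuery's per-row minimum billed size):

```python
# Rough streaming-insert cost at $0.01 per 200 MB of JSON payload.
# Simplified: ignores BigQuery's per-row minimum billed size.
import json

def streaming_cost_usd(rows) -> float:
    payload_bytes = sum(len(json.dumps(r).encode("utf-8")) for r in rows)
    return payload_bytes / (200 * 1e6) * 0.01

rows = [{"order_id": str(i), "amount": 450} for i in range(1000)]
print(f"${streaming_cost_usd(rows):.6f}")  # well under a cent for 1,000 small rows
```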