A/B Test Sample Size Calculator — Plan Powerful Experiments

Running an A/B test without calculating sample size is like building a bridge without checking if it can hold weight. Use this calculator to ensure your experiments have enough statistical power.


Why Sample Size Calculation Matters

Sample size determines whether your A/B test can detect meaningful differences with statistical confidence.

The Problem: Underpowered Tests

Scenario: Flipkart wants to test a new checkout flow.

Without Sample Size Calculation:

"Let's test for 3 days and see what happens."

  • Day 3: Control 2.5% conversion (5,000 users); Treatment 2.7% conversion (5,000 users)
  • Difference: +0.2 percentage points (8% relative lift)
  • P-value: 0.18 (not significant)
  • Conclusion: "New checkout doesn't work" ❌

Reality: Test was underpowered (sample too small). The 8% lift might be REAL, but you can't detect it with only 5,000 users per group.


With Sample Size Calculation:

Inputs:

  • Baseline: 2.5% conversion
  • MDE: 8% relative lift (2.5% → 2.7%)
  • Significance: α = 0.05
  • Power: 80%

Calculator says: Need 62,340 users per group (124,680 total)

Run test properly:

  • Day 15: Control 2.5% (62K users); Treatment 2.7% (62K users)
  • P-value: 0.03 (significant!)
  • Conclusion: "New checkout increases conversion 8%" ✓

What Sample Size Calculation Tells You

  1. How many users needed — Per variant (e.g., 50,000 per group = 100,000 total)
  2. How long test will take — Days = (Sample size) / (Daily traffic × % in test)
  3. Whether test is feasible — If you need 1M users but only get 10K/day, test takes 100 days (too long)
Think of it this way...

Sample size is like survey margin of error. Polling 100 people gives ±10% margin (useless for close elections). Polling 1,000 gives ±3% margin (useful). Sample size calculation ensures your A/B test has enough "resolution" to detect small differences confidently.
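The polling analogy can be checked with the usual 1/√n rule of thumb for a 95% margin of error (a rough approximation for proportions near 50%, not the exact formula):

```python
import math

def margin_of_error(n):
    """Rough 95% margin of error for a proportion near 50%: ~1/sqrt(n)."""
    return 1 / math.sqrt(n)

print(round(margin_of_error(100), 3))    # 0.1   -> roughly +/-10%
print(round(margin_of_error(1_000), 3))  # 0.032 -> roughly +/-3%
```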


A/B Test Sample Size Calculator

Use this calculator to determine the required sample size for your A/B test.


How to Use This Calculator

Step 1: Enter Baseline Conversion Rate

  • Current conversion rate (before treatment)
  • Example: If 500 out of 10,000 users convert → 5% baseline

Step 2: Enter Minimum Detectable Effect (MDE)

  • Smallest change you care about detecting
  • Can enter as a relative % (e.g., a 10% lift: 5.0% → 5.5%) or an absolute difference (e.g., 0.5 percentage points: 5.0% → 5.5%)

Step 3: Set Significance Level (α)

  • Probability of false positive (claiming difference when there is none)
  • Standard: 0.05 (5% false positive rate)
  • Conservative: 0.01 (1% false positive rate, needs larger sample)

Step 4: Set Statistical Power

  • Probability of detecting real difference (if it exists)
  • Standard: 0.80 (80% power, 20% false negative rate)
  • High power: 0.90 (90% power, needs larger sample)

Step 5: Read Results

  • Sample size per variant: Users needed in EACH group (A and B)
  • Total sample size: Users needed across both groups
  • Test duration estimate: Days needed (if you input daily traffic)

The Formula Explained

Sample size calculation uses this formula (for two proportions):

Standard Formula

n = 2 × (Zα/2 + Zβ)² × p̂(1 − p̂) / δ²

Where:

  • n = sample size per group
  • Zα/2 = Z-score for significance level (1.96 for α = 0.05)
  • Zβ = Z-score for power (0.84 for 80% power)
  • p̂ = (p₁ + p₂) / 2 (pooled proportion)
  • δ = |p₂ − p₁| (absolute difference)
  • p₁ = baseline conversion rate
  • p₂ = expected conversion rate after treatment
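A minimal Python sketch of this formula, using the standard library's `statistics.NormalDist` for the Z-scores (the function name is mine, not from the calculator):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """n = 2 * (Z_alpha/2 + Z_beta)^2 * p_hat * (1 - p_hat) / delta^2"""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_hat = (p1 + p2) / 2                          # pooled proportion
    delta = abs(p2 - p1)                           # absolute difference
    n = 2 * (z_alpha + z_beta) ** 2 * p_hat * (1 - p_hat) / delta ** 2
    return math.ceil(n)

# Worked example below: 4.0% baseline, 10% relative lift
print(sample_size_per_group(0.04, 0.044))  # about 39,476 per group
```

The result is slightly above the 39,437 in the worked example because the hand calculation uses rounded Z-scores (1.96 and 0.84) rather than the exact inverse-CDF values.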

Step-by-Step Example

Scenario: Swiggy wants to test "Free Delivery Above ₹99" banner.

Inputs:

  • Baseline conversion: 4.0%
  • Minimum detectable effect: 10% relative lift (4.0% → 4.4%)
  • Significance level: α = 0.05 (95% confidence)
  • Power: 0.80 (80% chance to detect effect)

Step 1: Convert Relative to Absolute Difference

p₁ = 0.04 (baseline)
Relative lift = 10%
p₂ = p₁ × (1 + 0.10) = 0.04 × 1.10 = 0.044
δ = p₂ − p₁ = 0.044 − 0.04 = 0.004 (absolute difference)

Step 2: Calculate Pooled Proportion

p̂ = (p₁ + p₂) / 2 = (0.04 + 0.044) / 2 = 0.042

Step 3: Look Up Z-Scores

Zα/2 = 1.96 (for α = 0.05, two-tailed)
Zβ = 0.84 (for power = 0.80)

Step 4: Calculate Sample Size

n = 2 × (1.96 + 0.84)² × 0.042 × (1 − 0.042) / 0.004²
  = 2 × (2.8)² × 0.042 × 0.958 / 0.000016
  = 2 × 7.84 × 0.0402 / 0.000016
  = 0.631 / 0.000016
  ≈ 39,437 users per group

Total sample size = 2 × 39,437 = 78,874 users

Step 5: Estimate Test Duration

If Swiggy gets 50,000 daily users:
Duration = 78,874 / 50,000 = 1.58 days ≈ 2 days

If a smaller site gets 5,000 daily users:
Duration = 78,874 / 5,000 = 15.8 days ≈ 16 days
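The five steps can be verified in a few lines, using the same rounded Z-scores (1.96 and 0.84) as the text; variable names are mine:

```python
import math

p1 = 0.04                       # Step 1: baseline
p2 = p1 * 1.10                  # 10% relative lift -> 0.044
delta = p2 - p1                 # 0.004 absolute difference
p_hat = (p1 + p2) / 2           # Step 2: pooled proportion = 0.042
z_sq = (1.96 + 0.84) ** 2       # Step 3: (Z_alpha/2 + Z_beta)^2 = 7.84

n = 2 * z_sq * p_hat * (1 - p_hat) / delta ** 2   # Step 4
total = 2 * n

print(round(n))                   # 39431 -- the text's 39,437 rounds 0.631 first
print(math.ceil(total / 50_000))  # Step 5: 2 days at 50,000 daily users
```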

Z-Score Table

| Significance Level (α) | Confidence | Zα/2 |
|------------------------|------------|-------|
| 0.10 | 90% | 1.645 |
| 0.05 | 95% | 1.96 |
| 0.01 | 99% | 2.576 |

| Power (1−β) | Zβ |
|-------------|-------|
| 0.70 | 0.52 |
| 0.80 | 0.84 |
| 0.90 | 1.28 |
| 0.95 | 1.645 |
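Both tables can be reproduced with the standard library's inverse normal CDF (`statistics.NormalDist.inv_cdf`), so no lookup table is strictly needed:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf

# Z_alpha/2 for a two-tailed test: inverse CDF at 1 - alpha/2
for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(z(1 - alpha / 2), 3))   # 1.645, 1.96, 2.576

# Z_beta for a given power: inverse CDF at the power itself
for power in (0.70, 0.80, 0.90, 0.95):
    print(power, round(z(power), 3))           # 0.524, 0.842, 1.282, 1.645
```

The power table above rounds most entries to two decimals (0.52, 0.84, 1.28).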


Quick Rule: For typical A/B tests (5% baseline, 10% relative lift, α=0.05, power=0.80), you need roughly 30,000 users per group. Lower baseline rates or smaller effects require larger samples.



Real-World Sample Size Examples

Example 1: Flipkart — Product Image Hover Zoom

Goal: Test hover-to-zoom vs click-to-enlarge product images.

Inputs:

Baseline conversion rate: 3.5%
Minimum detectable effect: 5% relative lift (3.5% → 3.675%)
Significance level: α = 0.05
Power: 0.80

Calculator result:

  • Sample size per group: 127,680 users
  • Total sample size: 255,360 users
  • Estimated duration: 5 days (≈50K daily users entering the test)

Decision: Test is feasible (5 days is reasonable). Run A/B test.

Actual Result:

Control: 3.48% conversion (128K users)
Treatment: 3.71% conversion (128K users)
Lift: +6.6% (p = 0.002, significant)

Outcome: Detected effect (6.6%) was larger than MDE (5%), test succeeded. Deployed hover-zoom → ₹300Cr additional annual revenue.


Example 2: Swiggy — Delivery Time Promise Badge

Goal: Test "Delivers in 30 min" badge vs no badge.

Inputs:

Baseline order rate: 5.0%
MDE: 10% relative lift (5.0% → 5.5%)
α = 0.05, Power = 0.80

Calculator result:

  • Sample size per group: 31,376 users
  • Total sample size: 62,752 users
  • Duration: 1.3 days (50K daily users)

Decision: Test is very feasible (1-2 days). Run test.

Actual Result:

Control: 5.02% (31.5K users)
Treatment: 5.64% (31.5K users)
Lift: +12.4% (p < 0.001, highly significant)

Outcome: Effect (12.4%) exceeded MDE (10%), test succeeded. Deployed badge to all users.


Example 3: Zomato — Restaurant Menu Redesign

Goal: Test new menu layout (grid vs list).

Inputs:

Baseline order rate: 8.0%
MDE: 3% relative lift (8.0% → 8.24%)
α = 0.01 (stricter, major redesign), Power = 0.90 (high power)

Calculator result:

  • Sample size per group: 312,540 users
  • Total sample size: 625,080 users
  • Duration: 31 days (20K daily users)

Decision: Test is long (31 days) but important decision (major redesign). Run test with close monitoring.

Actual Result (after 31 days):

Control: 7.98% (313K users)
Treatment: 7.92% (313K users)
Difference: −0.75% relative (p = 0.28, NOT significant)

Outcome: No significant difference detected. Keep old design (don't deploy). Avoided costly redesign that wouldn't improve business metric.

Lesson: High sample size + strict α (0.01) prevented false positive. If they had tested with only 10K users (underpowered), might have deployed based on noise.


Example 4: Startup with Limited Traffic

Goal: Test pricing change ($99 → $79).

Inputs:

Baseline conversion: 2.0%
MDE: 20% relative lift (2.0% → 2.4%)
α = 0.05, Power = 0.80

Calculator result:

  • Sample size per group: 6,227 users
  • Total: 12,454 users
  • Duration: 125 days (100 daily users)

Decision: 125 days is TOO LONG (seasonality, market changes). What to do?

Options:

Option A: Increase MDE to 50% (2.0% → 3.0%):

New sample size: 1,094 per group (2,188 total)
Duration: 22 days (more feasible)
Trade-off: Can only detect large effects (50%+ lift)

Option B: Use holdout test (deploy to 90%, keep 10% control):

Deploy $79 pricing to 90% of traffic (~90 users/day)
Keep $99 pricing for the 10% holdout control (~10 users/day)
Monitor for 90 days (~8,100 treatment users vs ~900 controls)
Can detect a 15-20% lift with this approach

Option C: Run survey + small test:

Survey 500 users: "Would you buy at $79?" (directional insight)
Run a 2-week A/B test (~1,400 users) with α = 0.10 (more lenient)
Combine qualitative + quantitative data for the decision

Decision: Option A (test 50% lift) is most rigorous. If $79 pricing doesn't increase conversion 50%+, it's not worth the revenue loss per transaction.


Factors That Affect Sample Size

1. Baseline Conversion Rate

Lower baseline = Larger sample needed

| Baseline Rate | Sample Size (10% relative lift, α=0.05, power=0.80) |
|---------------|-----------------------------------------------------|
| 1% | ~163,000 per group |
| 2% | ~80,700 per group |
| 5% | ~31,200 per group |
| 10% | ~14,800 per group |
| 20% | ~6,500 per group |

Why: Lower baseline rates have less "signal" (fewer conversions), need more data to detect changes.
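The baseline-rate effect can be recomputed directly from the formula given earlier; a sketch (the helper name is mine, and exact outputs depend on Z-score precision):

```python
import math
from statistics import NormalDist

def n_per_group(p1, relative_lift, alpha=0.05, power=0.80):
    """Per-group sample size for a relative lift over baseline p1."""
    z = NormalDist().inv_cdf
    p2 = p1 * (1 + relative_lift)
    p_hat = (p1 + p2) / 2
    z_sq = (z(1 - alpha / 2) + z(power)) ** 2
    return math.ceil(2 * z_sq * p_hat * (1 - p_hat) / (p2 - p1) ** 2)

for p1 in (0.01, 0.02, 0.05, 0.10, 0.20):
    print(f"{p1:.0%} baseline -> {n_per_group(p1, 0.10):,} per group")
```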


2. Minimum Detectable Effect (MDE)

Smaller effect = Exponentially larger sample needed

| MDE (Relative) | Baseline | Sample Size per Group |
|----------------|----------|-----------------------|
| 50% lift (5% → 7.5%) | 5% | ~1,500 |
| 20% lift (5% → 6%) | 5% | ~8,200 |
| 10% lift (5% → 5.5%) | 5% | ~31,200 |
| 5% lift (5% → 5.25%) | 5% | ~122,000 |
| 2% lift (5% → 5.1%) | 5% | ~753,000 |

Why: Small effects are buried in noise, need massive samples to separate signal from randomness.

Rule of thumb: if detecting a lift below ~10% would take more traffic than you have, don't run that test — an effect that small usually isn't worth chasing anyway.
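Because δ appears squared in the denominator of the formula, halving the detectable effect roughly quadruples the required sample. A quick check under the same assumptions as the table (helper name is mine):

```python
import math
from statistics import NormalDist

def n_per_group(p1, relative_lift, alpha=0.05, power=0.80):
    """Per-group sample size for a relative lift over baseline p1."""
    z = NormalDist().inv_cdf
    p2 = p1 * (1 + relative_lift)
    p_hat = (p1 + p2) / 2
    z_sq = (z(1 - alpha / 2) + z(power)) ** 2
    return math.ceil(2 * z_sq * p_hat * (1 - p_hat) / (p2 - p1) ** 2)

n_10 = n_per_group(0.05, 0.10)   # detect a 10% lift
n_05 = n_per_group(0.05, 0.05)   # detect a 5% lift: roughly 4x larger
print(n_05 / n_10)               # close to 4
```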


3. Significance Level (α)

Stricter α = Larger sample needed

| α | Confidence | Zα/2 | Sample Size Multiplier |
|---|------------|------|------------------------|
| 0.10 | 90% | 1.645 | ~0.79× (21% smaller) |
| 0.05 | 95% | 1.96 | 1.0× (baseline) |
| 0.01 | 99% | 2.576 | ~1.49× (49% larger) |

When to use stricter α:

  • High-stakes decisions (major redesign, pricing changes)
  • Irreversible changes (can't easily revert)
  • Multiple tests running (Bonferroni correction)

4. Statistical Power (1-β)

Higher power = Larger sample needed

| Power | β (False Negative Rate) | Zβ | Sample Size Multiplier |
|-------|-------------------------|------|------------------------|
| 0.70 | 30% | 0.52 | ~0.79× (21% smaller) |
| 0.80 | 20% | 0.84 | 1.0× (baseline) |
| 0.90 | 10% | 1.28 | ~1.34× (34% larger) |
| 0.95 | 5% | 1.645 | ~1.66× (66% larger) |

When to use higher power:

  • Low-traffic sites (can't afford false negatives)
  • Expensive tests (development costs high, must detect effects)
  • Scientific research (publication standards)
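The multipliers in the α and power tables both fall out of the (Zα/2 + Zβ)² term in the formula; a sketch computing them relative to the α = 0.05, power = 0.80 baseline (function name is mine):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf
baseline = (z(1 - 0.05 / 2) + z(0.80)) ** 2   # alpha = 0.05, power = 0.80

def multiplier(alpha=0.05, power=0.80):
    """Sample size relative to the alpha=0.05, power=0.80 baseline."""
    return (z(1 - alpha / 2) + z(power)) ** 2 / baseline

print(round(multiplier(alpha=0.01), 2))  # 1.49 -> stricter alpha, ~49% more users
print(round(multiplier(power=0.90), 2))  # 1.34 -> higher power, ~34% more users
```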

5. One-Tailed vs Two-Tailed Test

One-tailed = ~20% smaller sample needed

Two-tailed (standard): Zα/2 = 1.96 (α = 0.05)
One-tailed: Zα = 1.645 (α = 0.05)
Sample size ratio: ((1.645 + 0.84) / (1.96 + 0.84))² ≈ 0.79 (about 21% smaller — note Zβ stays in the sum, so the saving is smaller than a naive (1.645/1.96)² would suggest)
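A quick check of the one-tailed saving (Zβ remains in the sum of Z-scores, so only the α term changes):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf
z_beta = z(0.80)

two_tailed = (z(1 - 0.05 / 2) + z_beta) ** 2  # Z_alpha/2 = 1.96
one_tailed = (z(1 - 0.05) + z_beta) ** 2      # Z_alpha   = 1.645
print(round(one_tailed / two_tailed, 2))      # 0.79 -> about 21% smaller sample
```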

When to use one-tailed:

  • Pre-registered directional hypothesis ("Treatment is BETTER, not just different")
  • Certain treatment won't hurt (rare)
  • NOT to reduce sample size artificially (that's p-hacking)

Default: Use two-tailed (safer, allows detection of both positive and negative effects).
