Why Sample Size Calculation Matters
Sample size determines whether your A/B test can detect meaningful differences with statistical confidence.
The Problem: Underpowered Tests
Scenario: Flipkart wants to test a new checkout flow.
Without Sample Size Calculation:
"Let's test for 3 days and see what happens"
Day 3: Control 2.5% conversion (5,000 users), Treatment 2.7% conversion (5,000 users)
Difference: +0.2% (8% relative lift)
P-value: 0.53 (not significant)
Conclusion: "New checkout doesn't work" ❌
Reality: Test was underpowered (sample too small). The 8% lift might be REAL, but you can't detect it with only 5,000 users per group.
With Sample Size Calculation:
Inputs:
- Baseline: 2.5% conversion
- MDE: 8% relative lift (2.5% → 2.7%)
- Significance: α = 0.05
- Power: 80%
Calculator says: Need 99,271 users per group (198,542 total)
Run test properly:
Day 15: Control 2.5% (99K users), Treatment 2.7% (99K users)
P-value: 0.005 (significant!)
Conclusion: "New checkout increases conversion 8%" ✓
What Sample Size Calculation Tells You
- How many users needed — Per variant (e.g., 50,000 per group = 100,000 total)
- How long test will take — Days = (Sample size) / (Daily traffic × % in test)
- Whether test is feasible — If you need 1M users but only get 10K/day, test takes 100 days (too long)
Sample size is like survey margin of error. Polling 100 people gives ±10% margin (useless for close elections). Polling 1,000 gives ±3% margin (useful). Sample size calculation ensures your A/B test has enough "resolution" to detect small differences confidently.
A/B Test Sample Size Calculator
Use this calculator to determine required sample size for your A/B test.
How to Use This Calculator
Step 1: Enter Baseline Conversion Rate
- Current conversion rate (before treatment)
- Example: If 500 out of 10,000 users convert → 5% baseline
Step 2: Enter Minimum Detectable Effect (MDE)
- Smallest change you care about detecting
- Can enter as relative % (e.g., 10% = 5.0% → 5.5%) or absolute % (e.g., 0.5% = 5.0% → 5.5%)
Step 3: Set Significance Level (α)
- Probability of false positive (claiming difference when there is none)
- Standard: 0.05 (5% false positive rate)
- Conservative: 0.01 (1% false positive rate, needs larger sample)
Step 4: Set Statistical Power
- Probability of detecting real difference (if it exists)
- Standard: 0.80 (80% power, 20% false negative rate)
- High power: 0.90 (90% power, needs larger sample)
Step 5: Read Results
- Sample size per variant: Users needed in EACH group (A and B)
- Total sample size: Users needed across both groups
- Test duration estimate: Days needed (if you input daily traffic)
The Formula Explained
Sample size calculation uses this formula (for two proportions):
Standard Formula
n = 2 × (Zα/2 + Zβ)² × p̂(1 - p̂) / δ²
Where:
n = Sample size per group
Zα/2 = Z-score for significance level (1.96 for α = 0.05)
Zβ = Z-score for power (0.84 for 80% power)
p̂ = (p₁ + p₂) / 2 (pooled proportion)
δ = |p₁ - p₂| (absolute difference)
p₁ = Baseline conversion rate
p₂ = Expected conversion rate after treatment
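This formula drops straight into code. A minimal sketch (the function name is illustrative), using Python's standard-library `statistics.NormalDist` to look up exact z-scores instead of rounded table values:

```python
import math
from statistics import NormalDist  # Python 3.8+ standard library

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-proportion, two-tailed test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    p_bar = (p1 + p2) / 2                          # pooled proportion
    delta = abs(p2 - p1)                           # absolute difference
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / delta ** 2
    return math.ceil(n)

print(sample_size_per_group(0.04, 0.044))
```

With exact z-scores (1.9600, 0.8416) this returns 39,476 for the 4.0% → 4.4% case, slightly above a hand calculation that uses the rounded values 1.96 and 0.84.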
Step-by-Step Example
Scenario: Swiggy wants to test "Free Delivery Above ₹99" banner.
Inputs:
- Baseline conversion: 4.0%
- Minimum detectable effect: 10% relative lift (4.0% → 4.4%)
- Significance level: α = 0.05 (95% confidence)
- Power: 0.80 (80% chance to detect effect)
Step 1: Convert Relative to Absolute Difference
p₁ = 0.04 (baseline)
Relative lift = 10%
p₂ = p₁ × (1 + 0.10) = 0.04 × 1.10 = 0.044
δ = p₂ - p₁ = 0.044 - 0.04 = 0.004 (absolute difference)
Step 2: Calculate Pooled Proportion
p̂ = (p₁ + p₂) / 2
= (0.04 + 0.044) / 2
= 0.042
Step 3: Look Up Z-Scores
Zα/2 = 1.96 (for α = 0.05, two-tailed)
Zβ = 0.84 (for power = 0.80)
Step 4: Calculate Sample Size
n = 2 × (1.96 + 0.84)² × 0.042 × (1 - 0.042) / 0.004²
= 2 × 7.84 × 0.042 × 0.958 / 0.000016
= 15.68 × 0.040236 / 0.000016
= 0.6309 / 0.000016
= 39,432 users per group (rounded up)
Total sample size = 2 × 39,432 = 78,864 users
Step 5: Estimate Test Duration
If Swiggy gets 50,000 daily users:
Duration = 78,864 / 50,000 = 1.58 days ≈ 2 days
If smaller site with 5,000 daily users:
Duration = 78,864 / 5,000 = 15.8 days ≈ 16 days
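Steps 1-4 can be verified in a few lines of Python, reusing the same rounded z-scores (1.96 and 0.84) as the hand calculation:

```python
import math

p1, relative_lift = 0.04, 0.10
p2 = p1 * (1 + relative_lift)   # Step 1: expected rate after treatment
delta = p2 - p1                 # absolute difference
p_bar = (p1 + p2) / 2           # Step 2: pooled proportion
z_alpha, z_beta = 1.96, 0.84    # Step 3: rounded z-scores

# Step 4: plug into the formula
n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / delta ** 2
print(math.ceil(n), 2 * math.ceil(n))  # per group, total
```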
Z-Score Table
| Significance Level (α) | Confidence | Zα/2 |
|------------------------|------------|------|
| 0.10 | 90% | 1.645 |
| 0.05 | 95% | 1.96 |
| 0.01 | 99% | 2.576 |

| Power (1-β) | Zβ |
|-------------|-----|
| 0.70 | 0.52 |
| 0.80 | 0.84 |
| 0.90 | 1.28 |
| 0.95 | 1.645 |
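These table values are just inverse-CDF lookups on the standard normal distribution; Python's standard library reproduces them without SciPy:

```python
from statistics import NormalDist  # Python 3.8+ standard library

nd = NormalDist()  # standard normal (mean 0, sd 1)

z_alpha_two_tailed = nd.inv_cdf(1 - 0.05 / 2)  # alpha = 0.05, two-tailed
z_beta = nd.inv_cdf(0.80)                      # power = 0.80
print(round(z_alpha_two_tailed, 3), round(z_beta, 3))
```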
Quick Rule: For typical A/B tests (5% baseline, 10% relative lift, α=0.05, power=0.80), you need roughly 31,000 users per group. Lower baseline rates or smaller effects require larger samples.
Real-World Sample Size Examples
Example 1: Flipkart — Product Image Hover Zoom
Goal: Test hover-to-zoom vs click-to-enlarge product images.
Inputs:
Baseline conversion rate: 3.5%
Minimum detectable effect: 5% relative lift (3.5% → 3.675%)
Significance level: α = 0.05
Power: 0.80
Calculator result:
- Sample size per group: 177,091 users
- Total sample size: 354,182 users
- Estimated duration: 4 days (50K daily users per group)
Decision: Test is feasible (4 days is reasonable). Run A/B test.
Actual Result:
Control: 3.48% conversion (177K users)
Treatment: 3.71% conversion (177K users)
Lift: +6.6% (p < 0.001, significant)
Outcome: Detected effect (6.6%) was larger than MDE (5%), test succeeded. Deployed hover-zoom → ₹300Cr additional annual revenue.
Example 2: Swiggy — Delivery Time Promise Badge
Goal: Test "Delivers in 30 min" badge vs no badge.
Inputs:
Baseline order rate: 5.0%
MDE: 10% relative lift (5.0% → 5.5%)
α = 0.05, Power = 0.80
Calculator result:
- Sample size per group: 31,376 users
- Total sample size: 62,752 users
- Duration: 1.3 days (50K daily users)
Decision: Test is very feasible (1-2 days). Run test.
Actual Result:
Control: 5.02% (31.5K users)
Treatment: 5.64% (31.5K users)
Lift: +12.4% (p < 0.001, highly significant)
Outcome: Effect (12.4%) exceeded MDE (10%), test succeeded. Deployed badge to all users.
Example 3: Zomato — Restaurant Menu Redesign
Goal: Test new menu layout (grid vs list).
Inputs:
Baseline order rate: 8.0%
MDE: 3% relative lift (8.0% → 8.24%)
α = 0.01 (stricter, major redesign), Power = 0.90 (high power)
Calculator result:
- Sample size per group: 385,176 users
- Total sample size: 770,352 users
- Duration: 39 days (20K daily users)
Decision: Test is long (39 days), but the decision is important (major redesign). Run test with close monitoring.
Actual Result (after 39 days):
Control: 7.98% (385K users)
Treatment: 7.92% (385K users)
Difference: -0.75% relative (p = 0.33, NOT significant)
Outcome: No significant difference detected. Keep old design (don't deploy). Avoided costly redesign that wouldn't improve business metric.
Lesson: The large sample plus strict α (0.01) protected against a false positive. With only 10K users, random noise could easily have produced an apparent lift, and the team might have deployed a redesign that doesn't actually help.
Example 4: Startup with Limited Traffic
Goal: Test pricing change ($99 → $79).
Inputs:
Baseline conversion: 2.0%
MDE: 20% relative lift (2.0% → 2.4%)
α = 0.05, Power = 0.80
Calculator result:
- Sample size per group: 21,086 users
- Total: 42,172 users
- Duration: 422 days (100 daily users)
Decision: 422 days is FAR TOO LONG (seasonality, market changes, product drift). What to do?
Options:
Option A: Increase MDE to 50% (2.0% → 3.0%):
New sample size: 3,822 per group (7,644 total)
Duration: ~77 days (long, but far more feasible)
Trade-off: Can only detect large effects (50%+ lift)
Option B: Use holdout test (deploy to 90%, keep 10% control):
Deploy $79 to 90% of traffic (90 users/day)
Keep $99 pricing for 10% (10 users/day control)
Monitor for several months as a long-term guardrail
Trade-off: With only ~10 control users/day, this detects only very large effects; treat it as a safety check, not a precise test
Option C: Run survey + small test:
Survey 500 users: "Would you buy at $79?" (directional insight)
Run a 2-week A/B test (~1,400 users) with α = 0.10 (directional evidence only)
Combine qualitative + quantitative data for decision
Decision: Option A (test for a 50% lift) is most rigorous. Cutting the price from $99 to $79 reduces revenue per sale by ~20%, so conversion must rise roughly 25% (99/79 ≈ 1.25) just to break even. If the test can't demonstrate a 50% lift, the price cut isn't worth the revenue loss per transaction.
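The MDE trade-off in Option A can be reproduced with the standard two-proportion formula and rounded z-scores. A sketch (the helper name is illustrative; numbers apply to this hypothetical startup):

```python
import math

def n_per_group(p1, p2, z_sum=1.96 + 0.84):  # alpha = 0.05, power = 0.80
    """Per-variant sample size via 2 * z_sum^2 * p_bar * (1 - p_bar) / delta^2."""
    p_bar = (p1 + p2) / 2
    return math.ceil(2 * z_sum ** 2 * p_bar * (1 - p_bar) / (p2 - p1) ** 2)

daily_users = 100
for p2 in (0.024, 0.03):  # MDE 20% vs MDE 50% on a 2.0% baseline
    n = n_per_group(0.02, p2)
    days = math.ceil(2 * n / daily_users)
    print(f"target {p2:.1%}: {n:,} per group, ~{days} days at {daily_users} users/day")
```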
Factors That Affect Sample Size
1. Baseline Conversion Rate
Lower baseline = Larger sample needed
| Baseline Rate | Sample Size (10% relative lift, α=0.05, power=0.80) |
|---------------|-----------------------------------------------------|
| 1% | ~163,000 per group |
| 2% | ~80,600 per group |
| 5% | ~31,200 per group |
| 10% | ~14,700 per group |
| 20% | ~6,500 per group |
Why: Lower baseline rates have less "signal" (fewer conversions), need more data to detect changes.
2. Minimum Detectable Effect (MDE)
Smaller effect = Exponentially larger sample needed
| MDE (Relative) | Change (from 5% baseline) | Sample Size per Group |
|----------------|---------------------------|-----------------------|
| 50% lift | 5% → 7.5% | ~1,470 |
| 20% lift | 5% → 6% | ~8,150 |
| 10% lift | 5% → 5.5% | ~31,200 |
| 5% lift | 5% → 5.25% | ~122,000 |
| 2% lift | 5% → 5.1% | ~752,000 |
Why: Small effects are buried in noise, need massive samples to separate signal from randomness.
Rule: If the smallest lift your available traffic can detect is far larger than any lift you realistically expect, the test isn't worth running.
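The blow-up as MDE shrinks follows from the δ² in the denominator of the formula. A quick sweep (rounded z-scores; the helper name is illustrative):

```python
import math

def n_per_group(p1, p2, z_sum=1.96 + 0.84):  # alpha = 0.05, power = 0.80
    """Per-variant sample size via 2 * z_sum^2 * p_bar * (1 - p_bar) / delta^2."""
    p_bar = (p1 + p2) / 2
    return math.ceil(2 * z_sum ** 2 * p_bar * (1 - p_bar) / (p2 - p1) ** 2)

baseline = 0.05
for lift in (0.50, 0.20, 0.10, 0.05, 0.02):
    n = n_per_group(baseline, baseline * (1 + lift))
    print(f"{lift:.0%} lift: {n:,} per group")
```

Halving the MDE roughly quadruples the required sample, since sample size scales with 1/δ².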
3. Significance Level (α)
Stricter α = Larger sample needed
| α | Confidence | Zα/2 | Sample Size Multiplier |
|---|------------|------|------------------------|
| 0.10 | 90% | 1.645 | 0.79× (smaller sample) |
| 0.05 | 95% | 1.96 | 1.0× (baseline) |
| 0.01 | 99% | 2.576 | 1.49× (≈50% larger) |

(Multipliers computed at power = 0.80.)
When to use stricter α:
- High-stakes decisions (major redesign, pricing changes)
- Irreversible changes (can't easily revert)
- Multiple tests running (Bonferroni correction)
4. Statistical Power (1-β)
Higher power = Larger sample needed
| Power | β (False Negative Rate) | Zβ | Sample Size Multiplier |
|-------|-------------------------|-----|------------------------|
| 0.70 | 30% | 0.52 | 0.78× (smaller) |
| 0.80 | 20% | 0.84 | 1.0× (baseline) |
| 0.90 | 10% | 1.28 | 1.34× (34% larger) |
| 0.95 | 5% | 1.645 | 1.66× (66% larger) |

(Multipliers computed at α = 0.05.)
When to use higher power:
- Low-traffic sites (can't afford false negatives)
- Expensive tests (development costs high, must detect effects)
- Scientific research (publication standards)
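Each multiplier is just a ratio of squared z-score sums, so it can be derived rather than memorized. A sketch at α = 0.05 (names are illustrative):

```python
z_alpha = 1.96           # two-tailed, alpha = 0.05
z_beta_baseline = 0.84   # power = 0.80 reference point

def multiplier(z_beta):
    """Sample size relative to the power = 0.80 baseline."""
    return ((z_alpha + z_beta) / (z_alpha + z_beta_baseline)) ** 2

for power, z_b in [(0.70, 0.52), (0.80, 0.84), (0.90, 1.28), (0.95, 1.645)]:
    print(f"power {power:.2f}: {multiplier(z_b):.2f}x sample size")
```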
5. One-Tailed vs Two-Tailed Test
One-tailed = ~20% smaller sample needed
Two-tailed (standard): Zα/2 = 1.96 (α = 0.05)
One-tailed: Zα = 1.645 (α = 0.05)
Sample size ratio: ((1.645 + 0.84) / (1.96 + 0.84))² = 0.79 (about 21% smaller at power = 0.80)
When to use one-tailed:
- Pre-registered directional hypothesis ("Treatment is BETTER, not just different")
- Certain treatment won't hurt (rare)
- NOT to reduce sample size artificially (that's p-hacking)
Default: Use two-tailed (safer, allows detection of both positive and negative effects).
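Note that the reduction keeps Zβ in both the numerator and the denominator. A quick check at α = 0.05, power = 0.80:

```python
z_two_tailed = 1.96   # Z(alpha/2) for alpha = 0.05
z_one_tailed = 1.645  # Z(alpha) for alpha = 0.05
z_beta = 0.84         # power = 0.80

ratio = ((z_one_tailed + z_beta) / (z_two_tailed + z_beta)) ** 2
print(f"one-tailed needs {ratio:.0%} of the two-tailed sample")
```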