A/B Test Sample Size Calculator — Plan Powerful Experiments

Running an A/B test without calculating sample size is like building a bridge without checking if it can hold weight. Use this calculator to ensure your experiments have enough statistical power.


Why Sample Size Calculation Matters

Sample size determines whether your A/B test can detect meaningful differences with statistical confidence.

The Problem: Underpowered Tests

Scenario: Flipkart wants to test a new checkout flow.

Without Sample Size Calculation:

"Let's test for 3 days and see what happens."

  • Day 3: Control 2.5% conversion (5,000 users); Treatment 2.7% conversion (5,000 users)
  • Difference: +0.2 percentage points (8% relative lift)
  • P-value: 0.18 (not significant)
  • Conclusion: "New checkout doesn't work" ❌

Reality: Test was underpowered (sample too small). The 8% lift might be REAL, but you can't detect it with only 5,000 users per group.


With Sample Size Calculation:

Inputs:

  • Baseline: 2.5% conversion
  • MDE: 8% relative lift (2.5% → 2.7%)
  • Significance: α = 0.05
  • Power: 80%

Calculator says: Need 62,340 users per group (124,680 total)

Run test properly:

  • Day 15: Control 2.5% (62K users); Treatment 2.7% (62K users)
  • P-value: 0.03 (significant!)
  • Conclusion: "New checkout increases conversion 8%" ✓

What Sample Size Calculation Tells You

  1. How many users needed — Per variant (e.g., 50,000 per group = 100,000 total)
  2. How long test will take — Days = (Sample size) / (Daily traffic × % in test)
  3. Whether test is feasible — If you need 1M users but only get 10K/day, test takes 100 days (too long)
Think of it this way...

Sample size is like survey margin of error. Polling 100 people gives ±10% margin (useless for close elections). Polling 1,000 gives ±3% margin (useful). Sample size calculation ensures your A/B test has enough "resolution" to detect small differences confidently.
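The polling analogy can be checked with the usual 1/√n rule of thumb for a 95% margin of error (a rough approximation for proportions near 50%, not the exact formula):

```python
import math

def margin_of_error(n):
    """Rough 95% margin of error for a proportion near 50%: ~1/sqrt(n)."""
    return 1 / math.sqrt(n)

print(round(margin_of_error(100), 3))    # 0.1   -> roughly +/-10%
print(round(margin_of_error(1_000), 3))  # 0.032 -> roughly +/-3%
```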


A/B Test Sample Size Calculator

Use this calculator to determine the required sample size for your A/B test.


How to Use This Calculator

Step 1: Enter Baseline Conversion Rate

  • Current conversion rate (before treatment)
  • Example: If 500 out of 10,000 users convert → 5% baseline

Step 2: Enter Minimum Detectable Effect (MDE)

  • Smallest change you care about detecting
  • Can enter as a relative % (e.g., a 10% lift: 5.0% → 5.5%) or an absolute difference (e.g., 0.5 percentage points: 5.0% → 5.5%)

Step 3: Set Significance Level (α)

  • Probability of false positive (claiming difference when there is none)
  • Standard: 0.05 (5% false positive rate)
  • Conservative: 0.01 (1% false positive rate, needs larger sample)

Step 4: Set Statistical Power

  • Probability of detecting real difference (if it exists)
  • Standard: 0.80 (80% power, 20% false negative rate)
  • High power: 0.90 (90% power, needs larger sample)

Step 5: Read Results

  • Sample size per variant: Users needed in EACH group (A and B)
  • Total sample size: Users needed across both groups
  • Test duration estimate: Days needed (if you input daily traffic)

The Formula Explained

Sample size calculation uses this formula (for two proportions):

Standard Formula

n = 2 × (Zα/2 + Zβ)² × p̂(1 − p̂) / δ²

Where:

  • n = sample size per group
  • Zα/2 = Z-score for significance level (1.96 for α = 0.05)
  • Zβ = Z-score for power (0.84 for 80% power)
  • p̂ = (p₁ + p₂) / 2 (pooled proportion)
  • δ = |p₂ − p₁| (absolute difference)
  • p₁ = baseline conversion rate
  • p₂ = expected conversion rate after treatment
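A minimal Python sketch of this formula, using the standard library's `statistics.NormalDist` for the Z-scores (the function name is mine, not from the calculator):

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """n = 2 * (Z_alpha/2 + Z_beta)^2 * p_hat * (1 - p_hat) / delta^2"""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_hat = (p1 + p2) / 2                          # pooled proportion
    delta = abs(p2 - p1)                           # absolute difference
    n = 2 * (z_alpha + z_beta) ** 2 * p_hat * (1 - p_hat) / delta ** 2
    return math.ceil(n)

# Worked example below: 4.0% baseline, 10% relative lift
print(sample_size_per_group(0.04, 0.044))  # about 39,476 per group
```

The result is slightly above the 39,437 in the worked example because the hand calculation uses rounded Z-scores (1.96 and 0.84) rather than the exact inverse-CDF values.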

Step-by-Step Example

Scenario: Swiggy wants to test "Free Delivery Above ₹99" banner.

Inputs:

  • Baseline conversion: 4.0%
  • Minimum detectable effect: 10% relative lift (4.0% → 4.4%)
  • Significance level: α = 0.05 (95% confidence)
  • Power: 0.80 (80% chance to detect effect)

Step 1: Convert Relative to Absolute Difference

p₁ = 0.04 (baseline)
Relative lift = 10%
p₂ = p₁ × (1 + 0.10) = 0.04 × 1.10 = 0.044
δ = p₂ − p₁ = 0.044 − 0.04 = 0.004 (absolute difference)

Step 2: Calculate Pooled Proportion

p̂ = (p₁ + p₂) / 2 = (0.04 + 0.044) / 2 = 0.042

Step 3: Look Up Z-Scores

Zα/2 = 1.96 (for α = 0.05, two-tailed)
Zβ = 0.84 (for power = 0.80)

Step 4: Calculate Sample Size

n = 2 × (1.96 + 0.84)² × 0.042 × (1 − 0.042) / 0.004²
  = 2 × (2.8)² × 0.042 × 0.958 / 0.000016
  = 2 × 7.84 × 0.0402 / 0.000016
  = 0.631 / 0.000016
  ≈ 39,437 users per group

Total sample size = 2 × 39,437 = 78,874 users

Step 5: Estimate Test Duration

If Swiggy gets 50,000 daily users:
Duration = 78,874 / 50,000 = 1.58 days ≈ 2 days

If a smaller site gets 5,000 daily users:
Duration = 78,874 / 5,000 = 15.8 days ≈ 16 days
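The five steps can be verified in a few lines, using the same rounded Z-scores (1.96 and 0.84) as the text; variable names are mine:

```python
import math

p1 = 0.04                       # Step 1: baseline
p2 = p1 * 1.10                  # 10% relative lift -> 0.044
delta = p2 - p1                 # 0.004 absolute difference
p_hat = (p1 + p2) / 2           # Step 2: pooled proportion = 0.042
z_sq = (1.96 + 0.84) ** 2       # Step 3: (Z_alpha/2 + Z_beta)^2 = 7.84

n = 2 * z_sq * p_hat * (1 - p_hat) / delta ** 2   # Step 4
total = 2 * n

print(round(n))                   # 39431 -- the text's 39,437 rounds 0.631 first
print(math.ceil(total / 50_000))  # Step 5: 2 days at 50,000 daily users
```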

Z-Score Table

| Significance Level (α) | Confidence | Zα/2 |
|------------------------|------------|-------|
| 0.10 | 90% | 1.645 |
| 0.05 | 95% | 1.96 |
| 0.01 | 99% | 2.576 |

| Power (1−β) | Zβ |
|-------------|-------|
| 0.70 | 0.52 |
| 0.80 | 0.84 |
| 0.90 | 1.28 |
| 0.95 | 1.645 |
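Both tables can be reproduced with the standard library's inverse normal CDF (`statistics.NormalDist.inv_cdf`), so no lookup table is strictly needed:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf

# Z_alpha/2 for a two-tailed test: inverse CDF at 1 - alpha/2
for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(z(1 - alpha / 2), 3))   # 1.645, 1.96, 2.576

# Z_beta for a given power: inverse CDF at the power itself
for power in (0.70, 0.80, 0.90, 0.95):
    print(power, round(z(power), 3))           # 0.524, 0.842, 1.282, 1.645
```

The power table above rounds most entries to two decimals (0.52, 0.84, 1.28).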


Quick Rule: For typical A/B tests (5% baseline, 10% relative lift, α=0.05, power=0.80), you need roughly 30,000 users per group. Lower baseline rates or smaller effects require larger samples.



Real-World Sample Size Examples

Example 1: Flipkart — Product Image Hover Zoom

Goal: Test hover-to-zoom vs click-to-enlarge product images.

Inputs:

Baseline conversion rate: 3.5%
Minimum detectable effect: 5% relative lift (3.5% → 3.675%)
Significance level: α = 0.05
Power: 0.80

Calculator result:

  • Sample size per group: 127,680 users
  • Total sample size: 255,360 users
  • Estimated duration: 5 days (≈50K daily users entering the test)

Decision: Test is feasible (5 days is reasonable). Run A/B test.

Actual Result:

Control: 3.48% conversion (128K users)
Treatment: 3.71% conversion (128K users)
Lift: +6.6% (p = 0.002, significant)

Outcome: Detected effect (6.6%) was larger than MDE (5%), test succeeded. Deployed hover-zoom → ₹300Cr additional annual revenue.


Example 2: Swiggy — Delivery Time Promise Badge

Goal: Test "Delivers in 30 min" badge vs no badge.

Inputs:

Baseline order rate: 5.0%
MDE: 10% relative lift (5.0% → 5.5%)
α = 0.05, Power = 0.80

Calculator result:

  • Sample size per group: 31,376 users
  • Total sample size: 62,752 users
  • Duration: 1.3 days (50K daily users)

Decision: Test is very feasible (1-2 days). Run test.

Actual Result:

Control: 5.02% (31.5K users)
Treatment: 5.64% (31.5K users)
Lift: +12.4% (p < 0.001, highly significant)

Outcome: Effect (12.4%) exceeded MDE (10%), test succeeded. Deployed badge to all users.


Example 3: Zomato — Restaurant Menu Redesign

Goal: Test new menu layout (grid vs list).

Inputs:

Baseline order rate: 8.0%
MDE: 3% relative lift (8.0% → 8.24%)
α = 0.01 (stricter, major redesign), Power = 0.90 (high power)

Calculator result:

  • Sample size per group: 312,540 users
  • Total sample size: 625,080 users
  • Duration: 31 days (20K daily users)

Decision: Test is long (31 days) but important decision (major redesign). Run test with close monitoring.

Actual Result (after 31 days):

Control: 7.98% (313K users)
Treatment: 7.92% (313K users)
Difference: −0.75% relative (p = 0.28, NOT significant)

Outcome: No significant difference detected. Keep old design (don't deploy). Avoided costly redesign that wouldn't improve business metric.

Lesson: High sample size + strict α (0.01) prevented false positive. If they had tested with only 10K users (underpowered), might have deployed based on noise.


Example 4: Startup with Limited Traffic

Goal: Test pricing change ($99 → $79).

Inputs:

Baseline conversion: 2.0%
MDE: 20% relative lift (2.0% → 2.4%)
α = 0.05, Power = 0.80

Calculator result:

  • Sample size per group: 6,227 users
  • Total: 12,454 users
  • Duration: 125 days (100 daily users)

Decision: 125 days is TOO LONG (seasonality, market changes). What to do?

Options:

Option A: Increase MDE to 50% (2.0% → 3.0%):

New sample size: 1,094 per group (2,188 total)
Duration: 22 days (more feasible)
Trade-off: Can only detect large effects (50%+ lift)

Option B: Use holdout test (deploy to 90%, keep 10% control):

Deploy $79 pricing to 90% of traffic (~90 users/day)
Keep $99 pricing for the 10% holdout control (~10 users/day)
Monitor for 90 days (~8,100 treatment users vs ~900 controls)
Can detect a 15-20% lift with this approach

Option C: Run survey + small test:

Survey 500 users: "Would you buy at $79?" (directional insight)
Run a 2-week A/B test (~1,400 users) with α = 0.10 (more lenient)
Combine qualitative + quantitative data for the decision

Decision: Option A (test 50% lift) is most rigorous. If $79 pricing doesn't increase conversion 50%+, it's not worth the revenue loss per transaction.


Factors That Affect Sample Size

1. Baseline Conversion Rate

Lower baseline = Larger sample needed

| Baseline Rate | Sample Size (10% relative lift, α=0.05, power=0.80) |
|---------------|-----------------------------------------------------|
| 1% | ~163,000 per group |
| 2% | ~80,700 per group |
| 5% | ~31,200 per group |
| 10% | ~14,800 per group |
| 20% | ~6,500 per group |

Why: Lower baseline rates have less "signal" (fewer conversions), need more data to detect changes.
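The baseline-rate effect can be recomputed directly from the formula given earlier; a sketch (the helper name is mine, and exact outputs depend on Z-score precision):

```python
import math
from statistics import NormalDist

def n_per_group(p1, relative_lift, alpha=0.05, power=0.80):
    """Per-group sample size for a relative lift over baseline p1."""
    z = NormalDist().inv_cdf
    p2 = p1 * (1 + relative_lift)
    p_hat = (p1 + p2) / 2
    z_sq = (z(1 - alpha / 2) + z(power)) ** 2
    return math.ceil(2 * z_sq * p_hat * (1 - p_hat) / (p2 - p1) ** 2)

for p1 in (0.01, 0.02, 0.05, 0.10, 0.20):
    print(f"{p1:.0%} baseline -> {n_per_group(p1, 0.10):,} per group")
```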


2. Minimum Detectable Effect (MDE)

Smaller effect = Exponentially larger sample needed

| MDE (Relative) | Baseline | Sample Size per Group |
|----------------|----------|-----------------------|
| 50% lift (5% → 7.5%) | 5% | ~1,500 |
| 20% lift (5% → 6%) | 5% | ~8,200 |
| 10% lift (5% → 5.5%) | 5% | ~31,200 |
| 5% lift (5% → 5.25%) | 5% | ~122,000 |
| 2% lift (5% → 5.1%) | 5% | ~753,000 |

Why: Small effects are buried in noise, need massive samples to separate signal from randomness.

Rule of thumb: if detecting a lift below ~10% would take more traffic than you have, don't run that test — an effect that small usually isn't worth chasing anyway.
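Because δ appears squared in the denominator of the formula, halving the detectable effect roughly quadruples the required sample. A quick check under the same assumptions as the table (helper name is mine):

```python
import math
from statistics import NormalDist

def n_per_group(p1, relative_lift, alpha=0.05, power=0.80):
    """Per-group sample size for a relative lift over baseline p1."""
    z = NormalDist().inv_cdf
    p2 = p1 * (1 + relative_lift)
    p_hat = (p1 + p2) / 2
    z_sq = (z(1 - alpha / 2) + z(power)) ** 2
    return math.ceil(2 * z_sq * p_hat * (1 - p_hat) / (p2 - p1) ** 2)

n_10 = n_per_group(0.05, 0.10)   # detect a 10% lift
n_05 = n_per_group(0.05, 0.05)   # detect a 5% lift: roughly 4x larger
print(n_05 / n_10)               # close to 4
```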


3. Significance Level (α)

Stricter α = Larger sample needed

| α | Confidence | Zα/2 | Sample Size Multiplier |
|---|------------|------|------------------------|
| 0.10 | 90% | 1.645 | ~0.79× (21% smaller) |
| 0.05 | 95% | 1.96 | 1.0× (baseline) |
| 0.01 | 99% | 2.576 | ~1.49× (49% larger) |

When to use stricter α:

  • High-stakes decisions (major redesign, pricing changes)
  • Irreversible changes (can't easily revert)
  • Multiple tests running (Bonferroni correction)

4. Statistical Power (1-β)

Higher power = Larger sample needed

| Power | β (False Negative Rate) | Zβ | Sample Size Multiplier |
|-------|-------------------------|------|------------------------|
| 0.70 | 30% | 0.52 | ~0.79× (21% smaller) |
| 0.80 | 20% | 0.84 | 1.0× (baseline) |
| 0.90 | 10% | 1.28 | ~1.34× (34% larger) |
| 0.95 | 5% | 1.645 | ~1.66× (66% larger) |

When to use higher power:

  • Low-traffic sites (can't afford false negatives)
  • Expensive tests (development costs high, must detect effects)
  • Scientific research (publication standards)
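The multipliers in the α and power tables both fall out of the (Zα/2 + Zβ)² term in the formula; a sketch computing them relative to the α = 0.05, power = 0.80 baseline (function name is mine):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf
baseline = (z(1 - 0.05 / 2) + z(0.80)) ** 2   # alpha = 0.05, power = 0.80

def multiplier(alpha=0.05, power=0.80):
    """Sample size relative to the alpha=0.05, power=0.80 baseline."""
    return (z(1 - alpha / 2) + z(power)) ** 2 / baseline

print(round(multiplier(alpha=0.01), 2))  # 1.49 -> stricter alpha, ~49% more users
print(round(multiplier(power=0.90), 2))  # 1.34 -> higher power, ~34% more users
```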

5. One-Tailed vs Two-Tailed Test

One-tailed = ~20% smaller sample needed

Two-tailed (standard): Zα/2 = 1.96 (α = 0.05)
One-tailed: Zα = 1.645 (α = 0.05)
Sample size ratio: ((1.645 + 0.84) / (1.96 + 0.84))² ≈ 0.79 (about 21% smaller — note Zβ stays in the sum, so the saving is smaller than a naive (1.645/1.96)² would suggest)
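A quick check of the one-tailed saving (Zβ remains in the sum of Z-scores, so only the α term changes):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf
z_beta = z(0.80)

two_tailed = (z(1 - 0.05 / 2) + z_beta) ** 2  # Z_alpha/2 = 1.96
one_tailed = (z(1 - 0.05) + z_beta) ** 2      # Z_alpha   = 1.645
print(round(one_tailed / two_tailed, 2))      # 0.79 -> about 21% smaller sample
```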

When to use one-tailed:

  • Pre-registered directional hypothesis ("Treatment is BETTER, not just different")
  • Certain treatment won't hurt (rare)
  • NOT to reduce sample size artificially (that's p-hacking)

Default: Use two-tailed (safer, allows detection of both positive and negative effects).
