Why Data Analysts Need Statistics
Statistics is the science of collecting, analyzing, and interpreting data. As a data analyst, statistics is your superpower — it helps you answer questions, test hypotheses, and make predictions from messy real-world data.
What Statistics Does for Analysts
1. Summarize Data (Descriptive Statistics)
- Take 10,000 customer orders → Summarize as "Average order value: ₹1,250"
- Reduce complexity: Turn millions of data points into a few key metrics
- Communicate clearly: "Sales increased 15%" is better than showing raw numbers
2. Make Inferences (Inferential Statistics)
- Survey 1,000 customers → Predict what all 10 million customers think
- A/B test 5,000 users → Decide which website version to show all users
- Estimate the unknown: You can't measure everyone, so you sample and infer
3. Quantify Uncertainty
- "We're 95% confident sales will be between ₹50L and ₹60L this month"
- Statistics tells you HOW SURE you can be, not just WHAT you found
- Confidence intervals, p-values, significance — all about quantifying uncertainty
4. Prove (or Disprove) Hypotheses
- "Does adding free shipping increase conversions?"
- Run A/B test → Use statistics to say "Yes, with 99% confidence" or "No significant difference"
- Avoid false conclusions: Prevents you from seeing patterns in random noise
Statistics in Daily Analyst Work
Example 1: Swiggy Delivery Times
- Question: Are delivery times longer on weekends?
- Data: 100,000 orders (delivery time in minutes)
- Statistics Used:
- Mean delivery time (weekday vs weekend)
- T-test: Is the difference statistically significant? (Not just random variation)
- Result: "Weekend deliveries are 7 minutes slower (p < 0.001) — significant"
Example 2: Flipkart Pricing Experiment
- Question: Does showing "Limited Stock" badge increase purchases?
- Data: 50,000 users (25K control, 25K treatment)
- Statistics Used:
- Conversion rate (control: 2.3%, treatment: 2.8%)
- Z-test: Is the 0.5 percentage-point difference real, or just luck?
- Result: "21% increase, statistically significant — roll out to all users"
Example 3: Zomato Restaurant Rating
- Question: Is 4.2★ rating reliably better than 4.0★?
- Data: Restaurant A (4.2★, 50 reviews), Restaurant B (4.0★, 5,000 reviews)
- Statistics Used:
- Standard error (how uncertain is each rating?)
- Restaurant B's 4.0★ is MORE RELIABLE (more samples = less uncertainty)
- Result: "Restaurant B safer bet — larger sample size"
Statistics is like weather forecasting. Meteorologists can't measure temperature at every square meter, so they sample data, use statistics to model patterns, and give you a prediction with confidence levels ("80% chance of rain"). Data analysts do the same: sample data, find patterns, make predictions with known uncertainty.
Two Types of Statistics: Descriptive vs Inferential
Statistics divides into two branches — one describes what you have, the other predicts what you don't.
Descriptive Statistics (What IS)
Purpose: Summarize and describe data you already have.
Common Techniques:
- Measures of central tendency: Mean, median, mode (typical value)
- Measures of spread: Standard deviation, variance, range (how varied data is)
- Frequency distributions: Histograms, bar charts (how data is distributed)
- Summary tables: Count, sum, min, max, percentiles
Example — E-commerce Sales Analysis:
Dataset: 10,000 orders from January 2025
Descriptive Statistics:
- Total sales: ₹1.25 crore
- Average order value: ₹1,250
- Median order value: ₹980 (half of orders are above/below this)
- Standard deviation: ₹540 (typical variation from average)
- Minimum order: ₹150 (someone bought a phone case)
- Maximum order: ₹85,000 (someone bought a laptop + accessories)
- 95th percentile: ₹3,200 (95% of orders are below this amount)
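All of these summaries come straight from pandas. A minimal sketch with a tiny made-up sample (so the outputs won't match the figures above; the method is the point):

```python
import pandas as pd

# Hypothetical order values (₹) — a tiny stand-in for the 10,000-order dataset
orders = pd.Series([150, 800, 980, 1250, 1500, 2100, 3200, 85000],
                   name="order_value")

print("Total sales:", orders.sum())
print("Mean:", orders.mean())              # pulled up by the ₹85,000 outlier
print("Median:", orders.median())          # robust to the outlier
print("Std dev:", round(orders.std(), 2))
print("Min:", orders.min(), "Max:", orders.max())
print("95th percentile:", orders.quantile(0.95))
```

Note how the outlier drags the mean far above the median — the same effect as in the January dataset above.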
When to Use:
- Exploring new datasets (EDA — exploratory data analysis)
- Creating dashboards (KPIs are descriptive stats)
- Communicating to stakeholders ("Here's what happened last month")
Inferential Statistics (What COULD BE)
Purpose: Make predictions or conclusions about a larger population based on a sample.
Common Techniques:
- Hypothesis testing: T-tests, chi-square tests (is difference real or random?)
- Confidence intervals: "True average is between ₹1,200 and ₹1,300 (95% confidence)"
- Regression analysis: Predict sales based on ad spend
- A/B testing: Which website version performs better?
Example — Customer Survey:
Population: 10 million Flipkart customers
Sample: Survey 2,000 customers about new feature
Results:
- 68% like new feature (in sample)
- 95% Confidence Interval: [66%, 70%]
- Inference: "Between 66% and 70% of ALL 10M customers likely approve"
- Margin of error: ±2% (with 95% confidence)
Action: If target was 60% approval, you've exceeded it — launch feature
When to Use:
- A/B tests (sample of users → decision for all users)
- Market research (survey 1,000 people → predict city-wide behavior)
- Quality control (test 100 products → estimate defect rate in 1M production run)
- Forecasting (historical data → predict future trends)
Key Difference: Sample vs Population
| Aspect | Descriptive | Inferential |
|--------|-------------|-------------|
| Data Scope | Entire dataset | Sample of larger population |
| Goal | Summarize what you have | Predict what you don't have |
| Uncertainty | No uncertainty (exact values) | Quantifies uncertainty (confidence intervals) |
| Questions | "What IS the average?" | "What WILL BE the average?" |
| Example | Last month's average revenue | Next month's predicted revenue |
In practice, you use BOTH. Start with descriptive statistics (explore data), then use inferential statistics (make decisions). Descriptive tells you "what happened," inferential tells you "what to do next."
Essential Statistical Concepts for Analysts
Here are 7 concepts you'll use daily as a data analyst.
1. Mean, Median, Mode (Central Tendency)
Where you see this: Every dashboard, every summary table.
- Mean (average): Sum all values ÷ count
- Order values: ₹100, ₹150, ₹200, ₹500, ₹10,000 → Mean = ₹2,190 (skewed by outlier)
- Median (middle value): Sort data, pick middle value
- Same data → Median = ₹200 (more representative when outliers exist)
- Mode (most common): Value that appears most frequently
- Shoe sizes: 7, 7, 8, 8, 8, 9, 10 → Mode = 8 (most sold size)
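The three bullets above can be reproduced with Python's built-in statistics module, using the same numbers:

```python
import statistics

order_values = [100, 150, 200, 500, 10000]   # ₹, one extreme outlier
print(statistics.mean(order_values))    # → 2190 (pulled up by the ₹10,000 order)
print(statistics.median(order_values))  # → 200 (unaffected by the outlier)

shoe_sizes = [7, 7, 8, 8, 8, 9, 10]
print(statistics.mode(shoe_sizes))      # → 8 (the most sold size)
```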
When to use which: Covered in detail in next topic (Mean, Median, Mode).
2. Standard Deviation & Variance (Spread)
Where you see this: Risk analysis, quality control, anomaly detection.
- Variance: Average squared difference from mean (how spread out data is)
- Standard Deviation (SD): Square root of variance (same units as original data)
Example — Delivery Time Consistency:
- Restaurant A: Average 30 min, SD 5 min (consistent: 25-35 min range)
- Restaurant B: Average 30 min, SD 15 min (inconsistent: 15-45 min range)
- Insight: Restaurant A is more reliable despite same average
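A quick sketch of the restaurant comparison with made-up delivery times (the specific numbers are invented; statistics.pstdev gives the population standard deviation):

```python
import statistics

# Hypothetical delivery times (minutes): same mean, very different spread
restaurant_a = [28, 29, 30, 31, 32]   # tight around 30
restaurant_b = [15, 22, 30, 38, 45]   # same mean of 30, wide spread

for name, times in [("A", restaurant_a), ("B", restaurant_b)]:
    mean = statistics.mean(times)
    sd = statistics.pstdev(times)     # population standard deviation
    print(f"Restaurant {name}: mean={mean} min, SD={sd:.1f} min")
```

Both restaurants average 30 minutes, but B's SD is several times larger — the mean alone would hide that.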
3. Normal Distribution (Bell Curve)
Where you see this: Everywhere in nature and business.
- Shape: Symmetric bell curve, most data near mean, fewer at extremes
- 68-95-99.7 Rule:
- 68% of data within 1 SD of mean
- 95% within 2 SD
- 99.7% within 3 SD
- Example: Heights, test scores, measurement errors, website load times
Why it matters: Many statistical tests ASSUME normal distribution (T-tests, regression). Always check this assumption.
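The 68-95-99.7 rule can be checked directly from the normal CDF (scipy is used again later in this lesson):

```python
from scipy import stats

for k in (1, 2, 3):
    # Probability mass within k standard deviations of the mean
    p = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"Within {k} SD: {p:.1%}")
# → 68.3%, 95.4%, 99.7%
```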
4. Correlation vs Causation
Where you see this: Every analysis involving relationships between variables.
- Correlation: Two variables move together (might be coincidence)
- Example: Ice cream sales and drowning deaths both increase in summer (correlated, not causal)
- Causation: One variable DIRECTLY CAUSES change in another
- Example: Adding free shipping CAUSES higher conversion rates (proven via A/B test)
Rule: Correlation ≠ Causation. Need experiments (A/B tests) to prove causation.
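A small simulation makes the ice-cream example concrete. Here temperature (an invented confounder) drives both series, so they correlate strongly even though neither causes the other:

```python
import numpy as np

rng = np.random.default_rng(42)
temperature = rng.uniform(15, 40, 200)   # simulated daily temperature (°C)

# Both variables depend on temperature — NOT on each other
ice_cream_sales = 50 * temperature + rng.normal(0, 100, 200)
drownings = 0.3 * temperature + rng.normal(0, 2, 200)

r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"Correlation: {r:.2f}")   # strongly positive, yet no causal link
```

Only an experiment (or knowledge of the data-generating process, which we have here by construction) reveals that temperature is the common cause.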
5. P-Value (Statistical Significance)
Where you see this: A/B tests, hypothesis tests, regression outputs.
- Definition: Probability of seeing a difference at least this large if there were actually no real effect (i.e., if only random chance were at work)
- Interpretation:
- p < 0.05 (5%): Result is statistically significant (likely real effect)
- p ≥ 0.05: Not significant (could be random noise)
- Example: A/B test shows 2.5% vs 2.8% conversion. P-value = 0.03 → Significant difference (not luck).
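That A/B example can be worked by hand as a two-proportion z-test; the user counts below are assumptions chosen to match the 2.5% vs 2.8% rates:

```python
import math
from scipy import stats

# Hypothetical A/B test counts
n_a, conv_a = 25000, 625    # control:   625/25000 = 2.5%
n_b, conv_b = 25000, 700    # treatment: 700/25000 = 2.8%

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled proportion
se = math.sqrt(p_pool * (1 - p_pool) * (1/n_a + 1/n_b))  # SE under the null
z = (p_b - p_a) / se
p_value = 2 * stats.norm.sf(abs(z))                      # two-sided p-value
print(f"z = {z:.2f}, p = {p_value:.3f}")                 # p < 0.05 → significant
```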
6. Confidence Intervals
Where you see this: Survey results, forecasts, any estimate from sample data.
- Definition: Range where true population value likely falls
- Example: "Average order value: ₹1,250 (95% CI: ₹1,200 - ₹1,300)"
- Interpretation: If you repeated this study 100 times, about 95 of the resulting intervals would contain the true average
Why 95%?: Industry standard (balances confidence and precision). Some use 90% or 99% depending on risk tolerance.
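Computing a 95% CI for a mean takes two lines with scipy; the order values below are simulated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(1250, 540, 1000)   # simulated order values (₹)

mean = sample.mean()
se = stats.sem(sample)                 # standard error of the mean
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=se)
print(f"Mean ₹{mean:.0f}, 95% CI: [₹{low:.0f}, ₹{high:.0f}]")
```

Notice the interval is narrow relative to the ₹540 spread of individual orders: the CI describes uncertainty about the *mean*, which shrinks as the sample grows.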
7. Sample Size & Power
Where you see this: Planning A/B tests, surveys, experiments.
- Sample size: How many observations you need for reliable results
- Statistical power: Probability of detecting a real effect (if it exists)
- Trade-off: Larger sample = more confident, but more expensive/time-consuming
Example — A/B Test Planning:
- Want to detect a 5% relative conversion lift (from 2% to 2.1%)
- Need roughly 300,000 users per variant (~600K total) for 80% power
- If you only have 1,000 users → Underpowered (can't detect small changes reliably)
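The required sample size per variant comes from the standard two-proportion formula; a sketch assuming 80% power and α = 0.05 (the function name is mine):

```python
import math
from scipy import stats

def n_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-proportion test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # ≈ 1.96 for two-sided 5%
    z_beta = stats.norm.ppf(power)            # ≈ 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

print(n_per_variant(0.02, 0.021))   # tiny lift → hundreds of thousands of users
print(n_per_variant(0.02, 0.024))   # bigger lift → far fewer users needed
```

The quadratic dependence on the lift (p1 − p2) is why detecting small effects is so expensive: halving the detectable lift quadruples the required sample.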
Statistics in Analyst Workflow
Here's how you actually USE statistics in a typical project.
Step-by-Step: Analyzing Swiggy Delivery Time Trends
Business Question: Are delivery times increasing over time? Should we investigate operations?
Step 1: Descriptive Statistics (Explore)
```python
# Load data: 100,000 deliveries from Jan-Mar 2025
import pandas as pd
df = pd.read_csv('deliveries.csv')

# Summary statistics
df['delivery_time_min'].describe()
# Output:
# count    100000.0
# mean         32.5
# std           8.2
# min          10.0
# 25%          27.0
# 50%          31.0   ← Median
# 75%          37.0
# max          95.0
```

Insights:
- Average delivery: 32.5 minutes
- Median: 31 minutes (close to mean → not heavily skewed)
- Standard deviation: 8.2 minutes (typical variation)
- Some outliers: Max 95 minutes (investigate these)
Step 2: Visualize Distribution
```python
import matplotlib.pyplot as plt

df['delivery_time_min'].hist(bins=50)
plt.xlabel('Delivery Time (minutes)')
plt.ylabel('Frequency')
plt.title('Distribution of Delivery Times')
plt.show()
```

Observation: Roughly normal distribution (bell curve) with a slight right skew (long tail of slow deliveries).
Step 3: Compare Time Periods (Inferential Statistics)
```python
# Split data: Jan vs Mar
jan = df[df['month'] == 1]['delivery_time_min']
mar = df[df['month'] == 3]['delivery_time_min']

# Descriptive comparison
print(f"Jan mean: {jan.mean():.1f} min")  # 31.2 min
print(f"Mar mean: {mar.mean():.1f} min")  # 33.8 min
# Difference: 2.6 minutes increase
```

Question: Is the 2.6-minute difference statistically significant, or just random variation?
Answer: Run T-test (compares means of two groups)
```python
from scipy import stats

t_stat, p_value = stats.ttest_ind(jan, mar)
print(f"P-value: {p_value:.4f}")  # 0.0001
```

Interpretation:
- P-value = 0.0001 (< 0.05) → Statistically significant
- Delivery times ARE increasing (not random)
- Action: Alert operations team, investigate causes (more orders, traffic, driver shortage?)
Step 4: Confidence Interval (Quantify Increase)
```python
import numpy as np

mean_diff = mar.mean() - jan.mean()   # 2.6 min
# Standard error of the difference between two INDEPENDENT means
# (don't subtract the samples element-wise — they're separate groups)
se = np.sqrt(stats.sem(jan)**2 + stats.sem(mar)**2)
ci = stats.t.interval(0.95, df=len(jan) + len(mar) - 2, loc=mean_diff, scale=se)
print(f"95% CI: [{ci[0]:.1f}, {ci[1]:.1f}] minutes")
# Output: [2.2, 3.0] minutes
```

Interpretation: The true increase is between 2.2 and 3.0 minutes (with 95% confidence).
Step 5: Communicate to Stakeholders
Bad: "March deliveries are slower."
Good: "Delivery times increased 2.6 minutes (8% slower) from January to March. This increase is statistically significant (p < 0.001) and consistent across all cities. We estimate the true increase is 2.2-3.0 minutes (95% confidence). Recommend investigating operational capacity."
Key: Use statistics to be PRECISE and CONFIDENT, not vague.
Statistics transforms "I think deliveries are slower" into "Deliveries are 2.6 minutes slower (95% CI: 2.2-3.0 min, p < 0.001) — significant and actionable." This is the power of quantifying uncertainty.
Common Statistical Mistakes to Avoid
Even experienced analysts make these errors. Knowing them keeps you from misleading stakeholders.
Mistake 1: Confusing Correlation with Causation
- Example: "Cities with more data analysts have higher GDP. Therefore, hiring data analysts increases GDP."
- Problem: Correlation ≠ causation. Maybe rich cities can afford more analysts (reverse causality), or a third factor (tech industry) drives both.
- Fix: Use experiments (A/B tests) or causal inference methods (regression with controls).
Mistake 2: Using Mean When You Should Use Median
- Example: "Average salary at our startup: ₹14.5 lakhs." Sounds great!
- Reality: The CEO earns ₹1 crore; the other 9 employees earn ₹5 lakhs each. Mean = ₹14.5L (misleading). Median = ₹5L (the typical employee).
- Fix: Use the median for skewed data (income, order values, website load times).
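The salary example in numbers (values in ₹ lakhs):

```python
import statistics

salaries = [100] + [5] * 9   # CEO at ₹1 crore (= ₹100L), nine employees at ₹5L
print(statistics.mean(salaries))    # → 14.5 (the misleading "average")
print(statistics.median(salaries))  # → 5.0 (what a typical employee earns)
```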
Mistake 3: Ignoring Sample Size
- Example: Restaurant A (4.5★, 10 reviews) vs Restaurant B (4.3★, 5,000 reviews). Choose A?
- Problem: Small samples have high uncertainty; A's rating could be luck.
- Fix: Consider sample size. B's 4.3★ is more reliable. Check confidence intervals.
Mistake 4: P-Hacking (Cherry-Picking Results)
- Example: Run 20 A/B tests; one shows p = 0.04 (significant). Declare victory!
- Problem: With 20 tests at a 5% error rate, about 1 false positive is expected. You found noise, not signal.
- Fix: Pre-register your hypothesis, use a Bonferroni correction (stricter p-value threshold for multiple tests), or split data (train/validation sets).
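A quick simulation of the multiple-testing trap: run 20 "A/A" tests, where both groups are drawn from the same distribution by construction, and spurious "significant" results still appear at roughly the 5% rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
false_positives = 0
for _ in range(20):
    # Two samples from the SAME distribution — any "effect" is pure noise
    a = rng.normal(0, 1, 500)
    b = rng.normal(0, 1, 500)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1
print(f"'Significant' results out of 20 null tests: {false_positives}")
```

On average about 1 of the 20 null tests will clear p < 0.05 — exactly the trap described above.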
Mistake 5: Stopping A/B Test Too Early
- Example: Day 1 of an A/B test shows a 10% lift (p = 0.03). Stop the test and declare a winner.
- Problem: Early data is noisy and p-values fluctuate; stopping early inflates false positives (the "peeking" problem).
- Fix: Pre-calculate the required sample size and wait until you hit it. Don't peek at p-values mid-test.
Mistake 6: Assuming Normality Without Checking
- Example: Run a T-test on website load times (heavily right-skewed: most fast, a few very slow).
- Problem: The T-test assumes normality; skewed data violates that assumption → invalid results.
- Fix: Check the distribution (histogram), use a non-parametric test (Mann-Whitney U) for skewed data, or log-transform the data to normalize it.
Mistake 7: Extrapolating Beyond Data Range
- Example: Revenue grows 20% per month for 6 months. Extrapolate: "We'll make ₹100Cr in 2 years!"
- Problem: Linear/exponential extrapolation breaks at scale (market saturation, competition).
- Fix: Build realistic models with constraints, use domain knowledge, and don't blindly extrapolate trends.