Why Statistics for Analysts?
Statistics helps you:
- Summarize large datasets
- Identify patterns and outliers
- Test hypotheses
- Quantify uncertainty
- Make data-backed predictions
Descriptive Statistics
Measures of Central Tendency
Mean (Average):
Mean = Sum of all values / Count
Example: [10, 20, 30] → Mean = 20
Median (Middle Value):
Sort data, pick middle value
Example: [10, 15, 100] → Median = 15
(Better than mean when outliers exist!)
Mode (Most Frequent):
Example: [1, 2, 2, 3] → Mode = 2
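All three measures are in Python's standard library. A minimal sketch, reusing the example values above:

```python
# Mean, median, and mode from the standard library
import statistics

data = [10, 15, 100]
print(statistics.mean(data))    # pulled up by the outlier 100
print(statistics.median(data))  # 15 - unaffected by the outlier

print(statistics.mode([1, 2, 2, 3]))  # 2
```

Note how a single outlier (100) drags the mean well above the median, which is exactly why the median is the safer "typical value" here.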
Measures of Spread
Range:
Range = Max - Min
Example: [10, 50] → Range = 40
Variance: Average of squared differences from mean
High variance = Data is spread out
Low variance = Data is clustered
Standard Deviation (σ): Square root of variance
For roughly normally distributed data:
±1σ contains ~68% of data
±2σ contains ~95% of data
±3σ contains ~99.7% of data
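A quick sketch of these spread measures on simulated data (the mean of 50 and σ of 10 are arbitrary illustration values):

```python
# Variance, standard deviation, and the 68% rule on simulated normal data
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=100_000)  # mean 50, sigma 10

print(data.var())  # ~100 (variance = sigma squared)
print(data.std())  # ~10

# Fraction of points within one standard deviation of the mean
within_1sigma = np.mean(np.abs(data - data.mean()) <= data.std())
print(within_1sigma)  # ~0.68
```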
Probability Basics
Probability = Favorable outcomes / Total outcomes
Example: Probability of rolling a 6 on a fair die = 1/6 ≈ 16.7%
Key Concepts
- Independent events: Coin flip outcomes don't affect each other
- Dependent events: Drawing cards without replacement
- Conditional probability: P(A|B) = Probability of A given B happened
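The card-drawing bullet above can be made concrete with a short calculation. Drawing without replacement makes the second draw dependent on the first:

```python
# Dependent events: drawing two aces without replacement.
# P(second is an ace | first is an ace) = 3/51, not 4/52.
p_first_ace = 4 / 52
p_second_ace_given_first = 3 / 51  # one ace (and one card) already gone
p_both_aces = p_first_ace * p_second_ace_given_first
print(round(p_both_aces, 4))  # 0.0045
```

With replacement the two draws would be independent and the answer would be (4/52)², slightly larger.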
Correlation vs Causation
Correlation
Measures the strength of the linear relationship between two variables (-1 to +1).
- +1 = Perfect positive (both increase together)
- 0 = No linear relationship
- -1 = Perfect negative (one increases, other decreases)
Example: Ice cream sales correlate with drowning deaths (both increase in summer), but ice cream doesn't cause drowning!
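A sketch of the -1 to +1 scale using NumPy's Pearson correlation (the data values are made up for illustration):

```python
# Pearson correlation for perfectly related data
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.corrcoef(x, 2 * x + 1)[0, 1])  # 1.0  (perfect positive)
print(np.corrcoef(x, -3 * x)[0, 1])     # -1.0 (perfect negative)
```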
Causation
One variable directly causes change in another.
To establish causation, you need:
- Correlation exists
- Temporal order (the cause comes before the effect)
- No confounding variables
- Ideally, a controlled experiment
Normal Distribution (Bell Curve)
Many natural measurements approximately follow a bell curve.
Properties:
- Mean = Median = Mode (center)
- Symmetric
- 68-95-99.7 rule applies
Real examples:
- Heights
- Test scores
- Measurement errors
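The "Mean = Median = Mode" property is easy to check on simulated data (the height mean of 170 cm and σ of 8 cm are hypothetical values):

```python
# For approximately normal data, the mean and median nearly coincide
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=8, size=50_000)  # hypothetical heights in cm

print(round(heights.mean(), 1))
print(round(np.median(heights), 1))  # very close to the mean
```

A large gap between mean and median is a quick sign your data is skewed rather than normal.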
Hypothesis Testing
The Process
1. State hypotheses
   - H0 (Null): No effect
   - H1 (Alternative): There is an effect
2. Collect data
3. Calculate p-value
   - Probability of seeing results at least this extreme if H0 is true
4. Decide
   - p < 0.05: Reject H0 (statistically significant!)
   - p ≥ 0.05: Fail to reject H0
Example
Question: Did new website design increase sales?
- H0: New design has no effect
- H1: New design increases sales
Results: p-value = 0.02
Conclusion: Reject H0 at the 5% significance level. The data support the claim that the new design increased sales.
Confidence Intervals
Range where true value likely lies.
Example: "Average customer age is 35 ± 2 years (95% CI)" means we're 95% confident the true average lies between 33 and 37.
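A minimal sketch of computing such an interval with the normal approximation (the age data is simulated; mean 35 and sample size 400 are illustrative):

```python
# 95% confidence interval for a mean, normal approximation
import numpy as np

rng = np.random.default_rng(7)
ages = rng.normal(loc=35, scale=10, size=400)  # hypothetical customer ages

mean = ages.mean()
sem = ages.std(ddof=1) / np.sqrt(len(ages))  # standard error of the mean
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"{mean:.1f} years, 95% CI [{low:.1f}, {high:.1f}]")
```

Larger samples shrink the standard error, so the interval narrows as n grows.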
Statistical Significance
p-value < 0.05 = Statistically significant
What it means:
- If there were truly no effect, a result this extreme would occur less than 5% of the time
- NOT the same as "important" or "large effect"
Example:
- Finding: Website tweak increases clicks by 0.1%
- p-value: 0.001 (highly significant!)
- But: 0.1% increase might not be business-relevant
Common Distributions
| Distribution | Use Case |
|--------------|----------|
| Normal | Heights, test scores |
| Binomial | Success/failure counts (coin flips) |
| Poisson | Event counts (website visits/hour) |
| Uniform | Random number generators |
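Each row of the table maps to a NumPy sampler; a quick sketch (the distribution parameters are illustrative):

```python
# Sampling from each distribution in the table
import numpy as np

rng = np.random.default_rng(1)
normal_s  = rng.normal(loc=0, scale=1, size=5)   # heights, test scores
binom_s   = rng.binomial(n=10, p=0.5, size=5)    # successes in 10 coin flips
poisson_s = rng.poisson(lam=3, size=5)           # event counts per hour
uniform_s = rng.uniform(low=0, high=1, size=5)   # random number generators
print(normal_s, binom_s, poisson_s, uniform_s, sep="\n")
```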
Real Example: Salary Analysis
Dataset: 1000 employee salaries
Questions to answer:
1. What's the typical salary?
   - Mean: ₹52,000
   - Median: ₹48,000 (better here - not pulled up by the CEO's ₹5M salary!)
2. How spread out are salaries?
   - Standard deviation: ₹15,000
   - If roughly normal, ~68% of employees earn ₹37K-₹67K
3. Are salaries normally distributed?
   - Check a histogram - if it's bell-shaped, yes
4. Do engineers earn more than designers?
   - Run a hypothesis test
   - p-value = 0.03
   - Yes, a statistically significant difference!
Python for Statistics

```python
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv('sales.csv')

# Descriptive stats
print(df['revenue'].mean())
print(df['revenue'].median())
print(df['revenue'].std())

# Correlation
correlation = df['ad_spend'].corr(df['revenue'])
print(f"Correlation: {correlation}")

# Hypothesis test (independent two-sample t-test)
group_a = df[df['variant'] == 'A']['conversion_rate']
group_b = df[df['variant'] == 'B']['conversion_rate']
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"p-value: {p_value}")
```

Common Mistakes
❌ Confusing correlation with causation → Remember: Correlation ≠ Causation
❌ p-hacking (testing until you get p < 0.05) → Define your hypothesis before collecting data
❌ Ignoring sample size → Larger samples = more reliable results
❌ Assuming significance = importance → Consider practical significance too
Summary
✓ Mean vs Median (use the median when outliers exist)
✓ Standard deviation measures spread
✓ Correlation ≠ Causation
✓ p < 0.05 = statistically significant
✓ Confidence intervals quantify uncertainty
✓ Hypothesis testing provides evidence for or against a theory - it never proves one
Next: A/B Testing & Experimentation! 🧪