5 min read
Bivariate Analysis
Learn to analyze relationships between two columns
What is Bivariate Analysis?
Bivariate means "two variables". Bivariate analysis looks at how two columns relate to each other.
Questions like:
- Do older people earn more?
- Do men or women buy more?
- Does education affect salary?
Number vs Number
Compare two numeric columns:
code.py
import pandas as pd
df = pd.DataFrame({
'Age': [25, 30, 35, 40, 45, 50],
'Salary': [40000, 50000, 55000, 65000, 70000, 80000]
})
# Correlation: do they move together?
print(df['Age'].corr(df['Salary']))
Output: about 0.996 (very strong positive relationship)
- Close to 1: Both go up together
- Close to -1: One goes up, other goes down
- Close to 0: No linear relationship (the columns may still be related in a non-linear way)
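To illustrate the last point, here is a minimal sketch with made-up data (the `x` and `y` columns are assumptions for this example, not part of the dataset above). `y` is completely determined by `x`, yet the correlation comes out at or extremely close to 0, because `corr()` uses the Pearson method by default and only picks up straight-line relationships.

```python
import pandas as pd

# Illustrative data (an assumption for this sketch): y is exactly x squared
curve = pd.DataFrame({'x': [-2, -1, 0, 1, 2]})
curve['y'] = curve['x'] ** 2

# Pearson correlation only measures straight-line (linear) association,
# so a perfect U-shaped relationship still scores roughly zero
r = curve['x'].corr(curve['y'])
print(r)
```

A scatter plot of the two columns is usually the quickest way to spot a pattern like this.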
Category vs Number
Compare categories with numbers:
code.py
df = pd.DataFrame({
'Department': ['Sales', 'IT', 'Sales', 'IT', 'HR', 'HR'],
'Salary': [50000, 70000, 55000, 75000, 45000, 48000]
})
# Average salary by department
print(df.groupby('Department')['Salary'].mean())
Output:
Department
HR       46500.0
IT       72500.0
Sales    52500.0
Name: Salary, dtype: float64
IT has the highest average salary.
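With more than a few groups, it can help to sort the averages rather than scan them by eye. The following is a small sketch using the same department data; the variable names `avg_salary` and `top_department` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'IT', 'Sales', 'IT', 'HR', 'HR'],
    'Salary': [50000, 70000, 55000, 75000, 45000, 48000]
})

# Mean salary per department, highest first
avg_salary = df.groupby('Department')['Salary'].mean().sort_values(ascending=False)
print(avg_salary)

# Label of the department with the highest mean salary
top_department = avg_salary.idxmax()
print(top_department)  # IT
```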
More Stats by Group
code.py
# Multiple stats per group
print(df.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count']))
Category vs Category
Compare two categorical columns:
code.py
df = pd.DataFrame({
'Gender': ['M', 'F', 'M', 'F', 'M', 'F'],
'Bought': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No']
})
# Cross tabulation
print(pd.crosstab(df['Gender'], df['Bought']))
Output:
Bought  No  Yes
Gender
F        1    2
M        2    1
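If row and column totals are also useful, `crosstab()` accepts `margins=True`. A short sketch on the same data (the variable name `counts` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'Bought': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No']
})

# margins=True appends a total row and a total column, both named 'All' by default
counts = pd.crosstab(df['Gender'], df['Bought'], margins=True)
print(counts)
```

The totals make it easy to check how many rows sit behind each cell before reading too much into a small table.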
Add Percentages
code.py
# Percentage by row
print((pd.crosstab(df['Gender'], df['Bought'], normalize='index') * 100).round(2))
Output:
Bought     No    Yes
Gender
F       33.33  66.67
M       66.67  33.33
About 67% of women bought, compared with about 33% of men.
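The same purchase rates can be cross-checked without `crosstab()`. This sketch, on the same data, turns the `Bought` column into True/False values and averages them per gender; the names `bought` and `rate` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'Bought': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No']
})

# True where the person bought; the mean of True/False values is the share of True
bought = df['Bought'] == 'Yes'

# Share of buyers within each gender, as a percentage
rate = bought.groupby(df['Gender']).mean() * 100
print(rate.round(2))
```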
Quick Summary by Group
code.py
df = pd.DataFrame({
'City': ['NYC', 'LA', 'NYC', 'LA', 'NYC'],
'Age': [25, 30, 28, 35, 22],
'Salary': [50000, 60000, 55000, 70000, 45000]
})
# Summary for each city
print(df.groupby('City').describe())
Key Questions to Ask
- Numbers: What's the correlation?
- Category + Number: What's the average per group?
- Categories: How do combinations distribute?
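The three questions above map onto the three techniques in this lesson. As a rough sketch, a helper could choose between them based on the column types. `bivariate_summary` is a hypothetical function written for this example, not a pandas feature, and it simply treats every non-numeric column as a category.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def bivariate_summary(df, a, b):
    """Hypothetical helper: choose a summary based on the two column types."""
    a_num = is_numeric_dtype(df[a])
    b_num = is_numeric_dtype(df[b])
    if a_num and b_num:
        # Number vs number: correlation
        return df[a].corr(df[b])
    if a_num or b_num:
        # Category vs number: average of the numeric column per group
        num, cat = (a, b) if a_num else (b, a)
        return df.groupby(cat)[num].mean()
    # Category vs category: counts of each combination
    return pd.crosstab(df[a], df[b])

df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'LA', 'NYC'],
    'Age': [25, 30, 28, 35, 22],
    'Salary': [50000, 60000, 55000, 70000, 45000]
})

print(bivariate_summary(df, 'Age', 'Salary'))   # correlation
print(bivariate_summary(df, 'City', 'Salary'))  # mean salary per city
```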
Key Points
- corr() measures the linear relationship between two numeric columns
- groupby().mean() compares groups
- crosstab() counts category combinations
- Correlation close to 1 or -1 = strong linear relationship
- Correlation close to 0 = no linear relationship
What's Next?
Deep dive into correlation analysis and what the numbers mean.