Bivariate Analysis

What is Bivariate Analysis?

Bivariate means "two variables". Bivariate analysis looks at how two columns in a dataset relate to each other.

It helps answer questions like:

Do older people earn more?
Do men or women buy more?
Does education affect salary?

Number vs Number

Compare two numeric columns:

import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50],
    'Salary': [40000, 50000, 55000, 65000, 70000, 80000]
})

# Correlation: do they move together?
print(df['Age'].corr(df['Salary']))

Output: about 0.996 (a very strong positive relationship)

Close to 1: both go up together
Close to -1: one goes up as the other goes down
Close to 0: no linear relationship (the columns may still be related in a non-linear way)
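
A correlation near 0 only rules out a straight-line relationship. As a minimal sketch with made-up data (the names curve, x and y are illustrative and not part of the example above), here y depends entirely on x, yet corr() returns a value of essentially zero because the relationship is a curve:

```python
import pandas as pd

# Made-up data: y is exactly x squared, so y is fully determined by x
curve = pd.DataFrame({'x': [-3, -2, -1, 0, 1, 2, 3]})
curve['y'] = curve['x'] ** 2

# corr() defaults to Pearson correlation, which only detects linear trends
r = curve['x'].corr(curve['y'])
print(round(r, 3))
```

A scatter plot of the two columns is usually the quickest way to spot a pattern like this.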

Category vs Number

Compare categories with numbers:

df = pd.DataFrame({
    'Department': ['Sales', 'IT', 'Sales', 'IT', 'HR', 'HR'],
    'Salary': [50000, 70000, 55000, 75000, 45000, 48000]
})

# Average salary by department
print(df.groupby('Department')['Salary'].mean())

Output:

Department
HR       46500.0
IT       72500.0
Sales    52500.0
Name: Salary, dtype: float64

IT has the highest average salary.

More Stats by Group

# Multiple stats per group
print(df.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count']))
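
If the default column names are not descriptive enough, pandas also supports named aggregation. The sketch below reuses the department data from above. The output column names (avg_salary, lowest, highest, headcount) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'IT', 'Sales', 'IT', 'HR', 'HR'],
    'Salary': [50000, 70000, 55000, 75000, 45000, 48000]
})

# Named aggregation: each keyword becomes an output column name
stats = df.groupby('Department')['Salary'].agg(
    avg_salary='mean',
    lowest='min',
    highest='max',
    headcount='count',
)

# Sort so the department with the highest average comes first
stats = stats.sort_values('avg_salary', ascending=False)
print(stats)
```

Including the count next to the mean shows how many rows each average is based on.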

Category vs Category

Compare two categorical columns:

df = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'Bought': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No']
})

# Cross tabulation
print(pd.crosstab(df['Gender'], df['Bought']))

Output:

Bought   No  Yes
Gender
F         1    2
M         2    1

Add Percentages

# Percentage within each row (each gender adds up to 100), rounded to 2 decimals
print((pd.crosstab(df['Gender'], df['Bought'], normalize='index') * 100).round(2))

Output:

Bought      No    Yes
Gender
F        33.33  66.67
M        66.67  33.33

About 67% of the women bought, compared with about 33% of the men.
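
Percentages hide how many rows sit behind them. One way to keep the counts in view, sketched here on the same data, is the margins=True option of crosstab, which adds totals:

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'Bought': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No']
})

# margins=True appends an 'All' row and an 'All' column holding the totals
table = pd.crosstab(df['Gender'], df['Bought'], margins=True)
print(table)
```

Here each gender has only three rows, so the percentages above are an illustration, not evidence of a real difference.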

Quick Summary by Group

df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'LA', 'NYC'],
    'Age': [25, 30, 28, 35, 22],
    'Salary': [50000, 60000, 55000, 70000, 45000]
})

# Summary for each city
print(df.groupby('City').describe())
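
Describing every numeric column at once produces a wide table. A narrower variant, sketched below on the same data, selects one column first and then keeps only a few of the statistics. The name salary_summary is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'LA', 'NYC'],
    'Age': [25, 30, 28, 35, 22],
    'Salary': [50000, 60000, 55000, 70000, 45000]
})

# describe() on a single grouped column gives one row per city
salary_summary = df.groupby('City')['Salary'].describe()

# Keep only the statistics of interest
print(salary_summary[['count', 'mean', 'min', 'max']])
```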

Key Questions to Ask

Number + Number: What's the correlation?
Category + Number: What's the average per group?
Category + Category: How often does each combination occur?
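
The three questions above can be sketched as a single function that picks a summary from the column types. bivariate_summary is a hypothetical helper written for this illustration, not a pandas function. It assumes that any non-numeric column is a category.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def bivariate_summary(df, a, b):
    """Hypothetical helper: choose a summary based on the two column types."""
    a_num = is_numeric_dtype(df[a])
    b_num = is_numeric_dtype(df[b])
    if a_num and b_num:
        # Number vs number: correlation coefficient
        return df[a].corr(df[b])
    if a_num != b_num:
        # Category vs number: average of the numeric column per category
        cat, num = (b, a) if a_num else (a, b)
        return df.groupby(cat)[num].mean()
    # Category vs category: count of each combination
    return pd.crosstab(df[a], df[b])

people = pd.DataFrame({
    'City': ['NYC', 'LA', 'NYC', 'LA', 'NYC'],
    'Age': [25, 30, 28, 35, 22],
    'Salary': [50000, 60000, 55000, 70000, 45000]
})

print(bivariate_summary(people, 'Age', 'Salary'))   # correlation
print(bivariate_summary(people, 'City', 'Salary'))  # average salary per city
```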

Key Points

corr() measures the linear relationship between two numeric columns
groupby().mean() compares averages across groups
crosstab() counts category combinations
Correlation close to 1 or -1 = strong linear relationship
Correlation close to 0 = no linear relationship

What's Next?

A deep dive into correlation analysis and what the numbers mean.