End-to-end EDA Project
Apply all your skills to analyze a real-world dataset from start to finish
Project Overview
Goal: Analyze the "Driver Analysis" dataset to identify factors, such as speeding and alcohol, that are associated with accidents.
Dataset: We will use seaborn's built-in car_crashes sample dataset, which contains driver statistics by state.
Skills Applied:
- Pandas for data manipulation
- Seaborn/Matplotlib for visualization
- Statistical analysis
- Data cleaning
Step 1: Load and Inspect
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset (using seaborn's built-in car crashes dataset)
df = sns.load_dataset('car_crashes')
# Inspect
print(df.head())
df.info()  # info() prints its summary directly and returns None, so no print() is needed
print(df.describe())
Step 2: Data Cleaning
Check for missing values and duplicates.
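If either check in this step reports a non-zero count, the rows need handling before analysis. The following is a minimal sketch of one possible approach (drop exact duplicates, then drop rows with missing values). The helper name `clean_report` is hypothetical, and the tiny frame uses made-up values rather than the real dataset so that it runs without a download.

```python
import pandas as pd

def clean_report(df):
    """Return a cleaned copy of df plus a small report of what was found."""
    report = {
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }
    # Drop exact duplicate rows first, then any rows that still contain NaN.
    cleaned = df.drop_duplicates().dropna().reset_index(drop=True)
    report["rows_removed"] = len(df) - len(cleaned)
    return cleaned, report

# Toy frame with made-up values (NOT the real dataset).
toy = pd.DataFrame({
    "total": [10.0, 10.0, None, 14.5],
    "abbrev": ["AA", "AA", "BB", "CC"],
})
cleaned, report = clean_report(toy)
print(report)
print(cleaned)
```

Dropping rows is only one option. With a small table of one row per state, imputing or investigating the missing value may be preferable to losing a state entirely.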
# Check missing
print(df.isna().sum())
# Check duplicates
print(df.duplicated().sum())
Step 3: Univariate Analysis
Visualize the distribution of total accidents.
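A numeric summary can complement the plots in this step. The sketch below applies the 1.5 × IQR rule, the same default that box plot whiskers use, to list candidate outliers explicitly. The function name `iqr_outliers` is hypothetical, and the series holds made-up values rather than the real `total` column.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Toy series with made-up values (NOT the real dataset).
toy_total = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 40.0], name="total")

print("Skewness:", toy_total.skew())   # positive means a longer right tail
print(iqr_outliers(toy_total))         # only the value 40.0 is flagged
```

On the real data, the call would be `iqr_outliers(df['total'])`.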
# Histogram of total accidents
plt.figure(figsize=(10, 6))
sns.histplot(df['total'], kde=True, color='blue')
plt.title('Distribution of Total Accidents per Billion Miles')
plt.show()
# Boxplot to check outliers
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['total'])
plt.title('Boxplot of Total Accidents')
plt.show()
Step 4: Bivariate Analysis
Is there a relationship between alcohol and accidents?
# Scatter plot: Alcohol vs Total
plt.figure(figsize=(10, 6))
sns.scatterplot(x='alcohol', y='total', data=df)
plt.title('Alcohol-Impaired Drivers vs Total Accidents')
plt.xlabel('Alcohol-Impaired Drivers in Fatal Collisions')
plt.ylabel('Total Accidents')
plt.show()
# Correlation
corr = df['alcohol'].corr(df['total'])
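print(f"Correlation between Alcohol and Accidents: {corr:.2f}")
Interpretation: There is a strong positive correlation (above 0.8) between the alcohol and total columns. Keep in mind that alcohol here measures alcohol-impaired drivers involved in fatal collisions, not alcohol consumption, so these crashes are also counted in total and part of the correlation is built in.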
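When one column is a component of another, a high correlation between them is partly structural. One way to probe this, assuming the `alcohol` column counts a subset of the crashes in `total`, is to correlate the part with the remainder of the whole. The helper `part_whole_check` is a hypothetical name, and the frame below uses made-up values, not the real dataset.

```python
import pandas as pd

def part_whole_check(df, part="alcohol", whole="total"):
    """Compare a part column with the whole and with the rest of the whole."""
    remainder = df[whole] - df[part]  # crashes not attributed to `part`
    return {
        "part_vs_whole": df[part].corr(df[whole]),
        "part_vs_remainder": df[part].corr(remainder),
        "mean_share": (df[part] / df[whole]).mean(),
    }

# Toy frame with made-up values (NOT the real dataset).
toy = pd.DataFrame({
    "total": [10.0, 12.0, 15.0, 20.0],
    "alcohol": [3.0, 3.0, 5.0, 6.0],
})
result = part_whole_check(toy)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

If the part-versus-remainder correlation is much weaker than the part-versus-whole correlation, much of the headline figure comes from the overlap rather than from an independent relationship.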
Step 5: Multivariate Analysis
Let's look at the correlation matrix of all variables.
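A heatmap is easy to scan but hard to sort. As a companion, the same matrix can be flattened into a list of variable pairs ranked by absolute correlation. This is a sketch only: `ranked_pairs` is a hypothetical helper, and the frame contains made-up values rather than the real dataset.

```python
import numpy as np
import pandas as pd

def ranked_pairs(df):
    """Return each pair of numeric columns once, strongest correlation first."""
    corr = df.corr(numeric_only=True)
    # Keep only the upper triangle (k=1 excludes the diagonal),
    # so each pair appears once and self-correlations are dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=abs, ascending=False)

# Toy frame with made-up values (NOT the real dataset).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.1, 5.9, 8.0],
    "c": [4.0, 1.0, 3.0, 2.0],
    "label": list("wxyz"),  # non-numeric column, ignored by numeric_only=True
})
pairs = ranked_pairs(toy)
print(pairs)
```

On the real data, `ranked_pairs(df).head()` would list the strongest relationships to examine first.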
# Correlation Matrix
plt.figure(figsize=(12, 10))
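# numeric_only=True excludes the text column 'abbrev', which df.corr() cannot handle in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')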
plt.title('Correlation Matrix of Driver Statistics')
plt.show()
Step 6: Insights and Conclusions
Based on our analysis:
- Alcohol is a major factor: Strong correlation with total accidents.
- Speeding: Also positively correlated with total accidents; read the exact strength from the heatmap.
- Insurance Premiums: Interestingly, premiums don't show a strong correlation with accident rates in this dataset (requires further investigation).
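These conclusions rest on Pearson correlation, which captures only linear relationships and is sensitive to outliers. One way to begin the further investigation is to place Pearson and Spearman (rank-based) coefficients side by side for each variable against `total`. In this sketch, Spearman is computed as the Pearson correlation of the ranks. The helper name `compare_with_target` is hypothetical, and the frame holds made-up values, not the real dataset.

```python
import pandas as pd

def compare_with_target(df, target="total"):
    """Pearson and Spearman correlation of every numeric column with target."""
    numeric = df.select_dtypes("number")
    pearson = numeric.corr()[target].drop(target)
    # Spearman correlation equals the Pearson correlation of the ranks.
    spearman = numeric.rank().corr()[target].drop(target)
    table = pd.DataFrame({"pearson": pearson, "spearman": spearman})
    return table.sort_values("pearson", ascending=False)

# Toy frame with made-up values (NOT the real dataset).
toy = pd.DataFrame({
    "total": [10.0, 12.0, 15.0, 20.0, 25.0],
    "speeding": [4.0, 3.0, 6.0, 5.0, 9.0],
    "alcohol": [3.0, 3.5, 5.0, 6.0, 9.0],
    "ins_premium": [900.0, 700.0, 1000.0, 800.0, 850.0],
    "abbrev": ["AA", "BB", "CC", "DD", "EE"],
})
table = compare_with_target(toy)
print(table.round(2))
```

A large gap between the two coefficients for a variable suggests a non-linear relationship or influential outliers, and a scatter plot of that variable is a sensible next check.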
Challenge
- Create a new column safe_driver_score combining speeding and alcohol metrics.
- Visualize the top 5 safest and most dangerous states (you'll need the state abbreviation column).
- Use Plotly to create an interactive map of accidents by state.
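As a starting point for the first two challenge items, here is one possible way to build the score. The formula is an assumption made for illustration, not a standard metric: each column is min-max scaled to the range 0 to 1, the two are averaged, and the result is inverted so that higher means safer. The helper `add_safe_driver_score` is hypothetical, and the frame uses made-up values, not the real dataset.

```python
import pandas as pd

def add_safe_driver_score(df, columns=("speeding", "alcohol")):
    """Return a copy of df with a 0-to-1 score where higher means safer."""
    out = df.copy()
    parts = out[list(columns)]
    # Min-max scale each column; this assumes no column is constant,
    # otherwise the denominator would be zero.
    scaled = (parts - parts.min()) / (parts.max() - parts.min())
    out["safe_driver_score"] = 1 - scaled.mean(axis=1)
    return out

# Toy frame with made-up values (NOT the real dataset).
toy = pd.DataFrame({
    "abbrev": ["AA", "BB", "CC"],
    "speeding": [2.0, 4.0, 6.0],
    "alcohol": [1.0, 3.0, 5.0],
})
scored = add_safe_driver_score(toy)
safest = scored.nlargest(1, "safe_driver_score")
riskiest = scored.nsmallest(1, "safe_driver_score")
print(scored)
```

On the real data, `nlargest(5, 'safe_driver_score')` and `nsmallest(5, 'safe_driver_score')` would give the two groups of states to plot, with `abbrev` as the axis label.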
Congratulations!
You have completed the Python for Data Analysis course! You now have the tools to:
- Write Python code
- Manipulate data with Pandas
- Visualize insights with Seaborn/Plotly
- Build basic Machine Learning models
Keep learning and building!