End-to-end EDA Project

Project Overview

Goal: Analyze the "Driver Analysis" dataset to identify factors contributing to speeding and accidents.

Dataset: We will use seaborn's built-in car_crashes dataset, which contains driver statistics by US state.

Skills Applied:

Pandas for data manipulation
Seaborn/Matplotlib for visualization
Statistical analysis
Data cleaning

Step 1: Load and Inspect

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset (using seaborn's built-in car crashes dataset)
df = sns.load_dataset('car_crashes')

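# Inspect the first rows, column types, and summary statistics
print(df.head())
df.info()  # info() prints its report itself; print(df.info()) would also print "None"
print(df.describe())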

Step 2: Data Cleaning

Check for missing values and duplicates.

# Check missing
print(df.isna().sum())

# Check duplicates
print(df.duplicated().sum())
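
If either check reports a nonzero count, a cleaning pass might look like the sketch below. The helper name basic_clean, the median-fill strategy, and the tiny example frame are assumptions made for illustration; they are not part of the dataset or of the steps above.

```python
import pandas as pd


def basic_clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, then fill numeric gaps with the column median."""
    cleaned = frame.drop_duplicates().copy()
    numeric_cols = cleaned.select_dtypes(include="number").columns
    # fillna with a Series of medians fills each column with its own median
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())
    return cleaned


# Hypothetical toy frame with one duplicate row and one missing value
toy = pd.DataFrame({
    "total": [10.0, 10.0, None, 20.0],
    "abbrev": ["AA", "AA", "BB", "CC"],
})
cleaned_toy = basic_clean(toy)
print(cleaned_toy)
```

Median filling is only one option; dropping incomplete rows with dropna() may be more appropriate when few rows are affected.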

Step 3: Univariate Analysis

Visualize the distribution of total, the number of drivers involved in fatal collisions per billion miles in each state.

# Histogram of total accidents
plt.figure(figsize=(10, 6))
sns.histplot(df['total'], kde=True, color='blue')
plt.title('Distribution of Total Accidents per Billion Miles')
plt.show()

# Boxplot to check outliers
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['total'])
plt.title('Boxplot of Total Accidents')
plt.show()
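
The boxplot flags outliers visually; the same idea can be checked numerically with the 1.5 × IQR rule that boxplot whiskers use by default. This is a minimal sketch: the function name iqr_outliers and the sample values are hypothetical, and with the real data you would pass df['total'] instead.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical values, not taken from the real dataset
sample = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 40.0])
flagged = iqr_outliers(sample)
print(flagged)
```

An empty result means no point lies beyond the whiskers.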

Step 4: Bivariate Analysis

Is there a relationship between alcohol and accidents?

# Scatter plot: Alcohol vs Total
plt.figure(figsize=(10, 6))
sns.scatterplot(x='alcohol', y='total', data=df)
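plt.title('Alcohol-Impaired Drivers vs Total Drivers in Fatal Collisions')
plt.xlabel('Alcohol-impaired drivers in fatal collisions (per billion miles)')
plt.ylabel('All drivers in fatal collisions (per billion miles)')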
plt.show()

# Correlation
corr = df['alcohol'].corr(df['total'])
print(f"Correlation between Alcohol and Accidents: {corr:.2f}")

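Interpretation: In this dataset there is a strong positive correlation (above 0.8) between alcohol-involved fatal collisions and total fatal collisions. Note that alcohol counts a subset of the drivers included in total, so part of this correlation is built in, and correlation alone does not show that alcohol causes more accidents.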
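
Because alcohol is a component of total, one way to probe how much of the correlation is structural is to correlate alcohol with the remainder (total minus alcohol). The sketch below assumes that approach; the helper corr_with_remainder and the toy numbers are hypothetical and are not taken from the real dataset.

```python
import pandas as pd


def corr_with_remainder(frame: pd.DataFrame, part: str, whole: str) -> float:
    """Pearson correlation between `part` and the rest of `whole` (whole - part)."""
    remainder = frame[whole] - frame[part]
    return frame[part].corr(remainder)


# Hypothetical numbers that borrow the dataset's column names
toy = pd.DataFrame({
    "total": [10.0, 14.0, 18.0, 22.0, 16.0],
    "alcohol": [3.0, 4.0, 6.0, 7.0, 4.0],
})
whole_corr = toy["alcohol"].corr(toy["total"])
remainder_corr = corr_with_remainder(toy, "alcohol", "total")
print(f"alcohol vs total:             {whole_corr:.2f}")
print(f"alcohol vs (total - alcohol): {remainder_corr:.2f}")
```

With the real data, calling corr_with_remainder(df, 'alcohol', 'total') gives a figure that excludes the part-of-whole overlap.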

Step 5: Multivariate Analysis

Let's look at the correlation matrix of all numeric variables.

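# Correlation matrix (numeric_only=True skips the text column 'abbrev',
# which would otherwise raise an error in recent pandas versions)
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')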
plt.title('Correlation Matrix of Driver Statistics')
plt.show()

Step 6: Insights and Conclusions

Based on our analysis:

Alcohol is a major factor: Strong correlation with total accidents.
Speeding: Also positively correlated with total accidents, though less strongly than alcohol.
Insurance Premiums: Interestingly, premiums don't show a strong correlation with accident rates in this dataset (requires further investigation).
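
Claims like these can be read off the heatmap, but ranking the correlations with total makes the comparison explicit. This is a minimal sketch: rank_correlations is a hypothetical helper, and the toy frame only borrows the dataset's column names with made-up values.

```python
import pandas as pd


def rank_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of each numeric column with `target`, strongest first."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    # Sort by absolute value so strong negative correlations rank high too
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Hypothetical toy frame, not the real data
toy = pd.DataFrame({
    "total": [10.0, 14.0, 18.0, 22.0, 16.0],
    "alcohol": [3.0, 4.0, 6.0, 7.0, 4.0],
    "ins_premium": [900.0, 700.0, 950.0, 800.0, 850.0],
    "abbrev": ["AA", "BB", "CC", "DD", "EE"],
})
ranking = rank_correlations(toy, "total")
print(ranking)
```

With the real data, rank_correlations(df, 'total') lists every numeric column in order of how strongly it moves with total.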

Challenge

Create a new column safe_driver_score combining speeding and alcohol metrics.
Visualize the top 5 safest and most dangerous states (you'll need the state abbreviation column).
Use Plotly to create an interactive map of accidents by state.
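
As a starting point for the first two tasks, here is one possible sketch. The score definition is an assumption (the negated average of the z-scores of speeding and alcohol, so that higher means safer), and the function name add_safe_driver_score and the toy rows are hypothetical. Other definitions are equally valid.

```python
import pandas as pd


def add_safe_driver_score(frame: pd.DataFrame) -> pd.DataFrame:
    """Add safe_driver_score: higher means fewer speeding/alcohol cases (assumed definition)."""
    scored = frame.copy()
    metrics = scored[["speeding", "alcohol"]]
    # Standardize each metric so both contribute on the same scale
    z_scores = (metrics - metrics.mean()) / metrics.std()
    scored["safe_driver_score"] = -z_scores.mean(axis=1)
    return scored


# Hypothetical rows, not taken from the real dataset
toy = pd.DataFrame({
    "abbrev": ["AA", "BB", "CC", "DD"],
    "speeding": [2.0, 4.0, 6.0, 8.0],
    "alcohol": [1.0, 3.0, 5.0, 7.0],
})
scored_toy = add_safe_driver_score(toy)
safest = scored_toy.nlargest(2, "safe_driver_score")["abbrev"].tolist()
riskiest = scored_toy.nsmallest(2, "safe_driver_score")["abbrev"].tolist()
print(safest, riskiest)
```

With the real data, add_safe_driver_score(df).nlargest(5, 'safe_driver_score') would return the five highest-scoring states, ready to pass to a bar plot.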

Congratulations!

You have completed the Python for Data Analysis course! You now have the tools to:

Write Python code
Manipulate data with Pandas
Visualize insights with Seaborn/Plotly
Build basic Machine Learning models

Keep learning and building!