
End-to-end EDA Project

Apply all your skills to analyze a real-world dataset from start to finish

Project Overview

Goal: Analyze the "Driver Analysis" dataset to identify factors contributing to speeding and accidents.

Dataset: We will use seaborn's built-in car_crashes sample dataset, which contains driver statistics by US state.

Skills Applied:

  • Pandas for data manipulation
  • Seaborn/Matplotlib for visualization
  • Statistical analysis
  • Data cleaning

Step 1: Load and Inspect

code.py
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

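# Load dataset (seaborn's built-in car_crashes dataset; downloaded on first use, then cached)
df = sns.load_dataset('car_crashes')

# Inspect
print(df.head())
df.info()  # info() prints its summary itself and returns None, so no print() is needed
print(df.describe())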

Step 2: Data Cleaning

Check for missing values and duplicates.

code.py
# Check missing
print(df.isna().sum())

# Check duplicates
print(df.duplicated().sum())
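
If either check turns up problems, a small helper can make the follow-up explicit. The sketch below is one possible approach, not part of the original lesson: the function names quality_report and basic_clean are hypothetical, and dropping every incomplete row is only one simple policy among several.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and missing share for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "missing_pct": (frame.isna().mean() * 100).round(1),
    })


def basic_clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without exact duplicate rows or rows with missing values.

    Assumption: discarding incomplete rows is acceptable for the analysis.
    Imputation may be a better choice when many rows are affected.
    """
    return frame.drop_duplicates().dropna().reset_index(drop=True)
```

Calling quality_report(df) on the DataFrame loaded in Step 1 gives a per-column view, and basic_clean(df) returns a cleaned copy without modifying df.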

Step 3: Univariate Analysis

Visualize the distribution of total accidents.

code.py
# Histogram of total accidents
plt.figure(figsize=(10, 6))
sns.histplot(df['total'], kde=True, color='blue')
plt.title('Distribution of Total Accidents per Billion Miles')
plt.show()

# Boxplot to check outliers
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['total'])
plt.title('Boxplot of Total Accidents')
plt.show()
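
By default, the boxplot draws any point lying more than 1.5 times the interquartile range (IQR) beyond the quartiles as an individual marker. To list those points rather than just see them, the same rule can be applied directly. This is a minimal sketch; iqr_outliers is a hypothetical helper name, and the multiplier k=1.5 simply mirrors the plotting default.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Keep the original index so flagged rows can be looked up in the DataFrame
    return values[(values < lower) | (values > upper)]
```

For example, iqr_outliers(df['total']) returns the flagged values with their row labels, and df.loc[iqr_outliers(df['total']).index] shows the full rows. An empty result means no point falls outside the fences.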

Step 4: Bivariate Analysis

Is there a relationship between alcohol and accidents?

code.py
# Scatter plot: Alcohol vs Total
plt.figure(figsize=(10, 6))
sns.scatterplot(x='alcohol', y='total', data=df)
plt.title('Alcohol-Impaired Drivers vs Total Accidents')
plt.xlabel('Alcohol-Impaired Drivers in Fatal Collisions')
plt.ylabel('Total Accidents')
plt.show()

# Correlation
corr = df['alcohol'].corr(df['total'])
print(f"Correlation between Alcohol and Accidents: {corr:.2f}")

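Interpretation: There is a strong positive correlation (above 0.8 for this dataset) between the alcohol and total columns. Note that alcohol counts drivers in fatal collisions who were alcohol-impaired, not alcohol consumption, and those drivers are a subset of total, so part of the correlation is built in.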
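
For reference, Series.corr computes the Pearson coefficient by default: the covariance of the two columns divided by the product of their standard deviations. The sketch below spells that out by hand so the number is less of a black box. pearson_r is a hypothetical helper name, and it assumes two equal-length numeric Series with no missing values.

```python
import pandas as pd


def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson correlation computed from deviations around the means."""
    dx = x - x.mean()
    dy = y - y.mean()
    # The 1/(n-1) factors of covariance and variance cancel, so plain sums suffice
    return float((dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5)
```

pearson_r(df['alcohol'], df['total']) should agree with the value printed above up to floating-point rounding.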

Step 5: Multivariate Analysis

Let's look at the correlation matrix of all numeric variables.

code.py
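# Correlation Matrix (numeric columns only; 'abbrev' holds strings and would break corr() in pandas 2.x)
plt.figure(figsize=(12, 10))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm', fmt='.2f')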
plt.title('Correlation Matrix of Driver Statistics')
plt.show()
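
A heatmap is easy to scan but harder to rank by eye. As a complement, the sketch below pulls out one column of the correlation matrix and sorts it by absolute strength. The helper name correlations_with is hypothetical, and the default target of 'total' is an assumption based on the column used in the earlier steps.

```python
import pandas as pd


def correlations_with(frame: pd.DataFrame, target: str = "total") -> pd.Series:
    """Rank numeric columns by the strength of their correlation with `target`."""
    numeric = frame.select_dtypes(include="number")  # skips string columns such as 'abbrev'
    corr = numeric.corr()[target].drop(target)       # drop the trivial self-correlation of 1.0
    # Sort by absolute value so strong negative correlations are not buried
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)
```

print(correlations_with(df)) lists every other numeric column from the strongest to the weakest linear relationship with total, keeping the sign of each coefficient.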

Step 6: Insights and Conclusions

Based on our analysis:

  1. Alcohol is a major factor: Strong correlation with total accidents.
  2. Speeding: Also positively correlated with accidents, though less strongly than alcohol.
  3. Insurance Premiums: Interestingly, premiums don't show a strong correlation with accident rates in this dataset (requires further investigation).

Challenge

  1. Create a new column safe_driver_score combining speeding and alcohol metrics.
  2. Visualize the top 5 safest and most dangerous states (you'll need the state abbreviation column, abbrev).
  3. Use Plotly to create an interactive map of accidents by state.
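
As a possible starting point for the first two challenges, the sketch below builds a score and picks the extremes. Everything here is an assumption for illustration: the lesson does not define safe_driver_score, so the formula (one minus the average of the min-max scaled speeding and alcohol columns) is just one arbitrary choice, and the helper names are hypothetical.

```python
import pandas as pd


def min_max(values: pd.Series) -> pd.Series:
    """Scale a numeric Series to the range [0, 1]; a constant column maps to 0."""
    span = values.max() - values.min()
    if span == 0:
        return values * 0.0
    return (values - values.min()) / span


def add_safe_driver_score(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a safe_driver_score column (higher means safer).

    Assumed formula: 1 - mean(scaled speeding, scaled alcohol).
    """
    out = frame.copy()
    risk = (min_max(out["speeding"]) + min_max(out["alcohol"])) / 2
    out["safe_driver_score"] = 1 - risk
    return out


def safest_and_riskiest(frame: pd.DataFrame, n: int = 5):
    """Return (top n safest, top n riskiest) rows by safe_driver_score."""
    cols = ["abbrev", "safe_driver_score"]
    safest = frame.nlargest(n, "safe_driver_score")[cols]
    riskiest = frame.nsmallest(n, "safe_driver_score")[cols]
    return safest, riskiest
```

With the DataFrame from Step 1, scored = add_safe_driver_score(df) followed by safest, riskiest = safest_and_riskiest(scored) yields two small tables that can be passed to sns.barplot(x='abbrev', y='safe_driver_score', data=safest). Try weighting the two metrics differently and see how the ranking changes.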

Congratulations!

You have completed the Python for Data Analysis course! You now have the tools to:

  • Write Python code
  • Manipulate data with Pandas
  • Visualize insights with Seaborn/Plotly
  • Build basic Machine Learning models

Keep learning and building!
