
Data Cleaning Checklist — Step-by-Step Data Quality Guide

Data cleaning routinely consumes the majority of analysis time — often cited as around 80%. This checklist ensures you don't miss critical steps: nulls, duplicates, outliers, types, validation. Clean data = reliable insights.

📚 Beginner · ⏱️ 8 min

Data Cleaning Checklist

Phase 1: Initial Assessment

  • [ ] Load data and inspect first/last rows
  • [ ] Check dataset size (rows × columns)
  • [ ] Review column names and data types
  • [ ] Get summary statistics
  • [ ] Check for completely empty columns/rows
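A quick sketch of the last two items above, using a hypothetical toy frame (column names are illustrative):

```python
import pandas as pd
import numpy as np

# Toy frame (hypothetical data): 'notes' is completely empty,
# and the last row becomes all-null once 'notes' is gone
df = pd.DataFrame({
    'name':  ['Asha', 'Ravi', np.nan],
    'age':   [25, 30, np.nan],
    'notes': [np.nan, np.nan, np.nan],
})

print(f"Before: {df.shape}")        # Before: (3, 3)
df = df.dropna(axis=1, how='all')   # drop completely empty columns
df = df.dropna(axis=0, how='all')   # drop completely empty rows
print(f"After: {df.shape}")         # After: (2, 2)
```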

Phase 2: Missing Values

  • [ ] Identify columns with missing values
  • [ ] Calculate % missing per column
  • [ ] Decide strategy: drop, impute, or flag
  • [ ] Document missing value decisions
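The "flag" strategy above is the least familiar of the three, so here is a minimal sketch (the `income` column is hypothetical): add an indicator column before imputing, so downstream analysis still knows which values were originally missing.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'income': [50000.0, np.nan, 72000.0, np.nan]})

# Flag first: preserve the fact that the value was missing
df['income_missing'] = df['income'].isna()

# Then impute with the median of the observed values
df['income'] = df['income'].fillna(df['income'].median())

print(df['income'].tolist())       # [50000.0, 61000.0, 72000.0, 61000.0]
print(df['income_missing'].sum())  # 2
```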

Phase 3: Duplicates

  • [ ] Check for exact duplicate rows
  • [ ] Check for duplicate IDs (primary keys)
  • [ ] Identify partial duplicates (same name, different ID)
  • [ ] Remove or flag duplicates with justification
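The three levels of duplication above can be distinguished with `duplicated()` and its `subset` argument. A minimal sketch on a hypothetical customer table, where row 2 is an exact copy of row 1 and row 3 shares name/email with row 0 under a different id:

```python
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 2, 2, 3],
    'name':  ['Asha', 'Ravi', 'Ravi', 'Asha'],
    'email': ['a@x.com', 'r@x.com', 'r@x.com', 'a@x.com'],
})

exact   = df.duplicated().sum()                           # every column identical
dup_ids = df['id'].duplicated().sum()                     # primary key repeated
partial = df.duplicated(subset=['name', 'email']).sum()   # same person, different id

print(exact, dup_ids, partial)  # 1 1 2
```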

Phase 4: Data Types

  • [ ] Convert strings to dates where applicable
  • [ ] Cast numeric columns (currently strings)
  • [ ] Standardize categorical values (case, spelling)
  • [ ] Create derived columns (year, month from date)

Phase 5: Outliers

  • [ ] Identify outliers using IQR or z-score
  • [ ] Validate if outliers are errors or real
  • [ ] Decide: cap, remove, or keep with justification
  • [ ] Document outlier handling
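The z-score variant mentioned above can be sketched as follows (values are made up). One caveat worth knowing: a single extreme value inflates the standard deviation itself, so the z-score method needs a reasonable number of rows to flag anything; the IQR method used elsewhere in this guide is more robust on small samples.

```python
import pandas as pd

# Hypothetical prices: 19 ordinary values plus one extreme entry
s = pd.Series([5.0, 5.2, 5.4, 5.6, 5.8, 6.0, 5.1, 5.3, 5.5, 5.7,
               5.9, 5.2, 5.6, 5.8, 5.4, 5.0, 6.0, 5.5, 5.5, 500.0])

# z-score method: flag values more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
outliers = s[z.abs() > 3]
print(outliers.tolist())  # [500.0]
```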

Phase 6: Validation

  • [ ] Check value ranges (age 0-120, price > 0)
  • [ ] Validate formats (email, phone, ZIP)
  • [ ] Cross-field validation (end_date > start_date)
  • [ ] Check referential integrity (foreign keys exist)
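The last three checks above can each be expressed as a boolean mask in pandas. A sketch with hypothetical `orders` and `customers` tables (the email regex is deliberately simple, not RFC-complete):

```python
import pandas as pd

orders = pd.DataFrame({
    'customer_id': [1, 2, 9],
    'start_date':  pd.to_datetime(['2026-01-01', '2026-02-01', '2026-03-01']),
    'end_date':    pd.to_datetime(['2026-01-31', '2026-01-15', '2026-03-31']),
    'email':       ['a@x.com', 'bad-email', 'c@x.com'],
})
customers = pd.DataFrame({'customer_id': [1, 2, 3]})

# Format check: simple email pattern
bad_email = ~orders['email'].str.match(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

# Cross-field check: end_date must come after start_date
bad_dates = orders['end_date'] <= orders['start_date']

# Referential integrity: every customer_id must exist in customers
orphaned = ~orders['customer_id'].isin(customers['customer_id'])

print(bad_email.sum(), bad_dates.sum(), orphaned.sum())  # 1 1 1
```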

Phase 7: Standardization

  • [ ] Consistent naming (lowercase, underscores)
  • [ ] Uniform categories (Mumbai vs Bombay → Mumbai)
  • [ ] Date format consistency
  • [ ] Unit consistency (cm vs inches → cm)
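Unit conversion only works if you know which unit each row uses; a common pattern is a companion `unit` column. A sketch with hypothetical heights (1 inch = 2.54 cm):

```python
import pandas as pd

df = pd.DataFrame({'height': [170.0, 65.0, 180.0],
                   'unit':   ['cm', 'in', 'cm']})

# Convert inch rows to cm in place, then mark everything as cm
df.loc[df['unit'] == 'in', 'height'] *= 2.54
df['unit'] = 'cm'

print(df['height'].tolist())  # ≈ [170.0, 165.1, 180.0]
```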

Phase 8: Documentation

  • [ ] Document all cleaning steps
  • [ ] Record rows removed and why
  • [ ] Note assumptions made
  • [ ] Save cleaned dataset with clear name
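Documentation doesn't have to be a separate document — a simple machine-readable log built up during cleaning covers the first three items above. A minimal sketch (the log schema here is just one possible choice):

```python
import json
import pandas as pd

log = []

def log_step(action, detail, rows_before, rows_after):
    """Record one cleaning step and how many rows it removed."""
    log.append({'action': action, 'detail': detail,
                'rows_removed': rows_before - rows_after})

df = pd.DataFrame({'age': [25, 30, -5, 40]})  # hypothetical data
before = len(df)
df = df[df['age'] >= 0]
log_step('filter', 'dropped negative ages', before, len(df))

print(json.dumps(log, indent=2))
```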
🐍 Python Implementation

Complete Cleaning Pipeline:

import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('raw_data.csv')
print(f"Original shape: {df.shape}")

# 1. Initial inspection
print(df.head())
print(df.info())
print(df.describe())

# 2. Missing values
print("\nMissing values:")
print(df.isnull().sum())
print("\n% Missing:")
print((df.isnull().sum() / len(df)) * 100)

# Drop columns with 50% or more missing values
threshold = 0.5
df = df.loc[:, df.isnull().mean() < threshold]

# Impute numeric with median
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Impute categorical with mode
cat_cols = df.select_dtypes(include=['object']).columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

# 3. Duplicates
print(f"\nDuplicates: {df.duplicated().sum()}")
df = df.drop_duplicates()

# 4. Data types
# Convert date columns
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# 5. Outliers (cap at 99th percentile)
def cap_outliers(df, column, percentile=0.99):
    upper_limit = df[column].quantile(percentile)
    df[column] = df[column].clip(upper=upper_limit)
    return df

df = cap_outliers(df, 'price', 0.99)

# 6. Validation
# Remove invalid ages
df = df[(df['age'] >= 0) & (df['age'] <= 120)]

# Remove negative prices
df = df[df['price'] > 0]

# 7. Standardization
# Lowercase column names
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Standardize categorical values
df['city'] = df['city'].str.title()  # Mumbai not MUMBAI
df['city'] = df['city'].replace({'Bombay': 'Mumbai', 'Bengaluru': 'Bangalore'})

print(f"\nFinal shape: {df.shape}")
print(f"Rows removed: {len(pd.read_csv('raw_data.csv')) - len(df)}")

# Save cleaned data
df.to_csv('cleaned_data.csv', index=False)


🗄️ SQL Data Cleaning

Common Cleaning Queries:

-- 1. Find duplicates
SELECT customer_id, email, COUNT(*)
FROM customers
GROUP BY customer_id, email
HAVING COUNT(*) > 1;

-- Remove duplicates (keep the row with the lowest id; PostgreSQL syntax)
DELETE FROM customers
WHERE id NOT IN (
    SELECT MIN(id)
    FROM customers
    GROUP BY customer_id, email
);

-- 2. Handle missing values
-- Find rows with null in critical columns
SELECT *
FROM orders
WHERE customer_id IS NULL OR order_date IS NULL;

-- Replace nulls with default
UPDATE products
SET category = 'Uncategorized'
WHERE category IS NULL;

-- 3. Fix data types (PostgreSQL syntax)
-- Convert string to date
ALTER TABLE orders
ALTER COLUMN order_date TYPE DATE
USING order_date::DATE;

-- 4. Outlier detection (IQR method)
WITH stats AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY price) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY price) AS q3
    FROM products
)
SELECT *
FROM products, stats
WHERE price < (q1 - 1.5 * (q3 - q1))  -- Lower outliers
   OR price > (q3 + 1.5 * (q3 - q1)); -- Upper outliers

-- 5. Standardize text
UPDATE customers
SET city = INITCAP(city),  -- Title case
    email = LOWER(email);  -- Lowercase emails

-- Replace variations
UPDATE customers
SET city = REPLACE(city, 'Bengaluru', 'Bangalore');
⚠️ Common Data Quality Issues

Issue 1: Inconsistent Categories

Problem: Mumbai, MUMBAI, Bombay, mumbai (4 variations)

Solution:

# Standardize
df['city'] = df['city'].str.title()  # Mumbai
df['city'] = df['city'].replace({'Bombay': 'Mumbai', 'Bengaluru': 'Bangalore'})

# Verify
print(df['city'].value_counts())

Issue 2: Date Formats

Problem: 2026-03-15, 03/15/2026, 15-Mar-2026 (multiple formats)

Solution:

# pandas auto-detects most formats
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Custom format
df['date'] = pd.to_datetime(df['date'], format='%d-%b-%Y')
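When the column genuinely mixes several formats in one pass, auto-detection can silently misparse ambiguous dates. A version-safe sketch is to try each known format explicitly and fill the gaps with `combine_first` (the three formats below match the examples in the problem statement):

```python
import pandas as pd

raw = pd.Series(['2026-03-15', '03/15/2026', '15-Mar-2026'])

# Try each known format in turn; errors='coerce' leaves non-matches as NaT,
# and combine_first fills those NaTs from the next attempt
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
for fmt in ('%m/%d/%Y', '%d-%b-%Y'):
    parsed = parsed.combine_first(pd.to_datetime(raw, format=fmt, errors='coerce'))

print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['2026-03-15', '2026-03-15', '2026-03-15']
```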

Issue 3: Outliers

Problem: Salaries: [₹5L, ₹6L, ₹5.5L, ₹500L] (₹500L is either a data-entry error or a genuine extreme value — verify before removing)

Solution:

# IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df = df[(df['salary'] >= lower_bound) & (df['salary'] <= upper_bound)]

# OR cap outliers (keep row, adjust value)
df['salary'] = df['salary'].clip(lower=lower_bound, upper=upper_bound)

Issue 4: Mixed Data Types

Problem: Age column has: 25, "30", "unknown", ""

Solution:

# Convert to numeric, coerce errors to NaN
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Then impute the remaining nulls (assignment avoids the chained
# inplace=True pattern, which newer pandas versions warn about)
df['age'] = df['age'].fillna(df['age'].median())
