4 min read
Removing Duplicates
Learn to find and remove duplicate rows
What are Duplicates?
Duplicates are rows that appear more than once. Common causes:
- The same record entered twice
- Merging tables that share rows
- Data import errors
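For instance, stacking two exports that share a row is a common way duplicates slip in. A minimal sketch (the two batches here are hypothetical):

```python
import pandas as pd

# Two hypothetical exports that happen to share one record
batch_a = pd.DataFrame({'Name': ['John', 'Sarah'], 'Age': [25, 30]})
batch_b = pd.DataFrame({'Name': ['Sarah', 'Mike'], 'Age': [30, 28]})

# Concatenating them silently carries the shared row along twice
combined = pd.concat([batch_a, batch_b], ignore_index=True)
print(combined.duplicated().sum())  # 1 duplicate row
```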
Find Duplicates
code.py
import pandas as pd
df = pd.DataFrame({
'Name': ['John', 'Sarah', 'John', 'Mike', 'Sarah'],
'Age': [25, 30, 25, 28, 30]
})
print(df)
# Check which rows are duplicates
print(df.duplicated())
Output:
Name Age
0 John 25
1 Sarah 30
2 John 25 <- duplicate of row 0
3 Mike 28
4 Sarah 30 <- duplicate of row 1
0 False
1 False
2 True <- duplicate!
3 False
4 True <- duplicate!
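By default, duplicated() leaves the first occurrence unmarked. It also accepts a keep parameter: keep=False flags every copy, which is handy when you want to inspect all members of each duplicate group. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'John', 'Mike', 'Sarah'],
    'Age': [25, 30, 25, 28, 30]
})

# keep=False marks ALL rows that have a twin, including the first occurrence
all_copies = df[df.duplicated(keep=False)]
print(all_copies)  # rows 0, 1, 2, 4 (everyone except Mike)
```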
Count Duplicates
code.py
# How many duplicates?
print(df.duplicated().sum()) # Output: 2
See the Duplicate Rows
code.py
# Show only the duplicate rows
duplicates = df[df.duplicated()]
print(duplicates)
Remove Duplicates
code.py
# Remove duplicates (keeps first occurrence)
clean_df = df.drop_duplicates()
print(clean_df)
Output:
Name Age
0 John 25
1 Sarah 30
3 Mike 28
Keep Last Instead of First
code.py
# Keep last occurrence instead of first
clean_df = df.drop_duplicates(keep='last')
print(clean_df)
Check Duplicates in Specific Columns
Sometimes only some columns should be unique:
code.py
df = pd.DataFrame({
'Email': ['a@mail.com', 'b@mail.com', 'a@mail.com'],
'Name': ['John', 'Sarah', 'Johnny']
})
# Check duplicates only in Email column
df['Is_Dup'] = df.duplicated(subset=['Email'])
print(df)
Output:
Email Name Is_Dup
0 a@mail.com John False
1 b@mail.com Sarah False
2 a@mail.com Johnny True <- same email
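Note that duplicated() compares values exactly, so 'A@Mail.com' and 'a@mail.com' count as different. If your data may vary in case or whitespace, normalizing first is a reasonable precaution. A sketch, assuming email comparison should be case-insensitive:

```python
import pandas as pd

df = pd.DataFrame({
    'Email': ['a@mail.com', ' A@Mail.com ', 'b@mail.com'],
    'Name': ['John', 'Johnny', 'Sarah']
})

# Exact comparison misses the near-duplicate
print(df.duplicated(subset=['Email']).sum())  # 0

# Normalize case and whitespace before checking
df['Email'] = df['Email'].str.strip().str.lower()
print(df.duplicated(subset=['Email']).sum())  # 1
```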
Remove Duplicates by Specific Columns
code.py
# Remove if same email (even if name different)
clean_df = df.drop_duplicates(subset=['Email'])
print(clean_df)
Output:
Email Name
0 a@mail.com John
1 b@mail.com Sarah
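A common variant: when each row carries a timestamp, sort by it first so drop_duplicates keeps the most recent record per email. The Signup_Date column here is a hypothetical addition for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Email': ['a@mail.com', 'b@mail.com', 'a@mail.com'],
    'Name': ['John', 'Sarah', 'Johnny'],
    'Signup_Date': pd.to_datetime(['2023-01-01', '2023-02-01', '2023-03-01'])
})

# Sort by date, then keep the last (newest) row for each email
latest = (df.sort_values('Signup_Date')
            .drop_duplicates(subset=['Email'], keep='last'))
print(latest)  # Sarah and Johnny remain; John's older row is dropped
```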
Count Unique Values
code.py
# How many unique names?
print(df['Name'].nunique())
# See unique values
print(df['Name'].unique())
# Count each value
print(df['Name'].value_counts())
Key Points
- duplicated() finds duplicate rows
- drop_duplicates() removes them
- keep='first' (default) keeps first occurrence
- keep='last' keeps last occurrence
- subset=['col'] checks only specific columns
- nunique() counts unique values
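Putting the key points together, one possible cleanup pass reports what it will drop before dropping it, so nothing disappears silently:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'John', 'Mike', 'Sarah'],
    'Age': [25, 30, 25, 28, 30]
})

# Report before removing
n_dups = df.duplicated().sum()
print(f"Dropping {n_dups} duplicate rows")  # Dropping 2 duplicate rows

# Remove and renumber the index so it stays consecutive
clean_df = df.drop_duplicates().reset_index(drop=True)
print(len(clean_df))  # 3
```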
Common Mistake
code.py
# This doesn't change the original df!
df.drop_duplicates()
# You need to reassign or use inplace
df = df.drop_duplicates()
# OR
df.drop_duplicates(inplace=True)
What's Next?
Learn to validate your data - check if values make sense and flag errors.