Join Operations

What is Join?

Join is like merge, but it uses the index (row labels) to connect DataFrames.

Think of it like matching books by their library code number.

Basic Join Example

code.py

import pandas as pd

# Create DataFrames with named index
students = pd.DataFrame({
    'name': ['John', 'Sarah', 'Mike']
}, index=['S1', 'S2', 'S3'])

scores = pd.DataFrame({
    'marks': [85, 92, 78]
}, index=['S1', 'S2', 'S3'])

print("Students:")
print(students)
print("\nScores:")
print(scores)

Output:

Students:
     name
S1   John
S2  Sarah
S3   Mike

Scores:
    marks
S1     85
S2     92
S3     78

Now join them:

code.py

result = students.join(scores)
print(result)

Output:

     name  marks
S1   John     85
S2  Sarah     92
S3   Mike     78

Joined using the index (S1, S2, S3).

Types of Joins

Left Join (Default)

Keep all rows from the left DataFrame.

code.py

students = pd.DataFrame({
    'name': ['John', 'Sarah', 'Mike', 'Emma']
}, index=['S1', 'S2', 'S3', 'S4'])

scores = pd.DataFrame({
    'marks': [85, 92]
}, index=['S1', 'S2'])

result = students.join(scores, how='left')
print(result)

Output:

     name  marks
S1   John   85.0
S2  Sarah   92.0
S3   Mike    NaN
S4   Emma    NaN

All students kept. Missing scores are NaN.

Right Join

Keep all rows from the right DataFrame.

code.py

result = students.join(scores, how='right')
print(result)

Output:

     name  marks
S1   John     85
S2  Sarah     92

Inner Join

Keep only matching rows.

code.py

result = students.join(scores, how='inner')
print(result)

Output:

     name  marks
S1   John     85
S2  Sarah     92

Outer Join

Keep all rows from both DataFrames.

code.py

students = pd.DataFrame({
    'name': ['John', 'Sarah']
}, index=['S1', 'S2'])

scores = pd.DataFrame({
    'marks': [85, 78]
}, index=['S1', 'S3'])

result = students.join(scores, how='outer')
print(result)

Output:

     name  marks
S1   John   85.0
S2  Sarah    NaN
S3    NaN   78.0

Join Multiple DataFrames

code.py

names = pd.DataFrame({'name': ['John', 'Sarah']}, index=['S1', 'S2'])
ages = pd.DataFrame({'age': [25, 30]}, index=['S1', 'S2'])
cities = pd.DataFrame({'city': ['NYC', 'LA']}, index=['S1', 'S2'])

# Join all three
result = names.join([ages, cities])
print(result)

Output:

     name  age city
S1   John   25  NYC
S2  Sarah   30   LA

Join with Column Name Conflicts

When both DataFrames have same column name:

code.py

df1 = pd.DataFrame({'score': [85, 92]}, index=['S1', 'S2'])
df2 = pd.DataFrame({'score': [90, 88]}, index=['S1', 'S2'])

result = df1.join(df2, lsuffix='_math', rsuffix='_english')
print(result)

Output:

    score_math  score_english
S1          85             90
S2          92             88

lsuffix = left suffix, rsuffix = right suffix

Join on a Column Instead of Index

code.py

students = pd.DataFrame({
    'student_id': ['S1', 'S2', 'S3'],
    'name': ['John', 'Sarah', 'Mike']
})

scores = pd.DataFrame({
    'marks': [85, 92, 78]
}, index=['S1', 'S2', 'S3'])

# Set student_id as index, then join
result = students.set_index('student_id').join(scores)
print(result)

Output:

              name  marks
student_id
S1            John     85
S2           Sarah     92
S3            Mike     78

Practice Example

code.py

import pandas as pd

# Product info
products = pd.DataFrame({
    'name': ['Laptop', 'Mouse', 'Keyboard'],
    'category': ['Electronics', 'Electronics', 'Electronics']
}, index=['P1', 'P2', 'P3'])

# Product prices
prices = pd.DataFrame({
    'price': [1000, 25, 75]
}, index=['P1', 'P2', 'P3'])

# Product stock
stock = pd.DataFrame({
    'quantity': [50, 200, 100]
}, index=['P1', 'P2', 'P3'])

# Join all together
result = products.join([prices, stock])
print(result)

# Calculate total value
result['total_value'] = result['price'] * result['quantity']
print("\nWith total value:")
print(result)

Key Points

df1.join(df2) joins using the index
Default is left join (keeps all from left)
Use how='inner', 'outer', 'right' for other types
Use lsuffix and rsuffix for same column names
Join multiple DataFrames with a list: df1.join([df2, df3])

Join vs Merge

Feature	Join	Merge
Joins on	Index (default)	Column
Syntax	df1.join(df2)	pd.merge(df1, df2)
Speed	Faster for index	Better for columns

Common Mistakes

Mistake 1: Forgetting to set index

code.py

# If you want to join on 'id' column
# First set it as index
df1 = df1.set_index('id')
df2 = df2.set_index('id')
result = df1.join(df2)

Mistake 2: Same column names without suffix

code.py

# This will error if both have 'score' column
df1.join(df2)

# Add suffixes
df1.join(df2, lsuffix='_left', rsuffix='_right')

What's Next?

You learned concat, merge, and join. Next, you'll learn when to use each one.