What is a Pandas DataFrame?
A Pandas DataFrame is a two-dimensional labeled data structure whose columns can hold different types, much like a spreadsheet or SQL table inside Python. It is the core data structure for data analysis in Pandas.
Anatomy of a DataFrame
import pandas as pd
# Create a sample DataFrame
data = {
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'customer': ['Priya', 'Rahul', 'Anjali', 'Vikas', 'Neha'],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Chennai'],
    'amount': [2500, 3200, 1800, 4100, 2900],
    'status': ['delivered', 'delivered', 'pending', 'delivered', 'cancelled']
}
df = pd.DataFrame(data)
print(df)
Output:
   order_id customer       city  amount     status
0      1001    Priya     Mumbai    2500  delivered
1      1002    Rahul      Delhi    3200  delivered
2      1003   Anjali     Mumbai    1800    pending
3      1004    Vikas  Bangalore    4100  delivered
4      1005     Neha    Chennai    2900  cancelled
Key components:
- Index: The row labels (0, 1, 2, 3, 4 by default)
- Columns: The column names ('order_id', 'customer', 'city', 'amount', 'status')
- Values: The actual data in each cell
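Each of these three components is available as an attribute on the DataFrame. A small self-contained sketch (the sample frame is rebuilt here so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [1001, 1002, 1003],
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
    'amount': [2500, 3200, 1800],
})

# Index: the row labels (a RangeIndex 0..2 by default)
print(df.index.tolist())    # [0, 1, 2]
# Columns: an Index holding the column names
print(df.columns.tolist())  # ['order_id', 'city', 'amount']
# Values: the underlying data as a NumPy array
print(df.values.shape)      # (3, 3)
```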
Creating DataFrames from Different Sources
# From a list of dictionaries (each dict is a row)
orders = [
    {'order_id': 1001, 'city': 'Mumbai', 'amount': 2500},
    {'order_id': 1002, 'city': 'Delhi', 'amount': 3200},
    {'order_id': 1003, 'city': 'Mumbai', 'amount': 1800}
]
df = pd.DataFrame(orders)
# From a CSV file (most common in real work)
df = pd.read_csv('swiggy_orders.csv')
# From Excel
df = pd.read_excel('sales_data.xlsx', sheet_name='March_2026')
# From a SQL query
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query("SELECT * FROM orders WHERE city = 'Mumbai'", conn)
# From JSON (API responses)
df = pd.read_json('api_data.json')
A DataFrame is like a smart spreadsheet. It has the structure of Excel (rows, columns, formulas) plus the power of programming: you can process millions of rows in seconds, apply complex logic, and keep a reproducible record of every step.
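The file names above (swiggy_orders.csv, sales_data.xlsx, and so on) are placeholders, so here is a runnable sketch of the same pattern using an in-memory CSV via io.StringIO, which needs no file on disk:

```python
import io
import pandas as pd

# Simulate a CSV file in memory (stands in for a real file path)
csv_text = """order_id,city,amount
1001,Mumbai,2500
1002,Delhi,3200"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)             # (2, 3)
print(df['city'].tolist())  # ['Mumbai', 'Delhi']
```

Whatever the source, the result is the same kind of object, so everything in the rest of this article applies unchanged.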
Selecting Columns and Rows
The most common operation in data analysis: getting exactly the data you need. Pandas offers multiple ways to select data.
Selecting Columns
# Single column (returns a Series)
cities = df['city']
print(type(cities)) # <class 'pandas.core.series.Series'>
# Multiple columns (returns a DataFrame)
subset = df[['customer', 'city', 'amount']]
print(subset)
# All columns except some
df_without_status = df.drop(columns=['status'])
Selecting Rows by Position with .iloc
.iloc uses integer position (0-indexed, like Python lists):
# First row
first_row = df.iloc[0]
# First 3 rows
first_three = df.iloc[0:3] # or df.iloc[:3]
# Last row
last_row = df.iloc[-1]
# Specific rows (1st, 3rd, 5th)
specific = df.iloc[[0, 2, 4]]
# Rows and columns by position
# Rows 0-2, columns 0-1
subset = df.iloc[0:3, 0:2]
Selecting Rows by Label with .loc
.loc uses labels (index and column names). Unlike .iloc, label slices include both endpoints: df.loc[0:2] returns three rows, while df.iloc[0:2] returns two.
# Select rows by index label (if index is set to a custom value)
df_indexed = df.set_index('order_id')
order_1001 = df_indexed.loc[1001]
# Select rows and columns by label
# Rows 0-2, columns 'customer' and 'amount'
subset = df.loc[0:2, ['customer', 'amount']]
# Select all rows, specific columns
cities_amounts = df.loc[:, ['city', 'amount']]
Filtering Rows with Boolean Indexing
This is how you'll select data most often — using conditions:
# Orders over ₹3000
high_value = df[df['amount'] > 3000]
# Orders from Mumbai
mumbai = df[df['city'] == 'Mumbai']
# Multiple conditions with & (and) or | (or)
mumbai_delivered = df[(df['city'] == 'Mumbai') & (df['status'] == 'delivered')]
# OR condition: Mumbai or Delhi
metro = df[(df['city'] == 'Mumbai') | (df['city'] == 'Delhi')]
# Using .isin() for multiple values (cleaner than multiple OR)
metro = df[df['city'].isin(['Mumbai', 'Delhi', 'Bangalore'])]
# NOT condition with ~
not_cancelled = df[~(df['status'] == 'cancelled')]
# Or simpler:
not_cancelled = df[df['status'] != 'cancelled']
Important: When using multiple conditions, always wrap each condition in parentheses and use & for AND, | for OR. Python's and and or keywords don't work with Pandas boolean indexing; they raise a ValueError because the truth value of a whole Series is ambiguous.
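For longer filters, df.query() expresses the same conditions as a string, and inside the query string plain and/or are allowed. A sketch on a small sample frame:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore'],
    'amount': [2500, 3200, 1800, 4100],
    'status': ['delivered', 'delivered', 'pending', 'delivered'],
})

# Same filter two ways: boolean indexing vs .query()
mask_version = df[(df['city'] == 'Mumbai') & (df['status'] == 'delivered')]
query_version = df.query("city == 'Mumbai' and status == 'delivered'")

print(mask_version.equals(query_version))  # True
```

Which to use is largely taste; boolean indexing is the more universal idiom, while query strings can be easier to read once filters grow past two or three conditions.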
Adding and Modifying Data
DataFrames are mutable — you can add columns, modify values, and insert rows.
Adding New Columns
# Add a constant column
df['year'] = 2026
# Calculate from existing columns
df['gst'] = df['amount'] * 0.18
df['total_with_gst'] = df['amount'] + df['gst']
# Conditional column with .apply()
def categorize_order(amount):
    if amount > 3000:
        return 'High'
    elif amount > 1500:
        return 'Medium'
    else:
        return 'Low'
df['category'] = df['amount'].apply(categorize_order)
# Conditional column with np.where (faster for simple conditions)
import numpy as np
df['is_high_value'] = np.where(df['amount'] > 3000, 'Yes', 'No')
# Multiple conditions with np.select
conditions = [
    df['amount'] > 3000,
    df['amount'] > 1500,
    df['amount'] <= 1500
]
choices = ['High', 'Medium', 'Low']
df['category'] = np.select(conditions, choices, default='Unknown')
Modifying Existing Values
# Change all values in a column
df['city'] = df['city'].str.upper() # MUMBAI, DELHI, etc.
# Change specific values
df.loc[df['city'] == 'MUMBAI', 'city'] = 'Mumbai'
# Replace specific values
df['status'] = df['status'].replace({'pending': 'in_progress', 'cancelled': 'failed'})
# Replace multiple values with a mapping
city_mapping = {'Mumbai': 'MH', 'Delhi': 'DL', 'Bangalore': 'KA', 'Chennai': 'TN'}
df['state_code'] = df['city'].map(city_mapping)
Adding Rows
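A practical note for this section: when adding many rows (say, inside a loop), repeated pd.concat calls copy the whole frame each time. It is usually faster to collect plain dicts in a list and build the DataFrame once at the end. A sketch with hypothetical generated orders:

```python
import pandas as pd

rows = []
for i in range(3):
    # Collect plain dicts first; no DataFrame work inside the loop
    rows.append({'order_id': 1001 + i, 'amount': 1000 * (i + 1)})

# One DataFrame construction at the end instead of a concat per row
df = pd.DataFrame(rows)
print(df.shape)  # (3, 2)
```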
# Create a new row as a dictionary
new_order = {
    'order_id': 1006,
    'customer': 'Amit',
    'city': 'Pune',
    'amount': 3500,
    'status': 'delivered'
}
# Append using pd.concat (preferred in modern Pandas)
new_row_df = pd.DataFrame([new_order])
df = pd.concat([df, new_row_df], ignore_index=True)
# Note: df.append() was deprecated in Pandas 1.4.0 and removed in Pandas 2.0
Dropping Columns and Rows
# Drop columns
df = df.drop(columns=['gst', 'total_with_gst'])
# Drop rows by index
df = df.drop([0, 2]) # Drop rows at index 0 and 2
# Drop rows by condition (keep only what you want)
df = df[df['status'] != 'cancelled']  # Remove cancelled orders
Inspecting and Summarizing DataFrames
Before analyzing data, you need to understand what you're working with. These methods give you quick insights.
Basic Information
# First and last rows
df.head() # First 5 rows
df.head(10) # First 10 rows
df.tail() # Last 5 rows
# Random sample
df.sample(5) # 5 random rows (useful for large datasets)
# Shape: (rows, columns)
print(df.shape) # (1000, 5) means 1000 rows, 5 columns
# Column names
print(df.columns) # Index(['order_id', 'customer', 'city', 'amount', 'status'], dtype='object')
# Data types of each column
print(df.dtypes)
# order_id int64
# customer object
# city object
# amount int64
# status object
# Memory usage
print(df.memory_usage(deep=True))
# Concise summary
df.info()
Example output of df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   order_id  1000 non-null   int64
 1   customer  1000 non-null   object
 2   city      998 non-null    object
 3   amount    1000 non-null   int64
 4   status    1000 non-null   object
dtypes: int64(2), object(3)
memory usage: 39.2+ KB
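To act on that dtype information programmatically rather than just reading it, select_dtypes filters columns by type. A sketch on a small sample frame:

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [1001, 1002],
    'customer': ['Priya', 'Rahul'],
    'amount': [2500, 3200],
})

# Split column names into numeric and non-numeric groups
numeric_cols = df.select_dtypes(include='number').columns.tolist()
text_cols = df.select_dtypes(exclude='number').columns.tolist()
print(numeric_cols)  # ['order_id', 'amount']
print(text_cols)     # ['customer']
```

This is handy as a building block, e.g. applying describe() or scaling only to the numeric columns.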
Statistical Summary
# Summary statistics for numeric columns
df.describe()
Output:
          order_id       amount
count  1000.000000  1000.000000
mean   1500.500000  2847.300000
std     288.819436   976.234111
min    1001.000000   500.000000
25%    1250.750000  2100.000000
50%    1500.500000  2800.000000
75%    1750.250000  3600.000000
max    2000.000000  5000.000000
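By default describe() only covers numeric columns. Passing include='object' summarizes the text columns instead, reporting the count, the number of unique values, the most frequent value (top), and its frequency. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Chennai'],
    'amount': [2500, 3200, 1800, 2900],
})

text_summary = df.describe(include='object')
print(text_summary.loc['unique', 'city'])  # 3 distinct cities
print(text_summary.loc['top', 'city'])     # 'Mumbai' (most frequent)
```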
Value Counts and Unique Values
# Count occurrences of each unique value
print(df['city'].value_counts())
# Mumbai 350
# Delhi 280
# Bangalore 220
# Chennai 150
# Unique values
print(df['city'].unique())
# array(['Mumbai', 'Delhi', 'Bangalore', 'Chennai'], dtype=object)
# Number of unique values
print(df['city'].nunique()) # 4
# Check for missing values
print(df.isnull().sum())
# order_id 0
# customer 0
# city 2
# amount 0
# status 0
# Percentage of missing values
print((df.isnull().sum() / len(df)) * 100)
When Swiggy analysts receive a new dataset of restaurant orders, they run df.info(), df.describe(), and df.isnull().sum() first. This 30-second check reveals data types, missing values, and outliers before spending hours on flawed analysis.
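One more trick worth knowing here: value_counts can report shares instead of raw counts via normalize=True, which answers questions like "what fraction of orders were cancelled?" directly. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'status': ['delivered', 'delivered', 'pending', 'cancelled'],
})

# normalize=True divides each count by the total number of rows
shares = df['status'].value_counts(normalize=True)
print(shares['delivered'])  # 0.5
print(shares['cancelled'])  # 0.25
```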
Sorting and Ranking Data
Sorting reveals patterns: top customers, worst-performing products, chronological order for time-series analysis.
Sorting by Values
# Sort by one column (ascending)
df_sorted = df.sort_values('amount')
# Sort descending
df_sorted = df.sort_values('amount', ascending=False)
# Sort by multiple columns
df_sorted = df.sort_values(['city', 'amount'], ascending=[True, False])
# Sorts by city A-Z, then within each city by amount highest-first
# Sort by index
df_sorted = df.sort_index()
Ranking Values
# Add a rank column (1 = highest amount; ties share the average rank by default,
# pass method='min' or method='dense' to change this)
df['amount_rank'] = df['amount'].rank(ascending=False)
# Rank within groups (rank by amount within each city)
df['city_rank'] = df.groupby('city')['amount'].rank(ascending=False)
# Display top 3 orders per city
top_3_per_city = df[df['city_rank'] <= 3].sort_values(['city', 'city_rank'])
print(top_3_per_city[['order_id', 'city', 'amount', 'city_rank']])
Finding Top N and Bottom N
# Top 10 orders by amount
top_10 = df.nlargest(10, 'amount')
# Bottom 10 orders
bottom_10 = df.nsmallest(10, 'amount')
# Top 5 per city (requires groupby)
top_5_per_city = df.groupby('city').apply(lambda x: x.nlargest(5, 'amount')).reset_index(drop=True)
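An alternative to the groupby-apply pattern: sort first, then keep the first N rows of each group with groupby().head(), which avoids apply entirely. A sketch taking the top 2 orders per city:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Mumbai', 'Mumbai', 'Mumbai', 'Delhi', 'Delhi'],
    'amount': [2500, 1800, 4100, 3200, 2900],
})

# Sort by amount descending, then keep the first 2 rows of each city group
top_2_per_city = (
    df.sort_values('amount', ascending=False)
      .groupby('city')
      .head(2)
)
print(sorted(top_2_per_city['amount'].tolist()))  # [2500, 2900, 3200, 4100]
```

Both approaches give the same rows; the sort-then-head version tends to be faster on large frames and sidesteps apply's per-group overhead.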