Python for Data Analysis — Complete Beginner's Guide

SQL gets you data. Python transforms it into insights. Here's everything you need to go from Python beginner to data-fluent analyst in one guide.

🐍

Why Python for Data Analysis?

Python has become the dominant language for data analysis — and for good reason. While SQL is perfect for querying databases, Python excels at everything that comes after: cleaning messy data, performing complex transformations, statistical analysis, and creating visualizations.

At companies like Swiggy, PhonePe, and Flipkart, data analysts use SQL to extract data and Python to analyze it. Here's why:

Flexibility: Python handles any data format — CSV, JSON, Excel, APIs, databases — without breaking a sweat. Need to merge data from three different sources with different structures? Python makes it straightforward.
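That three-source merge, for example, is only a couple of lines in pandas. Here's a minimal sketch — the sources, keys, and values are made up for illustration:

```python
import pandas as pd

# Three hypothetical sources with different structures
orders = pd.DataFrame({'order_id': [1001, 1002], 'city': ['Mumbai', 'Delhi']})
payments = pd.DataFrame({'order_id': [1001, 1002], 'amount': [2500, 3200]})
regions = pd.DataFrame({'city': ['Mumbai', 'Delhi'], 'region': ['West', 'North']})

# Chain merges on the shared keys to combine all three into one table
combined = orders.merge(payments, on='order_id').merge(regions, on='city')
print(combined)
```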

Libraries: The Python ecosystem offers specialized tools for every task:

  • Pandas: DataFrames for tabular data manipulation (like Excel on steroids)
  • NumPy: Fast numerical computations and array operations
  • Matplotlib/Seaborn: Publication-quality visualizations
  • Scikit-learn: Machine learning for predictions and clustering

Reproducibility: Unlike Excel, Python scripts are reproducible. Write your analysis once, run it on updated data anytime. No manual copy-pasting, no "I forgot which cells I edited" moments.

Think of it this way...

SQL is like a librarian who helps you find books. Python is what you do with those books — reading, highlighting, summarizing, and connecting ideas across multiple sources to write your own analysis.

⚙️

Setting Up Your Python Environment

Before writing code, you need Python installed and a way to run your analysis. Here's the quickest path:

Install Python with Anaconda

Anaconda is a Python distribution that includes Python + all major data science libraries pre-installed. Download it from anaconda.com/download — it's free.

Once installed, you can run Python in three ways:

1. Jupyter Notebook (recommended for beginners):

```bash
jupyter notebook
```

Opens in your browser. Combines code, output, and notes in one document. Perfect for exploratory analysis.

2. VS Code (recommended for serious projects): Install VS Code, add the Python extension, and run .py files directly. Better for scripts you'll reuse.

3. Command line (quick tests):

```bash
python
>>> 2 + 2
4
```

Essential Libraries to Install

If you're not using Anaconda, install these manually:

```bash
pip install pandas numpy matplotlib seaborn jupyter
```

Verify your installation:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print(f"Pandas: {pd.__version__}")
print(f"NumPy: {np.__version__}")
# Should print version numbers without errors
```

Pro Tip: Create a dedicated folder for your data projects (e.g., C:\data-projects\ or ~/data-projects/). Store your datasets, notebooks, and scripts there. Keep it organized from day one.


🚀

Your First Data Analysis in Python

Let's analyze real data from scratch. We'll use a simple Flipkart orders dataset to answer: Which city generates the most revenue?

Step 1: Import Libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

Step 2: Load Data

```python
# Load CSV file
orders = pd.read_csv('flipkart_orders.csv')

# View first 5 rows
print(orders.head())

# Check shape: (rows, columns)
print(f"Dataset has {orders.shape[0]} rows and {orders.shape[1]} columns")
```

Sample output:

```
   order_id       city  amount  order_date
0      1001     Mumbai    2500  2026-01-15
1      1002      Delhi    3200  2026-01-16
2      1003     Mumbai    1800  2026-01-16
3      1004  Bangalore    4100  2026-01-17
4      1005    Chennai    2900  2026-01-18
Dataset has 50000 rows and 4 columns
```

Step 3: Explore the Data

```python
# Get summary statistics
print(orders.describe())

# Check for missing values
print(orders.isnull().sum())

# Check data types
print(orders.dtypes)
```

Step 4: Analyze Revenue by City

```python
# Group by city and sum revenue
revenue_by_city = orders.groupby('city')['amount'].sum().sort_values(ascending=False)

print(revenue_by_city)
```

Output:

```
city
Mumbai       42350000
Delhi        38200000
Bangalore    35600000
Chennai      28900000
Name: amount, dtype: int64
```

Step 5: Visualize Results

```python
# Create bar chart
revenue_by_city.plot(kind='bar', color='steelblue', figsize=(10, 6))
plt.title('Revenue by City', fontsize=16)
plt.xlabel('City', fontsize=12)
plt.ylabel('Revenue (₹)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```

That's it! Five steps to go from raw data to insights with visualization. This workflow — load, explore, analyze, visualize — is what data analysts do daily.

🔑

Key Python Concepts for Analysts

You don't need to master all of Python — just the parts relevant to data work. Here are the essentials:

1. Variables and Data Types

```python
# Numbers
revenue = 125000
growth_rate = 0.23

# Strings
city = "Mumbai"
status = 'delivered'

# Booleans
is_active = True
has_discount = False

# Lists (ordered collection)
cities = ['Mumbai', 'Delhi', 'Bangalore']
revenues = [42000, 38000, 35000]

# Dictionaries (key-value pairs)
order = {
    'order_id': 1001,
    'city': 'Mumbai',
    'amount': 2500
}
```

2. Indexing and Slicing

```python
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Hyderabad']

# Access by index (starts at 0)
print(cities[0])   # Mumbai
print(cities[-1])  # Hyderabad (last element)

# Slicing: [start:end] (end not included)
print(cities[1:3])  # ['Delhi', 'Bangalore']
print(cities[:2])   # First 2: ['Mumbai', 'Delhi']
print(cities[2:])   # From index 2 onward: ['Bangalore', 'Chennai', 'Hyderabad']
```
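The same zero-based, end-exclusive rules carry over to DataFrames via `.iloc`. A quick sketch with a made-up table:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai']})

# .iloc uses the same positions and slices as Python lists
print(df.iloc[0]['city'])             # first row's city
print(df.iloc[1:3]['city'].tolist())  # rows 1 and 2 (end not included)
```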

3. Loops and Conditionals

```python
# For loop
cities = ['Mumbai', 'Delhi', 'Bangalore']
for city in cities:
    print(f"Analyzing {city} data...")

# If-elif-else
amount = 2500
if amount > 3000:
    category = 'High'
elif amount > 1000:
    category = 'Medium'
else:
    category = 'Low'
```

4. Functions (Reusable Code Blocks)

```python
def calculate_gst(amount, rate=0.18):
    """Calculate GST on an amount"""
    gst = amount * rate
    total = amount + gst
    return total

# Use the function
order_amount = 2500
final_amount = calculate_gst(order_amount)
print(f"Total with GST: ₹{final_amount}")  # ₹2950.0
```
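Once defined, a function like this can run over an entire DataFrame column with `.apply` — a sketch with made-up amounts:

```python
import pandas as pd

def calculate_gst(amount, rate=0.18):
    """Calculate total including GST."""
    return amount + amount * rate

df = pd.DataFrame({'amount': [1000, 2500]})

# Apply the function to every value in the column
df['total'] = df['amount'].apply(calculate_gst)
print(df['total'].tolist())
```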

5. List Comprehensions (Concise Loops)

```python
# Long way
amounts = [1000, 1500, 2000, 2500]
with_gst = []
for amount in amounts:
    with_gst.append(amount * 1.18)

# Short way (list comprehension)
with_gst = [amount * 1.18 for amount in amounts]
print(with_gst)  # [1180.0, 1770.0, 2360.0, 2950.0]

# With condition
high_value = [amount for amount in amounts if amount > 1500]
print(high_value)  # [2000, 2500]
```

For Analysts: You'll use loops less often than you think. Pandas has vectorized operations that apply functions to entire columns at once — much faster than looping row by row.
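For example, adding 18% GST to a whole column is one vectorized expression, no loop required (the column name here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'amount': [1000, 1500, 2000, 2500]})

# Loop version: Python iterates value by value
looped = [amount * 1.18 for amount in df['amount']]

# Vectorized version: one operation over the whole column at once
df['with_gst'] = df['amount'] * 1.18

print(df['with_gst'].tolist())
```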

🐼

Introduction to Pandas DataFrames

Pandas is the core library for data analysis in Python. Its central data structure is the DataFrame — a table with rows and columns, like a spreadsheet or SQL result set.

Creating DataFrames

```python
import pandas as pd

# From a dictionary
data = {
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai'],
    'orders': [1200, 1100, 950, 800],
    'revenue': [42000, 38000, 35000, 28000]
}
df = pd.DataFrame(data)
print(df)
```

Output:

```
        city  orders  revenue
0     Mumbai    1200    42000
1      Delhi    1100    38000
2  Bangalore     950    35000
3    Chennai     800    28000
```

Basic DataFrame Operations

```python
# Select a column (returns a Series)
print(df['city'])

# Select multiple columns (returns a DataFrame)
print(df[['city', 'revenue']])

# Filter rows
high_revenue = df[df['revenue'] > 35000]
print(high_revenue)

# Add a new column
df['avg_order_value'] = df['revenue'] / df['orders']
print(df)
```

Reading from Files

```python
# CSV files (most common)
df = pd.read_csv('orders.csv')

# Excel files
df = pd.read_excel('sales_data.xlsx', sheet_name='Q1_Sales')

# From SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM orders', conn)

# JSON files (APIs often return JSON)
df = pd.read_json('api_response.json')
```

Essential DataFrame Methods

```python
# View data
df.head(10)        # First 10 rows
df.tail(5)         # Last 5 rows
df.sample(10)      # Random 10 rows

# Get info
df.shape           # (rows, columns)
df.info()          # Column types and memory usage
df.describe()      # Statistical summary of numeric columns

# Column operations
df.columns         # List of column names
df.dtypes          # Data type of each column
df['amount'].sum()    # Sum of amount column
df['amount'].mean()   # Average
df['amount'].median() # Median
df['amount'].max()    # Maximum value
```

Think of it this way...

If you're comfortable with SQL, think of a DataFrame as the result of a SELECT query that you can continue manipulating in memory. df.groupby('city') is like GROUP BY city, df[df['amount'] > 1000] is like WHERE amount > 1000.
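Putting that side by side, here's a familiar SQL query and its pandas equivalent (the table and values are illustrative):

```python
import pandas as pd

# SQL: SELECT city, SUM(amount) FROM orders
#      WHERE amount > 1000 GROUP BY city
df = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
    'amount': [2500, 3200, 800]
})

result = (
    df[df['amount'] > 1000]      # WHERE amount > 1000
    .groupby('city')['amount']   # GROUP BY city
    .sum()                       # SUM(amount)
)
print(result)
```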
