What is NumPy and Why Analysts Need It
NumPy (Numerical Python) is the foundation of Python's scientific computing stack. It provides fast, memory-efficient arrays and mathematical functions — the engine behind Pandas, scikit-learn, and most data science libraries.
Why NumPy Matters for Analysts
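Speed: NumPy operations are often 10-100x faster than equivalent Python loops over lists because they run in optimized, compiled C code. Processing millions of numbers? NumPy typically does it in milliseconds.
Memory Efficiency: NumPy arrays use less memory than Python lists. A list of 1 million integers takes several times more memory than an int64 NumPy array (roughly 4x on 64-bit CPython), because every list element is a separate Python object plus a pointer to it.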
Vectorization: Apply operations to entire arrays without loops. Instead of iterating through 1 million values, NumPy processes them all at once.
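The memory claim above can be checked directly. The following is a minimal sketch, assuming 64-bit CPython and an int64 array; the variable names are illustrative, and the exact ratio depends on the Python version and the integer values stored.

```python
import sys
import numpy as np

n = 1_000_000
values = list(range(n))
arr = np.arange(n, dtype=np.int64)

# A list stores one pointer per element, and each integer is its own Python object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# nbytes counts only the raw data buffer: n elements * 8 bytes for int64.
array_bytes = arr.nbytes

print(f"list:  {list_bytes / 1e6:.1f} MB")
print(f"array: {array_bytes / 1e6:.1f} MB")
print(f"ratio: {list_bytes / array_bytes:.1f}x")
```

Choosing a smaller dtype such as int32 would halve the array's footprint again, at the cost of a smaller value range.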
import numpy as np
# Python list approach (slow)
amounts = [2500, 3200, 1800, 4100, 2900]
with_gst = []
for amount in amounts:
    with_gst.append(amount * 1.18)
# NumPy array approach (fast, clean)
amounts = np.array([2500, 3200, 1800, 4100, 2900])
with_gst = amounts * 1.18 # Vectorized operation — all at once
print(with_gst) # [2950. 3776. 2124. 4838. 3422.]
When to Use NumPy vs Pandas
Use NumPy when:
- You need pure numerical operations (math, statistics, linear algebra)
- You're working with multi-dimensional arrays (matrices, images, tensors)
- Performance is critical and you don't need labeled rows/columns
Use Pandas when:
- You're working with tabular data (rows and columns with labels)
- You need to merge, group, or pivot data
- You want to handle missing data elegantly
In Practice: Most analysts use both — NumPy powers Pandas under the hood, and Pandas makes NumPy easier to use for tabular data.
If Pandas is Excel with programming, NumPy is a high-performance calculator. Pandas gives you tables and labels; NumPy gives you raw speed and mathematical power.
NumPy Arrays — The Core Data Structure
A NumPy array is a grid of values, all of the same type. Unlike Python lists, arrays are fixed-size and homogeneous (all elements must be the same data type).
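To make "homogeneous" concrete, here is a small sketch of what happens when a list with mixed types is passed to np.array: NumPy picks one dtype that can hold every element. The variable names are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])     # int and float together: everything becomes float
text = np.array([1, "two", 3])    # numbers and a string: everything becomes a string

print(ints.dtype)        # an integer dtype (int64 on most platforms)
print(mixed.dtype)       # float64
print(text.dtype.kind)   # U (unicode string)

# astype returns a converted copy when a different dtype is needed.
as_float = ints.astype(np.float64)
print(as_float)          # [1. 2. 3.]
```

An unexpected string or object dtype is usually a sign that non-numeric values slipped into the data, so checking dtype early is a cheap sanity test.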
Creating Arrays
import numpy as np
# From a Python list
arr = np.array([1, 2, 3, 4, 5])
print(arr) # [1 2 3 4 5]
# 2D array (matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix)
# [[1 2 3]
# [4 5 6]
# [7 8 9]]
# Array of zeros
zeros = np.zeros(5) # [0. 0. 0. 0. 0.]
zeros_matrix = np.zeros((3, 4)) # 3 rows, 4 columns
# Array of ones
ones = np.ones(5) # [1. 1. 1. 1. 1.]
# Array with a range of values
range_arr = np.arange(0, 10, 2) # [0 2 4 6 8] (start, stop, step)
# Array with evenly spaced values
linspace = np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1. ] (start, stop, count)
# Random arrays
random_arr = np.random.rand(5) # 5 random values between 0 and 1
random_int = np.random.randint(1, 100, size=10) # 10 random integers between 1 and 99
Array Attributes
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape) # (2, 3) — 2 rows, 3 columns
print(arr.ndim) # 2 — number of dimensions
print(arr.size) # 6 — total number of elements
print(arr.dtype) # int64 on most platforms (data type of elements)
Array Indexing and Slicing
arr = np.array([10, 20, 30, 40, 50])
# Indexing (like Python lists)
print(arr[0]) # 10 (first element)
print(arr[-1]) # 50 (last element)
# Slicing
print(arr[1:4]) # [20 30 40] (index 1 to 3)
print(arr[:3]) # [10 20 30] (first 3)
print(arr[2:]) # [30 40 50] (from index 2 onward)
# 2D array indexing
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix[0, 0]) # 1 (row 0, column 0)
print(matrix[1, 2]) # 6 (row 1, column 2)
print(matrix[:, 1]) # [2 5 8] (all rows, column 1)
print(matrix[1, :]) # [4 5 6] (row 1, all columns)
Boolean Indexing (Filtering)
amounts = np.array([2500, 3200, 1800, 4100, 2900])
# Filter: amounts greater than 3000
high_value = amounts[amounts > 3000]
print(high_value) # [3200 4100]
# Multiple conditions
medium = amounts[(amounts > 2000) & (amounts < 4000)]
print(medium) # [2500 3200 2900]
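The filter expressions above work because a comparison on an array produces a boolean array (a mask) of the same length. A short sketch of using the mask on its own, reusing the same amounts; the names mask, extremes, and not_high are illustrative.

```python
import numpy as np

amounts = np.array([2500, 3200, 1800, 4100, 2900])

# The comparison alone yields one True/False per element.
mask = amounts > 3000
print(mask)         # [False  True False  True False]

# True counts as 1, so sum() counts matches and mean() gives their share.
print(mask.sum())   # 2
print(mask.mean())  # 0.4

# | is element-wise OR and ~ is element-wise NOT; parentheses are required.
extremes = amounts[(amounts < 2000) | (amounts > 4000)]
print(extremes)     # [1800 4100]
not_high = amounts[~mask]
print(not_high)     # [2500 1800 2900]
```

Python's `and`, `or`, and `not` keywords do not work element-wise on arrays, which is why `&`, `|`, and `~` are used instead.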
Array Operations and Vectorization
NumPy's superpower is vectorization — applying operations to entire arrays without explicit loops.
Arithmetic Operations
amounts = np.array([2500, 3200, 1800, 4100, 2900])
# Scalar operations (applied to every element)
with_gst = amounts * 1.18
print(with_gst) # [2950. 3776. 2124. 4838. 3422.]
discounted = amounts - 500
print(discounted) # [2000 2700 1300 3600 2400]
# Element-wise array operations
revenue_day1 = np.array([45000, 38000, 52000])
revenue_day2 = np.array([48000, 39000, 55000])
total_revenue = revenue_day1 + revenue_day2
print(total_revenue) # [93000 77000 107000]
growth = (revenue_day2 - revenue_day1) / revenue_day1 * 100
print(growth) # [ 6.66666667 2.63157895 5.76923077]
Aggregation Functions
amounts = np.array([2500, 3200, 1800, 4100, 2900])
print(amounts.sum()) # 14500 (total)
print(amounts.mean()) # 2900.0 (average)
# Arrays have no .median() method (amounts.median() raises AttributeError); use np.median
print(np.median(amounts)) # 2900.0 (middle value)
print(amounts.std()) # 761.58 (population standard deviation)
print(amounts.min()) # 1800 (minimum)
print(amounts.max()) # 4100 (maximum)
print(amounts.argmin()) # 2 (index of minimum)
print(amounts.argmax()) # 3 (index of maximum)
# Percentiles
print(np.percentile(amounts, 25)) # 2500.0 (25th percentile)
print(np.percentile(amounts, 75)) # 3200.0 (75th percentile)
Axis-Wise Operations on 2D Arrays
# City revenue by day (rows=cities, columns=days)
revenue = np.array([
    [45000, 48000, 52000], # Mumbai
    [38000, 39000, 41000], # Delhi
    [35000, 37000, 36000] # Bangalore
])
# Total revenue per city (sum across columns)
city_totals = revenue.sum(axis=1)
print(city_totals) # [145000 118000 108000]
# Total revenue per day (sum across rows)
day_totals = revenue.sum(axis=0)
print(day_totals) # [118000 124000 129000]
# Average revenue per city
city_avg = revenue.mean(axis=1)
print(city_avg) # [48333.33 39333.33 36000.]
Axis Reminder:
- axis=0: operate down rows (column-wise aggregation)
- axis=1: operate across columns (row-wise aggregation)
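One way to keep the axis rule straight is to look at the shape of the result: the axis you aggregate over is the one that disappears. The sketch below reuses the city revenue figures; two_cities and day_share are illustrative names, and keepdims is a standard parameter of NumPy reductions.

```python
import numpy as np

revenue = np.array([
    [45000, 48000, 52000],  # Mumbai
    [38000, 39000, 41000],  # Delhi
    [35000, 37000, 36000],  # Bangalore
])

# Take the first two cities so rows and columns have different lengths.
two_cities = revenue[:2]               # shape (2, 3)
print(two_cities.sum(axis=0).shape)    # (3,) one total per day
print(two_cities.sum(axis=1).shape)    # (2,) one total per city

# keepdims=True keeps the collapsed axis with length 1, so the totals
# line up against the original rows in later arithmetic.
city_totals = revenue.sum(axis=1, keepdims=True)   # shape (3, 1)
day_share = revenue / city_totals                  # each row now sums to 1
print(day_share.round(3))
```

Without keepdims, the totals would have shape (3,) and would be matched against columns rather than rows in the division.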
Universal Functions (ufuncs)
NumPy provides fast mathematical functions that work element-wise:
amounts = np.array([100, 1000, 10000, 100000])
# Logarithm (useful for skewed data)
log_amounts = np.log10(amounts)
print(log_amounts) # [2. 3. 4. 5.]
# Square root
sqrt_amounts = np.sqrt(amounts)
print(sqrt_amounts) # [ 10. 31.62 100. 316.23]
# Exponential
exp_vals = np.exp([1, 2, 3])
print(exp_vals) # [ 2.72 7.39 20.09]
# Rounding
values = np.array([2.3, 4.7, 5.5, 6.2])
print(np.round(values)) # [2. 5. 6. 6.]
print(np.floor(values)) # [2. 4. 5. 6.]
print(np.ceil(values)) # [3. 5. 6. 7.]
Statistical Functions for Analysts
NumPy includes functions for common statistical calculations — essential for exploratory analysis.
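One detail worth knowing when comparing results across tools: np.std and np.var compute the population versions by default (dividing by n), while Pandas defaults to the sample versions (dividing by n - 1). A brief sketch using the ddof parameter; the variable names are illustrative.

```python
import numpy as np

amounts = np.array([450, 680, 520, 890, 340, 720, 550, 480, 650, 920])

# ddof=0 (the NumPy default) divides by n: population standard deviation.
pop_std = np.std(amounts)

# ddof=1 divides by n - 1: sample standard deviation.
sample_std = np.std(amounts, ddof=1)

print(f"Population std: {pop_std:.2f}")   # 178.66
print(f"Sample std:     {sample_std:.2f}")  # 188.33
```

For small samples the gap is noticeable, so it helps to state which version a report uses.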
Descriptive Statistics
# Zomato order amounts
amounts = np.array([450, 680, 520, 890, 340, 720, 550, 480, 650, 920])
# Central tendency
mean = np.mean(amounts) # 620.0 (average)
median = np.median(amounts) # 600.0 (middle value)
# Spread
std = np.std(amounts) # 178.66 (population standard deviation)
var = np.var(amounts) # 31920.0 (population variance)
range_val = np.ptp(amounts) # 580 (peak-to-peak: max - min)
# Percentiles/Quantiles
q25 = np.percentile(amounts, 25) # 490.0 (25th percentile)
q75 = np.percentile(amounts, 75) # 710.0 (75th percentile)
IQR = q75 - q25 # 220.0 (interquartile range)
print(f"Mean: ₹{mean:.2f}")
print(f"Median: ₹{median:.2f}")
print(f"Std Dev: ₹{std:.2f}")
print(f"IQR: ₹{IQR:.2f}")
Correlation and Covariance
# Swiggy: delivery time vs customer rating
delivery_time = np.array([25, 30, 35, 40, 45, 50, 55, 60])
rating = np.array([4.8, 4.7, 4.5, 4.3, 4.0, 3.8, 3.5, 3.2])
# Correlation coefficient (-1 to 1)
correlation = np.corrcoef(delivery_time, rating)[0, 1]
print(f"Correlation: {correlation:.3f}") # -0.993 (strong negative correlation)
# Covariance
covariance = np.cov(delivery_time, rating)[0, 1]
print(f"Covariance: {covariance:.2f}")
Handling NaN Values
# Data with missing values
amounts = np.array([2500, np.nan, 1800, 4100, np.nan, 2900])
# Regular mean fails
print(np.mean(amounts)) # nan
# NaN-safe functions
print(np.nanmean(amounts)) # 2825.0 (ignores NaN)
print(np.nanmedian(amounts)) # 2700.0
print(np.nansum(amounts)) # 11300.0
print(np.nanstd(amounts)) # 834.79
Random Sampling (for A/B Testing)
# Randomly assign users to test groups
user_ids = np.arange(1, 10001) # 10,000 users
np.random.shuffle(user_ids)
control_group = user_ids[:5000] # First 5000
test_group = user_ids[5000:] # Last 5000
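# Redefine amounts without NaN values before sampling from it
amounts = np.array([2500, 3200, 1800, 4100, 2900])
# Random sample with replacement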
sample = np.random.choice(amounts, size=100, replace=True)
# Random sample without replacement
sample_unique = np.random.choice(amounts, size=5, replace=False)
For Analysts: Use NumPy for pure numerical calculations (mean, std, correlation). Use Pandas when you need to group by categories, handle missing data with business logic, or work with labeled data.
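The random group assignment shown earlier produces a different split on every run, which makes an A/B analysis hard to reproduce. A minimal sketch of a repeatable split, assuming NumPy's Generator API (np.random.default_rng); the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np

# A seeded generator returns the same sequence on every run.
rng = np.random.default_rng(42)

user_ids = np.arange(1, 10001)  # 10,000 users

# permutation returns a shuffled copy and leaves user_ids unchanged.
shuffled = rng.permutation(user_ids)
control_group = shuffled[:5000]
test_group = shuffled[5000:]

# Sanity check: no user appears in both groups.
overlap = np.intersect1d(control_group, test_group)
print(len(control_group), len(test_group), len(overlap))  # 5000 5000 0
```

Recording the seed alongside the experiment lets anyone regenerate exactly the same groups later.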