Model Evaluation Metrics
Learn to measure how well your model performs
Why Metrics Matter
"Accuracy" isn't always enough. You need the right metric for your problem.
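For example, on a heavily imbalanced dataset a model that always predicts the majority class can post a high accuracy while learning nothing. A small sketch with synthetic labels (95 negatives, 5 positives):

```python
from sklearn.metrics import accuracy_score, f1_score

# Synthetic imbalanced labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))               # 0.95 -- looks great
print(f1_score(y_true, y_pred, zero_division=0))    # 0.0  -- catches no positives
```

The same predictions score 95% on accuracy and 0 on F1, which is why the metric has to match the problem.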
Regression Metrics
R² Score (Coefficient of Determination)
How much of the variance in the target the model explains. 1.0 is perfect, 0 means no better than predicting the mean, and it can even go negative for very poor models:
```python
from sklearn.metrics import r2_score

y_true = [3, 5, 2.5, 7]
y_pred = [2.8, 5.2, 2.3, 6.8]
r2 = r2_score(y_true, y_pred)
print(f"R²: {r2:.3f}")  # 0.987 = excellent
```
Mean Squared Error (MSE)
Penalizes large errors more:
```python
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse:.3f}")
```
Root Mean Squared Error (RMSE)
The square root of MSE, so it is in the same units as the target:
```python
import numpy as np

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.3f}")
```
Mean Absolute Error (MAE)
Average error (less sensitive to outliers):
```python
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.3f}")
```
Classification Metrics
Accuracy
Correct predictions / Total predictions:
```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc:.0%}")  # 75%
```
Warning: Accuracy is misleading with imbalanced data!
Confusion Matrix
```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
print(cm)
#                Predicted
#                 0    1
# Actual  0   [[TN,  FP],
#         1    [FN,  TP]]
```
- TN (True Negative): Correctly predicted negative
- FP (False Positive): Predicted positive, actually negative
- FN (False Negative): Predicted negative, actually positive
- TP (True Positive): Correctly predicted positive
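Precision, recall, and F1 can all be read directly off these four counts. A small sketch, reusing the toy labels from the accuracy example above:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] in row order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3

precision = tp / (tp + fp)                           # 3 / 4 = 0.75
recall = tp / (tp + fn)                              # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)   # 0.75
```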
Precision
Of predicted positive, how many are actually positive?
```python
from sklearn.metrics import precision_score

precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.0%}")
```
Use when: False positives are costly (spam detection)
Recall (Sensitivity)
Of actual positive, how many were predicted positive?
```python
from sklearn.metrics import recall_score

recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.0%}")
```
Use when: False negatives are costly (disease detection)
F1 Score
The harmonic mean of precision and recall:
```python
from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
print(f"F1: {f1:.3f}")
```
Use when: You need both precision and recall
Classification Report
All metrics at once:
```python
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred))
```
ROC Curve and AUC
Visualize model performance at different thresholds:
```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Need probability predictions, not hard labels
y_prob = [0.9, 0.2, 0.8, 0.6, 0.3, 0.85, 0.4, 0.1]

# Calculate ROC curve points and AUC (Area Under the Curve)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)
print(f"AUC: {auc:.3f}")

# Plot
plt.plot(fpr, tpr, label=f'AUC = {auc:.2f}')
plt.plot([0, 1], [0, 1], 'k--')  # Random classifier baseline
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
```
AUC Interpretation:
- 1.0 = Perfect
- 0.5 = Random guessing
- < 0.5 = Worse than random
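AUC also has a useful probabilistic reading: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A small sketch checking this against roc_auc_score on the toy scores above (ties count as half):

```python
from itertools import product
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.8, 0.6, 0.3, 0.85, 0.4, 0.1]

pos = [p for t, p in zip(y_true, y_prob) if t == 1]
neg = [p for t, p in zip(y_true, y_prob) if t == 0]

# Fraction of (positive, negative) pairs ranked correctly
pairs = list(product(pos, neg))
auc_manual = sum(1.0 if p > n else 0.5 if p == n else 0.0
                 for p, n in pairs) / len(pairs)

# Both print 1.0 here: every positive outscores every negative
print(auc_manual, roc_auc_score(y_true, y_prob))
```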
Choosing the Right Metric
| Problem | Metric | Why |
|---|---|---|
| Balanced classes | Accuracy | Simple, works well |
| Imbalanced classes | F1, AUC | Accuracy is misleading |
| False positive costly | Precision | Minimize FP |
| False negative costly | Recall | Minimize FN |
| Regression | RMSE, MAE | Measures error size |
Complete Example
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, classification_report,
                             confusion_matrix)

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predictions: hard labels and positive-class probabilities
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# All metrics
print("=== Model Evaluation ===")
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.1%}")
print(f"Precision: {precision_score(y_test, y_pred):.1%}")
print(f"Recall:    {recall_score(y_test, y_pred):.1%}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.3f}")
print(f"AUC:       {roc_auc_score(y_test, y_prob):.3f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
```
Key Points
- Accuracy alone is not enough
- Use confusion matrix to understand errors
- Precision: Minimize false positives
- Recall: Minimize false negatives
- F1: Balance of precision and recall
- AUC: Overall model quality
- Choose metric based on business problem
What's Next?
Learn about Cross-Validation.