Model Evaluation Metrics
Learn to measure how well your model performs
Why Metrics Matter
"Accuracy" isn't always enough. You need the right metric for your problem.
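For example, on a heavily imbalanced dataset a model that always predicts the majority class can post a high accuracy while learning nothing. A small sketch with synthetic labels (95 negatives, 5 positives):

```python
from sklearn.metrics import accuracy_score, f1_score

# Synthetic imbalanced labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))               # 0.95 -- looks great
print(f1_score(y_true, y_pred, zero_division=0))    # 0.0  -- catches no positives
```

The same predictions score 95% on accuracy and 0 on F1, which is why the metric has to match the problem.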
Regression Metrics
R² Score (Coefficient of Determination)
How much of the variance in the target the model explains. 1.0 is perfect, 0 means no better than predicting the mean, and it can even go negative for very poor models:
```python
from sklearn.metrics import r2_score

y_true = [3, 5, 2.5, 7]
y_pred = [2.8, 5.2, 2.3, 6.8]
r2 = r2_score(y_true, y_pred)
print(f"R²: {r2:.3f}")  # 0.987 = excellent
```
Mean Squared Error (MSE)
Penalizes large errors more:
```python
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse:.3f}")
```
Root Mean Squared Error (RMSE)
The square root of MSE, so it is in the same units as the target:
```python
import numpy as np

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.3f}")
```
Mean Absolute Error (MAE)
Average error (less sensitive to outliers):
```python
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.3f}")
```
Classification Metrics
Accuracy
Correct predictions / Total predictions:
```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc:.0%}")  # 75%
```
Warning: Accuracy is misleading with imbalanced data!
Confusion Matrix
```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
print(cm)
#                Predicted
#                 0    1
# Actual  0   [[TN,  FP],
#         1    [FN,  TP]]
```
- TN (True Negative): Correctly predicted negative
- FP (False Positive): Predicted positive, actually negative
- FN (False Negative): Predicted negative, actually positive
- TP (True Positive): Correctly predicted positive
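Precision, recall, and F1 can all be read directly off these four counts. A small sketch, reusing the toy labels from the accuracy example above:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] in row order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3

precision = tp / (tp + fp)                           # 3 / 4 = 0.75
recall = tp / (tp + fn)                              # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)   # 0.75
```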
Precision
Of predicted positive, how many are actually positive?
```python
from sklearn.metrics import precision_score

precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.0%}")
```
Use when: False positives are costly (spam detection)
Recall (Sensitivity)
Of actual positive, how many were predicted positive?
```python
from sklearn.metrics import recall_score

recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.0%}")
```
Use when: False negatives are costly (disease detection)
F1 Score
The harmonic mean of precision and recall:
```python
from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
print(f"F1: {f1:.3f}")
```
Use when: You need both precision and recall
Classification Report
All metrics at once:
```python
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred))
```
ROC Curve and AUC
Visualize model performance at different thresholds:
```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Need probability predictions, not hard labels
y_prob = [0.9, 0.2, 0.8, 0.6, 0.3, 0.85, 0.4, 0.1]

# Calculate ROC curve points and AUC (Area Under the Curve)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)
print(f"AUC: {auc:.3f}")

# Plot
plt.plot(fpr, tpr, label=f'AUC = {auc:.2f}')
plt.plot([0, 1], [0, 1], 'k--')  # Random classifier baseline
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
```
AUC Interpretation:
- 1.0 = Perfect
- 0.5 = Random guessing
- < 0.5 = Worse than random
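AUC also has a useful probabilistic reading: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A small sketch checking this against roc_auc_score on the toy scores above (ties count as half):

```python
from itertools import product
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.8, 0.6, 0.3, 0.85, 0.4, 0.1]

pos = [p for t, p in zip(y_true, y_prob) if t == 1]
neg = [p for t, p in zip(y_true, y_prob) if t == 0]

# Fraction of (positive, negative) pairs ranked correctly
pairs = list(product(pos, neg))
auc_manual = sum(1.0 if p > n else 0.5 if p == n else 0.0
                 for p, n in pairs) / len(pairs)

# Both print 1.0 here: every positive outscores every negative
print(auc_manual, roc_auc_score(y_true, y_prob))
```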
Choosing the Right Metric
| Problem | Metric | Why |
|---|---|---|
| Balanced classes | Accuracy | Simple, works well |
| Imbalanced classes | F1, AUC | Accuracy is misleading |
| False positive costly | Precision | Minimize FP |
| False negative costly | Recall | Minimize FN |
| Regression | RMSE, MAE | Measures error size |
Complete Example
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, classification_report,
                             confusion_matrix)

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predictions: hard labels and positive-class probabilities
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# All metrics
print("=== Model Evaluation ===")
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.1%}")
print(f"Precision: {precision_score(y_test, y_pred):.1%}")
print(f"Recall:    {recall_score(y_test, y_pred):.1%}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.3f}")
print(f"AUC:       {roc_auc_score(y_test, y_prob):.3f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
```
Key Points
- Accuracy alone is not enough
- Use confusion matrix to understand errors
- Precision: Minimize false positives
- Recall: Minimize false negatives
- F1: Balance of precision and recall
- AUC: Overall model quality
- Choose metric based on business problem
What's Next?
Learn about Cross-Validation.