
Model Evaluation Metrics

Learn to measure how well your model performs

Why Do Metrics Matter?

"Accuracy" isn't always enough. You need the right metric for your problem.

Regression Metrics

R² Score (Coefficient of Determination)

The proportion of variance in the target that the model explains. A perfect model scores 1; a model no better than predicting the mean scores 0, and worse models can score below 0:

code.py
from sklearn.metrics import r2_score

y_true = [3, 5, 2.5, 7]
y_pred = [2.8, 5.2, 2.3, 6.8]

r2 = r2_score(y_true, y_pred)
print(f"R²: {r2:.3f}")  # ≈ 0.987, close to 1 = excellent

Mean Squared Error (MSE)

Penalizes large errors more:

code.py
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse:.3f}")

Root Mean Squared Error (RMSE)

Same units as target:

code.py
import numpy as np

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.3f}")

Mean Absolute Error (MAE)

Average error (less sensitive to outliers):

code.py
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.3f}")
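To see the difference between MAE and RMSE in practice, here is a small sketch (the "outlier" prediction values are chosen purely for illustration). One large miss inflates RMSE far more than MAE, because squaring amplifies big errors:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3, 5, 2.5, 7]
y_small = [2.8, 5.2, 2.3, 6.8]     # every prediction off by 0.2
y_outlier = [2.8, 5.2, 2.3, 12.0]  # same, but one large miss

rmse_small = np.sqrt(mean_squared_error(y_true, y_small))
mae_small = mean_absolute_error(y_true, y_small)
rmse_out = np.sqrt(mean_squared_error(y_true, y_outlier))
mae_out = mean_absolute_error(y_true, y_outlier)

print(f"Small errors: RMSE={rmse_small:.2f}, MAE={mae_small:.2f}")  # both 0.20
print(f"One outlier:  RMSE={rmse_out:.2f}, MAE={mae_out:.2f}")      # RMSE jumps much more
```

If your data contains outliers you do not want to dominate the score, prefer MAE; if large errors are especially bad for your application, RMSE's sensitivity is a feature, not a bug.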

Classification Metrics

Accuracy

Correct predictions / Total predictions:

code.py
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc:.0%}")  # 75%

Warning: Accuracy is misleading with imbalanced data!
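A quick sketch of why, using a made-up dataset with 95% negatives: a "model" that always predicts the majority class scores 95% accuracy while catching zero positives (the `zero_division=0` argument suppresses the warning sklearn raises when no positives are predicted):

```python
from sklearn.metrics import accuracy_score, f1_score

# 95 negatives, 5 positives: heavily imbalanced
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predicts the majority class

print(f"Accuracy: {accuracy_score(y_true, y_pred):.0%}")              # 95%, yet useless
print(f"F1:       {f1_score(y_true, y_pred, zero_division=0):.2f}")   # 0.00
```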

Confusion Matrix

code.py
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
print(cm)
#          Predicted
#           0    1
# Actual 0 [[TN, FP],
# Actual 1  [FN, TP]]

  • TN (True Negative): Correctly predicted negative
  • FP (False Positive): Predicted positive, actually negative
  • FN (False Negative): Predicted negative, actually positive
  • TP (True Positive): Correctly predicted positive
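The four cells can be unpacked directly with `.ravel()`, which flattens the 2×2 matrix in row order (TN, FP, FN, TP for binary labels):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] into four scalars
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=1, TP=3
```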

Precision

Of predicted positive, how many are actually positive?

code.py
from sklearn.metrics import precision_score

precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.0%}")

Use when: False positives are costly (spam detection)

Recall (Sensitivity)

Of actual positive, how many were predicted positive?

code.py
from sklearn.metrics import recall_score

recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.0%}")

Use when: False negatives are costly (disease detection)

F1 Score

Balance between precision and recall:

code.py
from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
print(f"F1: {f1:.3f}")

Use when: You need both precision and recall
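"Balance" here means the harmonic mean, which you can verify by hand against sklearn's result using the same labels as above:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP)
r = recall_score(y_true, y_pred)     # TP / (TP + FN)
f1_manual = 2 * p * r / (p + r)      # harmonic mean of precision and recall

print(f"F1 (manual):  {f1_manual:.3f}")
print(f"F1 (sklearn): {f1_score(y_true, y_pred):.3f}")
```

The harmonic mean punishes imbalance: a model with precision 1.0 but recall 0.1 gets F1 ≈ 0.18, not the 0.55 an arithmetic mean would suggest.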

Classification Report

All metrics at once:

code.py
from sklearn.metrics import classification_report

print(classification_report(y_true, y_pred))

ROC Curve and AUC

Visualize model performance at different thresholds:

code.py
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Need probability predictions
y_prob = [0.9, 0.2, 0.8, 0.6, 0.3, 0.85, 0.4, 0.1]

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Calculate AUC (Area Under Curve)
auc = roc_auc_score(y_true, y_prob)
print(f"AUC: {auc:.3f}")

# Plot
plt.plot(fpr, tpr, label=f'AUC = {auc:.2f}')
plt.plot([0, 1], [0, 1], 'k--')  # Random classifier
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

AUC Interpretation:

  • 1.0 = Perfect
  • 0.5 = Random guessing
  • < 0.5 = Worse than random
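Each point on the ROC curve corresponds to one classification threshold. A small sketch using the same labels and probabilities as above shows the precision/recall trade-off as the threshold moves:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_prob = np.array([0.9, 0.2, 0.8, 0.6, 0.3, 0.85, 0.4, 0.1])

# Lower thresholds predict positive more eagerly: recall rises, precision falls
for t in [0.3, 0.5, 0.7]:
    y_pred = (y_prob >= t).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

This is why AUC is useful: it summarizes performance across all thresholds, whereas accuracy, precision, and recall each describe only one.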

Choosing the Right Metric

Problem                  Metric      Why
Balanced classes         Accuracy    Simple, works well
Imbalanced classes       F1, AUC     Accuracy is misleading
False positive costly    Precision   Minimize FP
False negative costly    Recall      Minimize FN
Regression               RMSE, MAE   Measures error size

Complete Example

code.py
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, classification_report,
                             confusion_matrix)
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Train
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# All metrics
print("=== Model Evaluation ===")
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.1%}")
print(f"Precision: {precision_score(y_test, y_pred):.1%}")
print(f"Recall:    {recall_score(y_test, y_pred):.1%}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.3f}")
print(f"AUC:       {roc_auc_score(y_test, y_prob):.3f}")

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

Key Points

  • Accuracy alone is not enough
  • Use confusion matrix to understand errors
  • Precision: Minimize false positives
  • Recall: Minimize false negatives
  • F1: Balance of precision and recall
  • AUC: Overall model quality
  • Choose metric based on business problem

What's Next?

Learn about Cross-Validation.