Synkoc AI/ML Internship · Week 3 · Lesson 10 of 13
Model Evaluation
How do you know if your ML model is actually good? Master the Confusion Matrix, Precision, Recall, F1-Score, Cross-Validation, and ROC-AUC — the tools every data scientist uses to measure, trust, and compare models.
📊 Confusion Matrix
⚙ Precision / Recall
🔁 Cross-Validation
📈 ROC-AUC
🧑‍💻
Synkoc Instructor
AI/ML Professional · Bangalore
⏳ ~55 minutes
🟢 Intermediate
Why Evaluation Matters
A model that reports 99% accuracy sounds great. But what if 99% of your data belongs to a single class? A naive model that always predicts that class also scores 99% accuracy, and it is completely useless.
🚫
Accuracy is Not Enough
In fraud detection, 99.9% of transactions are legitimate. A model predicting "not fraud" always gets 99.9% accuracy but catches zero fraudsters. Accuracy hides this.
fraud dataset: 0.1% fraud
naive model: 99.9% accuracy
frauds caught: 0 ← disaster
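The accuracy trap above takes only a few lines to reproduce. A sketch using a hypothetical imbalanced dataset and scikit-learn's DummyClassifier standing in for the naive model:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical fraud dataset: 10 frauds in 10,000 transactions (0.1%)
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))
y = np.zeros(10_000, dtype=int)
y[:10] = 1  # the rare fraud class

# A "model" that always predicts the majority class (not fraud)
naive = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = naive.predict(X)

print(f"accuracy: {accuracy_score(y, y_pred):.3f}")            # 0.999 — looks great
print(f"recall (frauds caught): {recall_score(y, y_pred):.3f}")  # 0.000 — disaster
```

The 99.9% accuracy is real, yet recall shows the model catches zero frauds, which is exactly the gap accuracy hides.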
⚖️
The Four Evaluation Tools
Confusion Matrix reveals exactly what the model is getting right and wrong. Precision and Recall measure quality for each class. Cross-Validation ensures the score generalises. ROC-AUC measures ranking ability.
from sklearn.metrics import (
  confusion_matrix,
  classification_report,
  roc_auc_score)
from sklearn.model_selection import cross_val_score
🎯
Train vs Test vs Cross-Val
Training score tells you nothing about real performance. Test score depends on one lucky split. Cross-validation averages across K splits for a far more reliable estimate.
train score: 0.98 (overfitting?)
test score: 0.72 (one split)
cv score: 0.76 +/- 0.03 (reliable)
📊
Real Business Decisions
Medical diagnosis: missing a cancer (False Negative) is catastrophic. Spam filter: flagging a real email as spam (False Positive) is annoying. The right metric depends on which mistake is more costly.
medical: maximise Recall
spam filter: maximise Precision
balanced: use F1-Score
Chapter 1 of 4
01
Confusion Matrix
The foundation of all classification evaluation. A 2x2 grid that shows exactly where your model is correct, and exactly where it is making each type of mistake.
The Confusion Matrix
For binary classification, a 2x2 matrix. Rows are actual classes. Columns are predicted classes. Four cells — TP, FP, FN, TN — tell the complete story.
                    Predicted Positive    Predicted Negative
Actual Positive         TP = 85               FN = 15
Actual Negative         FP = 8                TN = 92
True Positive
Model said Positive. Actually Positive. Correct!
False Negative
Model said Negative. Actually Positive. Missed it!
False Positive
Model said Positive. Actually Negative. False alarm!
True Negative
Model said Negative. Actually Negative. Correct!
💡Memory trick: True/False = was the prediction correct? Positive/Negative = what did the model predict? TP and TN are correct. FP and FN are both wrong — in opposite ways.
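The grid above can be reproduced with sklearn's confusion_matrix. This sketch uses hypothetical labels chosen to match the 85/15/8/92 counts; note that sklearn orders rows and columns as [negative, positive], so ravel() yields TN, FP, FN, TP:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels reproducing the 85/15/8/92 grid above
y_true = np.array([1] * 100 + [0] * 100)   # 100 actual positives, 100 actual negatives
y_pred = np.array([1] * 85 + [0] * 15      # 85 caught (TP), 15 missed (FN)
                + [1] * 8 + [0] * 92)      # 8 false alarms (FP), 92 correct (TN)

# sklearn convention: rows = actual, columns = predicted, class order [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)  # 85 15 8 92
```

Unpacking with ravel() this way is the standard idiom for getting the four counts by name.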
Chapter 2 of 4
02
Precision, Recall & F1
Three metrics built from the confusion matrix. Each answers a different question about classification quality. Together they give the complete picture.
Precision, Recall & F1
Three complementary metrics. Accuracy ignores class imbalance. These three do not.
🎯
Precision
Of all positive predictions, what fraction were actually positive?
TP / (TP + FP)

= 85 / (85+8)
= 0.914
Use when FP is costly: spam filters
🔎
Recall
Of all actual positives, what fraction did the model find?
TP / (TP + FN)

= 85 / (85+15)
= 0.850
Use when FN is costly: cancer screening
⚖️
F1-Score
Harmonic mean of Precision and Recall. Balances both.
2 * P * R / (P + R)

= 2 * 0.914 * 0.850 / (0.914 + 0.850)
= 0.881
Best balanced metric for imbalanced data
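All three formulas can be checked directly from the counts in the matrix, a quick sketch using the same hypothetical 85/15/8/92 grid:

```python
# Counts from the confusion matrix above (hypothetical example)
tp, fn, fp, tn = 85, 15, 8, 92

precision = tp / (tp + fp)                        # 85 / 93
recall = tp / (tp + fn)                           # 85 / 100
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.914 recall=0.850 f1=0.881
```

Because F1 is a harmonic mean, it sits closer to the smaller of the two numbers, so a model cannot hide a weak recall behind a strong precision.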
📄
classification_report
Prints all three metrics for every class in one function call.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
One call gives the full picture
Chapter 3 of 4
03
Cross-Validation
A single train-test split gives one score. Cross-validation gives you the mean and standard deviation across K splits — a statistically reliable estimate of true model performance.
K-Fold Cross-Validation
Split data into K folds. Train on K-1 folds, test on the remaining fold. Repeat K times. Average the K scores. The standard is K=5 or K=10.
🔁
cross_val_score
One function call handles the entire K-fold process automatically. It splits the data, trains the model K times, evaluates each fold, and returns an array of K scores. You then take the mean and standard deviation.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    model, X, y,
    cv=5,              # 5 folds
    scoring='f1'       # metric to use
)

print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
# F1: 0.881 +/- 0.024  ← mean +/- standard deviation
💡Standard deviation matters: 0.88 +/- 0.02 is a stable model. 0.88 +/- 0.15 is an unstable model that may perform poorly in production even though the mean looks fine.
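What cross_val_score does internally can be sketched as an explicit KFold loop; when both use the same splitter, the scores match exactly. (The dataset here is synthetic, from make_classification. Note that passing cv=5 to a classifier actually uses stratified folds, which is why the sketch shares one KFold object.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=42)
model = LogisticRegression(max_iter=1000)

# Manual K-fold: train on K-1 folds, score the held-out fold, repeat K times
kf = KFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

# The one-liner, given the identical splitter
auto_scores = cross_val_score(model, X, y, cv=kf, scoring="f1")
print(np.allclose(manual_scores, auto_scores))  # True
```

Seeing the loop spelled out makes it clear why CV is slower than a single split: the model is trained K times from scratch.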
Chapter 4 of 4
04
ROC-AUC
The Receiver Operating Characteristic curve and Area Under the Curve. Measures how well the model ranks positives above negatives across all possible thresholds.
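A useful way to read AUC: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A small sketch with hypothetical scores checks that pair-counting interpretation against roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical predicted probabilities for 3 negatives and 2 positives
y_true = np.array([0, 0, 0, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.30])

auc = roc_auc_score(y_true, y_prob)

# Count positive/negative pairs where the positive is ranked higher
# (ties count half). 4 wins out of 6 pairs -> AUC = 0.667
pairs = [(p, n) for p in y_prob[y_true == 1] for n in y_prob[y_true == 0]]
manual = np.mean([1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs])
print(f"auc={auc:.3f} pair-count={manual:.3f}")
```

Because AUC depends only on ranking, it is unchanged by any monotone rescaling of the scores and never requires picking a threshold.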
Complete Model Evaluation Report
full_evaluation_report.py — Full Report
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
from sklearn.model_selection import cross_val_score

# Assume model is trained, y_test and y_pred are ready
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, y_pred))

print("=== Classification Report ===")
print(classification_report(y_test, y_pred))

# ROC-AUC requires probabilities, not hard predictions
y_prob = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC: {auc:.3f}")

# Cross-validated F1 for reliability
cv_scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"CV F1: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
This is the complete evaluation workflow used by every data science team. Run confusion_matrix first to see the types of mistakes. Use classification_report for precision, recall, and F1 per class. Use ROC-AUC for ranking quality (note that predict_proba returns probabilities, not 0/1 predictions). Finish with cross-validation for a reliable performance estimate.
Lesson Summary
You can now measure, verify, and trust any classification model. Here is your complete evaluation toolkit:
📊
Confusion Matrix
The foundation. TP, FP, FN, TN tell exactly what the model gets right and wrong. Always start here before any other metric.
🎯
Precision & Recall
Precision: of predictions, how many correct? Recall: of actual positives, how many found? Choose based on business cost of each error type.
⚖️
F1-Score
Harmonic mean of Precision and Recall. The go-to metric for imbalanced datasets. Use classification_report to get all metrics at once.
🔁
Cross-Validation
Always use cross_val_score(cv=5) instead of a single test score. Report mean +/- std. High std means the model is unstable.
🏆
Week 3 Complete!
You have mastered the full ML cycle: Intro to ML, Supervised Learning, Unsupervised Learning, and Model Evaluation. Open the Model Evaluation Lab, then the quiz, and Week 3 is done. Week 4: Capstone Project awaits.
✓ Video — Done
✎ Practical Lab — Next
❓ Quiz — Then Week 4
Synkoc IT Services · Bangalore · support@synkoc.com · +91-9019532023