Synkoc AI/ML Internship · Week 3 · Lesson 10 of 13
Model Evaluation
How do you know if your ML model is actually good? Master the Confusion Matrix, Precision, Recall, F1-Score, Cross-Validation, and ROC-AUC — the tools every data scientist uses to measure, trust, and compare models.
📊 Confusion Matrix
⚙ Precision / Recall
🔁 Cross-Validation
📈 ROC-AUC
🧑‍💻
Synkoc Instructor
AI/ML Professional · Bangalore
⏳ ~55 minutes
🟢 Intermediate
Why Evaluation Matters
A model that reports 99% accuracy sounds great. But what if 99% of your data belongs to a single class? A naive model that always predicts that class also scores 99% accuracy, and it is completely useless.
🚫
Accuracy is Not Enough
In fraud detection, 99.9% of transactions are legitimate. A model predicting "not fraud" always gets 99.9% accuracy but catches zero fraudsters. Accuracy hides this.
fraud dataset: 0.1% fraud
naive model: 99.9% accuracy
frauds caught: 0 ← disaster
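The accuracy trap above takes only a few lines to reproduce. A sketch using a hypothetical imbalanced dataset and scikit-learn's DummyClassifier standing in for the naive model:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical fraud dataset: 10 frauds in 10,000 transactions (0.1%)
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))
y = np.zeros(10_000, dtype=int)
y[:10] = 1  # the rare fraud class

# A "model" that always predicts the majority class (not fraud)
naive = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = naive.predict(X)

print(f"accuracy: {accuracy_score(y, y_pred):.3f}")            # 0.999 — looks great
print(f"recall (frauds caught): {recall_score(y, y_pred):.3f}")  # 0.000 — disaster
```

The 99.9% accuracy is real, yet recall shows the model catches zero frauds, which is exactly the gap accuracy hides.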
⚖️
The Four Evaluation Tools
Confusion Matrix reveals exactly what the model is getting right and wrong. Precision and Recall measure quality for each class. Cross-Validation ensures the score generalises. ROC-AUC measures ranking ability.
from sklearn.metrics import (
  confusion_matrix,
  classification_report,
  roc_auc_score)
from sklearn.model_selection import cross_val_score
🎯
Train vs Test vs Cross-Val
Training score tells you nothing about real performance. Test score depends on one lucky split. Cross-validation averages across K splits for a far more reliable estimate.
train score: 0.98 (overfitting?)
test score: 0.72 (one split)
cv score: 0.76 +/- 0.03 (reliable)
📊
Real Business Decisions
Medical diagnosis: missing a cancer (False Negative) is catastrophic. Spam filter: flagging a real email as spam (False Positive) is annoying. The right metric depends on which mistake is more costly.
medical: maximise Recall
spam filter: maximise Precision
balanced: use F1-Score
Chapter 1 of 4
01
Confusion Matrix
The foundation of all classification evaluation. A 2x2 grid that shows exactly where your model is correct, and exactly where it is making each type of mistake.
The Confusion Matrix
For binary classification, a 2x2 matrix. Rows are actual classes. Columns are predicted classes. Four cells — TP, FP, FN, TN — tell the complete story.
                    Predicted Positive    Predicted Negative
Actual Positive         TP = 85               FN = 15
Actual Negative         FP = 8                TN = 92
True Positive
Model said Positive. Actually Positive. Correct!
False Negative
Model said Negative. Actually Positive. Missed it!
False Positive
Model said Positive. Actually Negative. False alarm!
True Negative
Model said Negative. Actually Negative. Correct!
💡Memory trick: True/False = was the prediction correct? Positive/Negative = what did the model predict? TP and TN are correct. FP and FN are both wrong — in opposite ways.
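The grid above can be reproduced with sklearn's confusion_matrix. This sketch uses hypothetical labels chosen to match the 85/15/8/92 counts; note that sklearn orders rows and columns as [negative, positive], so ravel() yields TN, FP, FN, TP:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels reproducing the 85/15/8/92 grid above
y_true = np.array([1] * 100 + [0] * 100)   # 100 actual positives, 100 actual negatives
y_pred = np.array([1] * 85 + [0] * 15      # 85 caught (TP), 15 missed (FN)
                + [1] * 8 + [0] * 92)      # 8 false alarms (FP), 92 correct (TN)

# sklearn convention: rows = actual, columns = predicted, class order [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)  # 85 15 8 92
```

Unpacking with ravel() this way is the standard idiom for getting the four counts by name.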
Chapter 2 of 4
02
Precision, Recall & F1
Three metrics built from the confusion matrix. Each answers a different question about classification quality. Together they give the complete picture.
Precision, Recall & F1
Three complementary metrics. Accuracy ignores class imbalance. These three do not.
🎯
Precision
Of all positive predictions, what fraction were actually positive?
TP / (TP + FP)

= 85 / (85+8)
= 0.914
Use when FP is costly: spam filters
🔎
Recall
Of all actual positives, what fraction did the model find?
TP / (TP + FN)

= 85 / (85+15)
= 0.850
Use when FN is costly: cancer screening
⚖️
F1-Score
Harmonic mean of Precision and Recall. Balances both.
2 * P * R / (P + R)

= 2 * 0.914 * 0.850 / (0.914 + 0.850)
= 0.881
Best balanced metric for imbalanced data
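All three formulas can be checked directly from the counts in the matrix, a quick sketch using the same hypothetical 85/15/8/92 grid:

```python
# Counts from the confusion matrix above (hypothetical example)
tp, fn, fp, tn = 85, 15, 8, 92

precision = tp / (tp + fp)                        # 85 / 93
recall = tp / (tp + fn)                           # 85 / 100
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.914 recall=0.850 f1=0.881
```

Because F1 is a harmonic mean, it sits closer to the smaller of the two numbers, so a model cannot hide a weak recall behind a strong precision.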
📄
classification_report
Prints all three metrics for every class in one function call.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
One call gives the full picture
Chapter 3 of 4
03
Cross-Validation
A single train-test split gives one score. Cross-validation gives you the mean and standard deviation across K splits — a statistically reliable estimate of true model performance.
K-Fold Cross-Validation
Split data into K folds. Train on K-1 folds, test on the remaining fold. Repeat K times. Average the K scores. The standard is K=5 or K=10.
🔁
cross_val_score
One function call handles the entire K-fold process automatically. It splits the data, trains the model K times, evaluates each fold, and returns an array of K scores. You then take the mean and standard deviation.
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    model, X, y,
    cv=5,              # 5 folds
    scoring='f1'       # metric to use
)

print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
# F1: 0.881 +/- 0.024  ← mean +/- standard deviation
💡Standard deviation matters: 0.88 +/- 0.02 is a stable model. 0.88 +/- 0.15 is an unstable model that may perform poorly in production even though the mean looks fine.
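What cross_val_score does internally can be sketched as an explicit KFold loop; when both use the same splitter, the scores match exactly. (The dataset here is synthetic, from make_classification. Note that passing cv=5 to a classifier actually uses stratified folds, which is why the sketch shares one KFold object.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=42)
model = LogisticRegression(max_iter=1000)

# Manual K-fold: train on K-1 folds, score the held-out fold, repeat K times
kf = KFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

# The one-liner, given the identical splitter
auto_scores = cross_val_score(model, X, y, cv=kf, scoring="f1")
print(np.allclose(manual_scores, auto_scores))  # True
```

Seeing the loop spelled out makes it clear why CV is slower than a single split: the model is trained K times from scratch.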
Chapter 4 of 4
04
ROC-AUC
The Receiver Operating Characteristic curve and Area Under the Curve. Measures how well the model ranks positives above negatives across all possible thresholds.
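A useful way to read AUC: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. A small sketch with hypothetical scores checks that pair-counting interpretation against roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical predicted probabilities for 3 negatives and 2 positives
y_true = np.array([0, 0, 0, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.30])

auc = roc_auc_score(y_true, y_prob)

# Count positive/negative pairs where the positive is ranked higher
# (ties count half). 4 wins out of 6 pairs -> AUC = 0.667
pairs = [(p, n) for p in y_prob[y_true == 1] for n in y_prob[y_true == 0]]
manual = np.mean([1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs])
print(f"auc={auc:.3f} pair-count={manual:.3f}")
```

Because AUC depends only on ranking, it is unchanged by any monotone rescaling of the scores and never requires picking a threshold.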
Complete Model Evaluation Report
full_evaluation_report.py — Full Report
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
from sklearn.model_selection import cross_val_score

# Assume model is trained, y_test and y_pred are ready
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, y_pred))

print("=== Classification Report ===")
print(classification_report(y_test, y_pred))

# ROC-AUC requires probabilities, not hard predictions
y_prob = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC: {auc:.3f}")

# Cross-validated F1 for reliability
cv_scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"CV F1: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
This is the complete evaluation workflow used by every data science team. Run confusion_matrix first to see the types of mistakes. Use classification_report for precision, recall, and F1 per class. Use ROC-AUC for ranking quality (note that predict_proba returns probabilities, not 0/1 predictions). Finish with cross-validation for a reliable performance estimate.
Lesson Summary
You can now measure, verify, and trust any classification model. Here is your complete evaluation toolkit:
📊
Confusion Matrix
The foundation. TP, FP, FN, TN tell exactly what the model gets right and wrong. Always start here before any other metric.
🎯
Precision & Recall
Precision: of predictions, how many correct? Recall: of actual positives, how many found? Choose based on business cost of each error type.
⚖️
F1-Score
Harmonic mean of Precision and Recall. The go-to metric for imbalanced datasets. Use classification_report to get all metrics at once.
🔁
Cross-Validation
Always use cross_val_score(cv=5) instead of a single test score. Report mean +/- std. High std means the model is unstable.
🏆
Week 3 Complete!
You have mastered the full ML cycle: Intro to ML, Supervised Learning, Unsupervised Learning, and Model Evaluation. Open the Model Evaluation Lab, then the quiz, and Week 3 is done. Week 4: Capstone Project awaits.
✓ Video — Done
✎ Practical Lab — Next
❓ Quiz — Then Week 4
Synkoc IT Services · Bangalore · support@synkoc.com · +91-9019532023