Model Evaluation
Model Evaluation
π How to check Model Evaluation
Choosing the Optimal Model
How to Choose a Model
Question
Which model is the best one?
Common answers:
- Smallest RSS
- Largest \(R^2\)
Observation
- The model containing all predictors will always have:
- The smallest RSS
- The largest \(R^2\)
- Adding more variables seems to have nothing to lose.
- One might think useless variables will simply get coefficients close to zero, so there is no harm.
However, this reasoning is not correct for generalization.
Training vs Test Error
- Our goal is low test error, not low training error.
- Training error is usually a poor estimate of test error.
- A model with many variables may fit training data well but perform poorly on unseen data.
Overfitting Risk
- Using all variables can lead to overfitting.
- The model may capture noise instead of true signal.
- This increases variance and harms prediction on new data.
Key Principle
- Model selection should be based on test error (generalization).
- RSS and \(R^2\) from training data alone are not reliable for comparing models with different complexity.
Evaluation Strategy
Goal
We want to verify that the model learns general patterns, not memorization.
Hold-out Test Data
- Set aside a portion of data for testing / evaluation.
- Test data must NOT be used during training.
- Example: In Kaggle, test labels are hidden.
How to Split Data
Common Approach
- Random uniform split
- Typical ratio: 10%β20% test
Problem-Dependent Approach
- Time-series split:
- Train on past
- Evaluate on future
- Tests ability to predict future from past.
Key Takeaway
- Evaluate generalization, not memorization.
- Keep a strictly separate test set.
- Proper splitting is critical.
Choosing Hyperparameters
Why Hyperparameters Matter
- Many ML models require hyperparameters.
- Examples:
- Model complexity (number of predictors \(p\))
- Learning rate
- Regularization strength
How to Choose Hyperparameters
- Use data-driven selection:
- Train multiple models
- Compare performance
- Choose best
Validation Set
- Needed for hyperparameter tuning.
- Do NOT use test set for tuning.
- Split training data β training + validation.
Key Takeaway
- Hyperparameters control model complexity and behavior.
- Use validation set for tuning.
- Keep test set only for final evaluation.
Cross Validation
Procedure
Step 1: Train using training set
Fit model using training data (e.g., 70%).
Step 2: Evaluate using validation set
Measure performance on validation data (e.g., 20%).
Step 3: Hyperparameter tuning
Repeat with different hyperparameters, choose best.
Step 4: Retrain with best hyperparameters
Train using training + validation data.
Step 5: Final evaluation
Evaluate on test set (e.g., 10%).
Important Rule
- The test set must NOT be used until final evaluation.
Typical Split
- Training: 70%
- Validation: 20%
- Test: 10%
Key Takeaway
- Validation β model selection & tuning
- Test β final unbiased evaluation
- Ensures proper measurement of generalization
K-Fold Cross Validation
Idea
- Split dataset into K folds.
- Train K times:
- K-1 folds β training
- 1 fold β validation
- Each fold used once as validation.
- Final performance = average validation score.
Procedure
- Split data into \(K\) folds.
- For each \(i = 1,\dots,K\):
- Train on all folds except \(i\)
- Evaluate on fold \(i\)
- Average results:
Example (K = 5)
- Fold 1 β validation, others β training
- Fold 2 β validation, others β training
- Fold 3 β validation, others β training
- Fold 4 β validation, others β training
- Fold 5 β validation, others β training
Each fold is used exactly once as validation.
Key Takeaway
- Uses all data for training and validation (in different rounds).
- Produces stable and reliable estimate of test error.
- Common choices:
- \[K = 5\]
- \[K = 10\]
This post is licensed under CC BY 4.0 by the author.