Pruning of Decision Trees
🌳 Decision Trees — Overfitting and Pruning (Complete Reconstruction)
This document reconstructs the provided slides with minimal summarization, preserving equations, algorithm flow, and visual interpretation.
1️⃣ Overfitting in Decision Trees
A decision tree can easily overfit.
Example shown:
- A large region is predicted based on only one training point
- This leads to very irregular decision regions
- The model memorizes noise instead of learning structure
Result:
- Extremely low training error
- Very poor generalization
- High variance model
2️⃣ Model Selection Perspective
To reduce overfitting, we move toward simpler models.
Key ideas:
- Fewer splits → smoother boundary
- Simpler tree → higher bias, lower variance
- Complex tree → low bias, high variance
Trade-off:
| Property | Small Tree | Large Tree |
|---|---|---|
| Complexity | Low | High |
| Boundary | Smooth | Irregular |
| Bias | High | Low |
| Variance | Low | High |
| Risk | Underfitting | Overfitting |
3️⃣ Handling Overfitting
Strategy 1: Build a smaller tree
- Fewer regions \(R_1,\dots,R_J\)
- Reduces variance
- Slight increase in bias
- Often improves interpretability
However:
A weak early split might later lead to a strong split.
This is due to the greedy nature of tree building.
Thus, stopping early may be short-sighted.
4️⃣ Grow Large Tree then Prune
Alternative strategy:
- Grow a very large tree \(T_0\)
- Prune it back to smaller trees
Weakest Link (Cost-Complexity) Pruning
We define a sequence of trees indexed by \(\alpha\):
\[\sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|\]Where:
number of leaf nodes $$ T $$ - controls complexity penalty \(\alpha\)
- Larger → smaller tree \(\alpha\)
- Smaller → larger tree \(\alpha\)
Goal:
Find subtree minimizing penalized loss.
Pruning works by:
- Merging leaf nodes
- Removing weakest split
5️⃣ Tree Building with Pruning
Algorithm:
- Grow large tree using recursive binary splitting
- Apply weakest link pruning to obtain sequence of subtrees
- Use K-fold cross-validation to select \(\alpha\)
- Choose subtree corresponding to best \(\alpha\)
6️⃣ Example — Unpruned Tree
A very large regression tree \(T_0\) is first grown using the training data.
Characteristics:
- Many splits
- Low training error
- High risk of overfitting
7️⃣ Example — Selecting Tree Size
We evaluate tree size using:
- Training error
- Cross-validation error
- Test error
Observation:
Best validation performance occurs at 3 leaf nodes.
Thus, we prune the large tree down to optimal size.
📐 Mathematical Summary
Tree Objective
\[\min \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2\]Cost-Complexity Objective
\[\min \sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|\]🧠 Key Insights
- Decision trees easily overfit
- Small tree → high bias, low variance
- Large tree → low bias, high variance
- Best approach = grow large tree then prune
- Cross-validation selects optimal tree size
🌟 Final Insight
Optimal decision trees balance fit and complexity using cost-complexity pruning and cross-validation.