Bagged Trees & Random Forests
1. Bagged Trees (Bootstrap Aggregation)
Bagging = Train many decision trees on bootstrap samples and combine predictions.
Procedure
- Draw B bootstrap datasets from the original data (each sampled with replacement, same size as the original)
- Train a decision tree on each dataset
- Combine the B predictions
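The procedure above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the base learner is a hypothetical one-split regression tree (`fit_stump`) on a single feature rather than a full decision tree, and the names `bagged_fit` and `bagged_predict` are made up for this example. It covers the regression case (averaging) only.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) on 1-D inputs by least squares."""
    best = None  # (sse, threshold, left_mean, right_mean)
    for t in sorted(set(xs))[:-1]:  # every distinct x except the largest is a candidate
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x identical: fall back to predicting the mean
        overall = mean(ys)
        return lambda x: overall
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagged_fit(xs, ys, n_trees=25, seed=0):
    """Train one stump per bootstrap dataset."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagged_predict(models, x):
    """Regression aggregation: average the individual predictions."""
    return mean(m(x) for m in models)

# Toy data (an assumption for the demo): a noiseless step at x = 2.
xs = [i / 10 for i in range(40)]
ys = [0.0 if x < 2 else 1.0 for x in xs]
models = bagged_fit(xs, ys, n_trees=25, seed=0)
print(bagged_predict(models, 0.5), bagged_predict(models, 3.5))
```

For classification, `bagged_predict` would take a majority vote over the trees instead of a mean.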
Regression:
\[\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{(b)}(x)\]
Classification:
\[\hat{y} = \text{majority vote}(\hat{y}^{(1)},\dots,\hat{y}^{(B)})\]
Why Bagging Works: Variance Reduction
Averaging B independent estimators, each with variance $\sigma^2$, reduces variance:
\[Var(\bar{Z}) = \frac{\sigma^2}{B}\]
Total error:
\[MSE = Bias^2 + Variance\]
Bagging mainly reduces variance (trees are high-variance models).
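The $\sigma^2/B$ rule can be checked with a quick simulation. The sketch below stands in for "independent estimators" with independent Gaussian draws, which is an idealization. The function name `variance_of_average` and the values σ = 0.5, B = 25 and 20,000 trials are illustrative choices, not part of the notes.

```python
import random
from statistics import pvariance

def variance_of_average(n_estimators, sigma=0.5, n_trials=20000, seed=0):
    """Empirical variance of the mean of n_estimators independent N(0, sigma^2) draws."""
    rng = random.Random(seed)
    averages = []
    for _ in range(n_trials):
        draws = [rng.gauss(0.0, sigma) for _ in range(n_estimators)]
        averages.append(sum(draws) / n_estimators)
    return pvariance(averages)

single = variance_of_average(1)    # should be close to sigma^2 = 0.25
bagged = variance_of_average(25)   # should be close to sigma^2 / 25 = 0.01
print(f"B=1: {single:.4f}   B=25: {bagged:.4f}   sigma^2/25 = {0.25 / 25:.4f}")
```

The simulated values only approximate the formula, and the match depends on the draws being truly independent.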
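Numerical Example
Suppose a single decision tree has:
- Bias² = 0.04
- Variance = 0.25
Then:
\[MSE_{single} = 0.04 + 0.25 = 0.29\]
If we bag B = 25 independent trees:
\[Variance_{bag} = \frac{0.25}{25} = 0.01\]
\[MSE_{bag} = 0.04 + 0.01 = 0.05\]
The MSE drops from 0.29 to 0.05. This is an idealized best case: bagged trees are trained on overlapping data, so in practice they are correlated and the reduction is smaller.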
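The arithmetic above can be reproduced for other values of B. The helper name `bagged_mse` is hypothetical, and it assumes independent trees just as the example does. The point it makes is that the bias term is untouched, so the MSE can never fall below Bias² = 0.04.

```python
def bagged_mse(bias_sq, variance, n_trees):
    """MSE of an average of n_trees independent, identically distributed estimators."""
    return bias_sq + variance / n_trees

# Same bias and variance as the numerical example, for several ensemble sizes.
for n_trees in (1, 5, 25, 100):
    print(f"B = {n_trees:>3}: MSE = {bagged_mse(0.04, 0.25, n_trees):.4f}")
```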
2. Random Forest
Random Forest = Bagging + Feature Randomness
Key idea: decorrelate the trees so that averaging them reduces variance further.
How it Works
- Still use bootstrap sampling
- But when splitting a node:
  - Instead of considering all $p$ features
  - Randomly choose $m$ features (a fresh subset at every split)
  - Split using only those $m$
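The per-split feature sampling can be sketched as follows. All names here (`default_m`, `candidate_features`, `best_split`) are hypothetical, the split criterion is plain least squares, and rounding the common defaults ($\sqrt{p}$ for classification, $p/3$ for regression) down to an integer is an assumption of this sketch.

```python
import math
import random

def default_m(p, task="classification"):
    """Common default subset sizes; rounding down is an assumption of this sketch."""
    if task == "classification":
        return max(1, math.isqrt(p))
    return max(1, p // 3)

def candidate_features(p, m, rng):
    """Draw m distinct feature indices out of p, without replacement."""
    return rng.sample(range(p), m)

def best_split(rows, ys, features):
    """Least-squares split search restricted to `features`.

    Returns (sse, feature_index, threshold), or None if no split exists.
    """
    best = None
    for j in features:
        for t in sorted({row[j] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, ys) if row[j] <= t]
            right = [y for row, y in zip(rows, ys) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, j, t)
    return best

# Toy node with p = 3 features; feature 0 separates the targets perfectly.
rng = random.Random(0)
rows = [(0, 5, 1), (1, 3, 0), (2, 4, 1), (3, 1, 0)]
ys = [0, 0, 1, 1]
features = candidate_features(p=3, m=2, rng=rng)  # redrawn at every node in a real forest
print(features, best_split(rows, ys, features))
```

When the sampled subset happens to exclude feature 0, the node has to settle for a worse split. That forced variety is what makes the trees differ from one another.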
Typical choice:
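\[m = \sqrt{p} \quad (\text{classification})\]
\[m = \frac{p}{3} \quad (\text{regression})\]
Why Random Forest Beats Bagging
If trees are highly correlated, averaging does not reduce variance much.
Variance of the ensemble:
\[Var_{RF} = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\]
Where:
- $\sigma^2$ = variance of a single tree
- $\rho$ = correlation between trees
- $B$ = number of trees
- Smaller $\rho$ → stronger variance reduction
The second term vanishes as $B$ grows, but the first term $\rho\sigma^2$ does not, so only lowering $\rho$ can push the variance further down.
Random feature selection → lower correlation → lower variance → lower error.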
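To see how much the correlation matters, the formula can be evaluated and compared with a simulation. In this sketch the "trees" are equicorrelated Gaussian variables built from a shared component plus an individual one, which is a modelling assumption. The names `ensemble_variance` and `simulate` are hypothetical, and σ² = 0.25 with B = 25 reuses the values from the numerical example.

```python
import math
import random
from statistics import pvariance

def ensemble_variance(sigma_sq, rho, n_trees):
    """Closed form: rho * sigma^2 + (1 - rho) / B * sigma^2."""
    return rho * sigma_sq + (1 - rho) / n_trees * sigma_sq

def simulate(sigma_sq, rho, n_trees, n_trials=20000, seed=0):
    """Empirical variance of the mean of n_trees variables with pairwise correlation rho."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma_sq)
    averages = []
    for _ in range(n_trials):
        shared = rng.gauss(0.0, 1.0)  # component common to every tree
        total = 0.0
        for _ in range(n_trees):
            own = rng.gauss(0.0, 1.0)  # component specific to one tree
            total += sigma * (math.sqrt(rho) * shared + math.sqrt(1 - rho) * own)
        averages.append(total / n_trees)
    return pvariance(averages)

results = {rho: simulate(0.25, rho, 25) for rho in (0.0, 0.5, 0.9)}
for rho, sim in results.items():
    print(f"rho = {rho}: formula = {ensemble_variance(0.25, rho, 25):.4f}, simulated = {sim:.4f}")
```

With ρ = 0 the formula reduces to the independent case, σ²/B. With ρ near 1, adding trees barely helps.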
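3. Random Forest Behavior
As the number of trees increases:
- Training error decreases
- Test error stabilizes
- Adding more trees does not by itself cause overfitting
This is because averaging over more trees only stabilizes the prediction; it does not make the model more flexible.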
4. Example: Effect of m (Feature Subset)
Suppose:
- p = 100 features
Compare:
| m | Behavior |
|---|---|
| m = 100 (Bagging) | Trees very similar → high correlation |
| m = 50 | Slight decorrelation |
| m = √100 = 10 | Strong decorrelation → often the best performance |
| m = 1 | Too random → weak trees |
Typical starting point (for classification; treat m as a tuning parameter):
\[m \approx \sqrt{p}\]
5. Random Forest Advantages
- Strong prediction accuracy
- Handles nonlinear relationships
- Works with high-dimensional data
- Resistant to overfitting
- Implicit feature selection
- Robust to noise
6. Limitations
- Less interpretable than a single tree
- High computation cost for many trees
- Large memory usage
- Can struggle when only a few of many features carry signal, because most random subsets of m features then contain no useful feature
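7. Out-of-Bag (OOB) Error
Each tree is trained on a bootstrap sample that contains about 63% of the distinct data points.
The remaining ~37% are that tree's out-of-bag samples.
Predicting each point with only the trees that did not see it gives a validation-style error estimate that is close to unbiased.
This often makes separate cross-validation unnecessary.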
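The 63% / 37% split follows from sampling with replacement: a given point is missed by all n draws with probability $(1 - 1/n)^n \approx e^{-1} \approx 0.368$. A small simulation can confirm this. The function name `oob_fraction`, n = 1000 and 200 repetitions are arbitrary choices for the illustration.

```python
import random

def oob_fraction(n, rng):
    """Fraction of n data points that a single bootstrap sample of size n leaves out."""
    in_bag = {rng.randrange(n) for _ in range(n)}  # distinct indices that were drawn
    return 1 - len(in_bag) / n

rng = random.Random(0)
n = 1000
simulated = sum(oob_fraction(n, rng) for _ in range(200)) / 200
expected = (1 - 1 / n) ** n  # probability that a given point is never drawn
print(f"simulated OOB fraction: {simulated:.3f}, (1 - 1/n)^n: {expected:.3f}")
```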
8. Random Forest vs Bagging
| | Bagging | Random Forest |
|---|---|---|
| Key idea | Bootstrap + averaging | Bagging + feature randomness |
| Features considered per split | All $p$ | Random subset of $m$ |
| Goal | Reduce variance | Reduce variance |
| Difference | Trees remain correlated | Decorrelates trees |
9. When Random Forest Works Best
- High variance models (decision trees)
- Nonlinear data
- Many features
- Complex decision boundary
10. Summary
- Bagging reduces variance via averaging
- Random Forest reduces variance even more by decorrelating trees
- More trees → stable performance
- Feature randomness is key
- OOB error gives built-in validation