Bagged Trees & Random Forests

🌲 Bagged Trees & 🌳 Random Forests


🎯 1. Bagged Trees (Bootstrap Aggregation)

Bagging = Train many decision trees on bootstrap samples and combine predictions.

Procedure

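  1. Draw B bootstrap datasets from the original data (each sampled with replacement, same size as the original)
  2. Train a decision tree on each dataset
  3. Combine the B predictions (average for regression, majority vote for classification)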

Regression:

\[\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{(b)}(x)\]

Classification:

\[\hat{y} = \text{majority vote}(\hat{y}^{(1)},...,\hat{y}^{(B)})\]
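The three steps and the two combination rules above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the base learner is a one-split regression "stump" on a single feature rather than a full decision tree, and the names `fit_stump`, `bagged_fit`, `bagged_predict`, and `majority_vote` are made up for this example, not taken from any library.

```python
import random
from collections import Counter
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a 'stump') on 1-D inputs."""
    best = None  # (sse, threshold, left_mean, right_mean)
    for t in sorted(set(xs))[1:]:  # thresholds that leave both sides non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # only one distinct x value: fall back to the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def bagged_fit(xs, ys, B, rng):
    """Steps 1 and 2: draw B bootstrap samples and fit one model on each."""
    n = len(xs)
    models = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Step 3 (regression): average the B predictions."""
    return mean(m(x) for m in models)


def majority_vote(labels):
    """Step 3 (classification): most common predicted label."""
    return Counter(labels).most_common(1)[0][0]


# Toy data (an assumption for the demo): a noisy step function.
rng = random.Random(0)
xs = [i / 10 for i in range(40)]
ys = [(1.0 if x >= 2.0 else 0.0) + rng.gauss(0.0, 0.3) for x in xs]
models = bagged_fit(xs, ys, B=50, rng=rng)
print(bagged_predict(models, 0.5), bagged_predict(models, 3.5))
print(majority_vote(["spam", "ham", "spam"]))
```

Each stump puts its split in a slightly different place because it sees a different bootstrap sample, and the averaged prediction smooths over those differences.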

Why Bagging Works: Variance Reduction 📉

Averaging $B$ independent estimators, each with variance $\sigma^2$, divides the variance by $B$:

\[Var(\bar{Z}) = \frac{\sigma^2}{B}\]

Total error:

\[MSE = Bias^2 + Variance\]

👉 Bagging mainly reduces variance and leaves bias roughly unchanged (trees are high-variance models).


Numerical Example 🎲

Suppose a single decision tree has:

  • Bias² = 0.04
  • Variance = 0.25

Then:

\[MSE_{single} = 0.04 + 0.25 = 0.29\]

If we bag B = 25 trees and treat them as independent (an idealization, since trees grown on bootstrap samples of the same data are correlated):

\[Variance_{bag} = \frac{0.25}{25} = 0.01\]

\[MSE_{bag} = 0.04 + 0.01 = 0.05\]

➡ In this idealized case the error drops from 0.29 → 0.05
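A quick Monte Carlo check of these numbers is sketched below. Everything in it is an assumption made for illustration: the true value is taken to be 0, each "tree" is modelled as a bias of 0.2 (so Bias² = 0.04) plus independent Gaussian noise with standard deviation 0.5 (so Variance = 0.25), and `simulated_mse` is a made-up helper name. Real bagged trees are correlated, so this reproduces the idealized arithmetic only.

```python
import random


def simulated_mse(B, bias=0.2, sd=0.5, trials=20000, seed=0):
    """Estimate the MSE of the average of B independent estimators of a true value 0."""
    rng = random.Random(seed)
    total_sq_error = 0.0
    for _ in range(trials):
        # Each estimator = true value (0) + bias + independent noise.
        avg = sum(bias + rng.gauss(0.0, sd) for _ in range(B)) / B
        total_sq_error += avg ** 2  # squared error against the true value 0
    return total_sq_error / trials


mse_single = simulated_mse(B=1)
mse_bag = simulated_mse(B=25)
# Expect values close to the 0.29 and 0.05 computed by hand (up to simulation noise).
print(mse_single, mse_bag)
```

The bias term survives the averaging untouched, which is why the bagged MSE cannot fall below 0.04 in this setup no matter how large B gets.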


🌳 2. Random Forest

Random Forest = Bagging + Feature Randomness

Key idea: decorrelate the trees so that averaging them helps more.

How it Works

  • Still use bootstrap sampling
  • But when splitting a node:
    • Instead of considering all $p$ features
    • Randomly choose $m$ features (a fresh subset at every split)
    • Pick the best split using only those $m$

Typical choice:

\[m = \sqrt{p} \quad (\text{classification})\]

\[m = \frac{p}{3} \quad (\text{regression})\]
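The per-split feature draw can be sketched in a few lines. The helper names `default_m` and `candidate_features` are hypothetical, and rounding down to an integer is an assumption here (implementations differ on how they round).

```python
import math
import random


def default_m(p, task):
    """Rule-of-thumb subset size: sqrt(p) for classification, p/3 for regression."""
    if task == "classification":
        return max(1, math.isqrt(p))  # floor of sqrt(p), at least 1
    if task == "regression":
        return max(1, p // 3)  # floor of p/3, at least 1
    raise ValueError("task must be 'classification' or 'regression'")


def candidate_features(p, m, rng):
    """Draw m distinct feature indices out of p; called once per split."""
    return rng.sample(range(p), m)


rng = random.Random(0)
p = 100
m = default_m(p, "classification")
print(m, candidate_features(p, m, rng))  # a different subset on every call
```

Because the subset is redrawn at every node, a single dominant feature is only available at a fraction of the splits, which is what pushes the trees apart.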

Why Random Forest Beats Bagging 🧠

If trees are highly correlated → averaging does not reduce variance much.

Variance of ensemble:

\[Var_{RF} = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\]

Where:

  • $\sigma^2$ = variance of a single tree
  • $B$ = number of trees
  • $\rho$ = pairwise correlation between trees
  • Smaller $\rho$ → stronger variance reduction

Random feature selection ↓ correlation → ↓ variance → ↓ error.
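For readers who want to see where the ensemble variance formula comes from, here is a short derivation. It assumes the tree predictions $Z_1, \dots, Z_B$ are identically distributed, each with variance $\sigma^2$, and that every pair has the same correlation $\rho$ (the $Z_b$ notation is introduced here for the derivation):

\[
\begin{aligned}
Var\left(\frac{1}{B}\sum_{b=1}^{B} Z_b\right)
&= \frac{1}{B^2}\left[\sum_{b=1}^{B} Var(Z_b) + \sum_{b \neq b'} Cov(Z_b, Z_{b'})\right] \\
&= \frac{1}{B^2}\left[B\sigma^2 + B(B-1)\rho\sigma^2\right] \\
&= \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2
\end{aligned}
\]

Setting $\rho = 0$ recovers the independent case $\sigma^2 / B$, while letting $B \to \infty$ leaves a floor of $\rho\sigma^2$ that only lower correlation can reduce.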


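📊 3. Random Forest Behavior

As the number of trees increases:

  • Training error ↓
  • Test error stabilizes
  • Adding more trees does not by itself cause overfitting

This is because averaging over more trees only stabilizes the prediction; it does not make the model more flexible.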
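One way to see the plateau is to plug growing values of B into the ensemble variance formula from section 2. The sketch below does that with $\sigma^2 = 0.25$ from the numerical example and an assumed correlation of $\rho = 0.3$ (a made-up value for illustration); `ensemble_variance` is a hypothetical helper name.

```python
def ensemble_variance(rho, sigma2, B):
    """Variance of the average of B equally correlated predictors."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / B


rho, sigma2 = 0.3, 0.25  # rho is an assumed value, not a measured one
for B in (1, 10, 100, 1000):
    print(B, ensemble_variance(rho, sigma2, B))
print("floor:", rho * sigma2)  # the limit as B grows without bound
```

The variance keeps shrinking but never drops below the floor, so extra trees bring diminishing returns rather than harm.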


🎲 4. Example: Effect of m (Feature Subset)

Suppose:

  • p = 100 features

Compare:

| m | Behavior |
|---|---|
| m = 100 (Bagging) | Trees very similar → high correlation |
| m = 50 | Slight decorrelation |
| m = √100 = 10 | Strong decorrelation → often the best performance |
| m = 1 | Too random → weak trees |

Typical starting point (the classification default from section 2):

\[m \approx \sqrt{p}\]

📌 5. Random Forest Advantages 👍

  • Strong prediction accuracy
  • Handles nonlinear relationships
  • Works with high-dimensional data
  • Resistant to overfitting
  • Implicit feature selection
  • Robust to noise

⚠️ 6. Limitations 👎

  • Less interpretable than a single tree
  • High computational cost with many trees
  • Large memory usage
  • Can struggle with very sparse signals

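📈 7. Out-of-Bag (OOB) Error

Each tree is trained on a bootstrap sample that contains about 63% of the distinct observations.

The remaining ~37% are that tree's Out-of-Bag samples.

Predict each observation using only the trees that did not see it → a nearly unbiased estimate of test error.

Often no separate cross-validation is needed 👍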
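The 63% / 37% split follows from sampling n rows with replacement: a given row is missed by all n draws with probability $(1 - 1/n)^n \approx e^{-1} \approx 0.368$. The sketch below checks this empirically; `bootstrap_split` is a hypothetical helper name and n = 10000 is an arbitrary choice.

```python
import math
import random


def bootstrap_split(n, rng):
    """Return (in-bag indices with repeats, sorted out-of-bag indices) for one tree."""
    in_bag = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
    oob = sorted(set(range(n)) - set(in_bag))  # rows never drawn
    return in_bag, oob


rng = random.Random(0)
n = 10000
in_bag, oob = bootstrap_split(n, rng)
print("distinct in-bag fraction:", len(set(in_bag)) / n)
print("out-of-bag fraction:", len(oob) / n)
print("theoretical OOB limit:", math.exp(-1))
```

In a full forest this split is repeated for every tree, so each row ends up out-of-bag for roughly a third of the trees and can be scored by those trees alone.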


8. Random Forest vs Bagging

| Method / Aspect | Key Idea |
|---|---|
| Bagging | Bootstrap + averaging |
| Random Forest | Bagging + feature randomness |
| Shared goal | Reduce variance |
| Difference | RF decorrelates trees |

🧠 9. When Random Forest Works Best

  • High-variance base models (decision trees)
  • Nonlinear data
  • Many features
  • Complex decision boundary

🚀 10. Summary

  • Bagging reduces variance via averaging
  • Random Forest reduces variance even more by decorrelating trees
  • More trees → stable performance
  • Feature randomness is key
  • OOB error gives built-in validation

This post is licensed under CC BY 4.0 by the author.