🌲 Bagging & Ensemble Method


🎯 1. What is Bagging (Bootstrap Aggregation)?

Bagging draws $B$ bootstrap datasets from the original training data (each sampled with replacement, typically the same size as the original) and trains one model on each.

Each model:

\[\hat{f}^{(b)}(x), \quad b = 1,2,...,B\]

Final prediction (Regression):

\[\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{(b)}(x)\]

Final prediction (Classification):

\[\hat{y} = \text{majority vote of } \hat{y}^{(1)},...,\hat{y}^{(B)}\]

📌 Bagging averages predictions, not parameters → works for ANY model.
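The regression case above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the base learner (a one-variable least-squares line), the synthetic data, and the helper names `bootstrap_sample`, `fit_line`, `bagging_fit`, and `bagged_predict` are all hypothetical choices made for the example.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) points from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_line(points):
    """Least-squares fit of y = a + b*x; returns a prediction function."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    # Guard: a bootstrap sample may contain only one distinct x value.
    slope = 0.0 if sxx == 0 else sxy / sxx
    intercept = y_mean - slope * x_mean
    return lambda x: intercept + slope * x

def bagging_fit(data, B, seed=0):
    """Train B models, each on its own bootstrap dataset."""
    rng = random.Random(seed)
    return [fit_line(bootstrap_sample(data, rng)) for _ in range(B)]

def bagged_predict(models, x):
    """Average the B predictions (the regression rule)."""
    return sum(m(x) for m in models) / len(models)

# Usage on synthetic data: noisy samples from y = 2x + 1 (illustrative only).
noise = random.Random(42)
data = [(float(x), 2.0 * x + 1.0 + noise.gauss(0.0, 1.0)) for x in range(20)]
models = bagging_fit(data, B=50)
print(bagged_predict(models, 3.0))
```

Because only predictions are averaged, `fit_line` could be swapped for any function that returns a callable model.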


🧠 2. Why Bagging Works (Variance Reduction)

Bagging is similar to the wisdom of crowds.

If we average $n$ independent estimators $Z_1, ..., Z_n$, each with variance $\sigma^2$:

\[Var(\bar{Z}) = \frac{\sigma^2}{n}\]

Thus averaging reduces variance.
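A quick simulation can make the $\sigma^2/n$ rate concrete. The sketch below assumes Gaussian draws with mean zero; the function name `variance_of_mean` and its default arguments are hypothetical, chosen only for this illustration.

```python
import random
import statistics

def variance_of_mean(n, sigma=1.0, trials=20000, seed=0):
    """Estimate Var(mean of n independent N(0, sigma^2) draws) by Monte Carlo."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        draws = [rng.gauss(0.0, sigma) for _ in range(n)]
        means.append(sum(draws) / n)
    return statistics.pvariance(means)

# Compare the empirical variance with the theoretical sigma^2 / n.
for n in (1, 5, 25):
    print(n, variance_of_mean(n, sigma=2.0), 2.0 ** 2 / n)
```

The two printed columns should agree up to Monte Carlo noise, which shrinks as `trials` grows.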

Total error:

\[MSE(\hat{\theta}) = Var(\hat{\theta}) + Bias(\hat{\theta})^2\]

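👉 Bagging reduces variance while leaving bias roughly unchanged: averaging does not change the expected prediction, and each bootstrap model has about the same bias as a model trained on the full data.

Cost: $B$ models must be trained.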


πŸ“Š 3. Mathematical Analysis of Bagging

Let:

\[y_b(x) = h(x) + \epsilon_b(x)\]

Where:

  • $h(x)$ = true function
  • $\epsilon_b(x)$ = error of model $b$

Error of single model

\[E_{single} = E_x[(y_b(x) - h(x))^2] = E_x[\epsilon_b(x)^2]\]

Error of combined model

\[E_{comb} = E_x\left[\left(\frac{1}{B}\sum_{b=1}^{B} y_b(x) - h(x)\right)^2\right]\]

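📌 Theorem 1: Ensemble is never worse than the average model

The expected error of the ensemble ≤ the average expected error of the individual models.

Squaring is convex, so Jensen's inequality gives, for every $x$:

\[\left(\frac{1}{B}\sum_{b=1}^{B}\epsilon_b(x)\right)^2 \le \frac{1}{B}\sum_{b=1}^{B}\epsilon_b(x)^2\]

Taking $E_x$ of both sides:

\[E_{comb} \le \frac{1}{B}\sum_{b=1}^{B}E_x[\epsilon_b(x)^2]\]

When every model has the same expected error $E_{single}$, this reads:

\[E_{single} \ge E_{comb}\]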

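📌 Theorem 2: Error can shrink by 1/B

Expand:

\[E_{comb} = E_x\left[\left(\frac{1}{B}\sum_{b=1}^{B}\epsilon_b(x)\right)^2\right]\]

\[= E_x\left[\frac{1}{B^2}\sum_{b=1}^{B}\epsilon_b(x)^2 + \frac{1}{B^2}\sum_{j\ne k}\epsilon_j(x)\epsilon_k(x)\right]\]

(The sum over $j \ne k$ runs over ordered pairs, so each cross term is already counted twice.)

If the errors have zero mean and are uncorrelated across models:

\[E_x[\epsilon_j(x)\epsilon_k(x)] = 0, \quad j \ne k\]

Then, if every model has the same expected error $E_{single}$:

\[E_{comb} = \frac{1}{B^2}\sum_{b=1}^{B}E_x[\epsilon_b(x)^2] = \frac{1}{B}E_{single}\]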

📌 If models identical → no gain
📌 More independence → better ensemble
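The two extremes can be checked numerically. The sketch below is a simplified model, assuming each error $\epsilon_b$ is a standard normal draw; the function name `simulate_errors` and its parameters are hypothetical.

```python
import random

def simulate_errors(B, identical=False, trials=20000, seed=0):
    """Return Monte Carlo estimates (E_single, E_comb).

    Each error eps_b is drawn from N(0, 1). With identical=True all B models
    share one error draw, so they are perfectly correlated.
    """
    rng = random.Random(seed)
    single_sq = 0.0
    comb_sq = 0.0
    for _ in range(trials):
        if identical:
            shared = rng.gauss(0.0, 1.0)
            eps = [shared] * B
        else:
            eps = [rng.gauss(0.0, 1.0) for _ in range(B)]
        single_sq += sum(e * e for e in eps) / B  # average single-model squared error
        comb_sq += (sum(eps) / B) ** 2            # squared error of the averaged model
    return single_sq / trials, comb_sq / trials

e_single, e_comb = simulate_errors(B=10)
print("independent:", e_comb / e_single)  # compare with 1/B
e_single, e_comb = simulate_errors(B=10, identical=True)
print("identical:  ", e_comb / e_single)  # compare with 1
```

Real bagged models sit between these extremes, because bootstrap datasets drawn from the same training set overlap and the resulting errors are correlated.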


🌐 4. General Ensemble Learning

Ensemble = Combine multiple models to improve prediction.

Types

1. Bagging

  • Parallel models
  • Reduce variance
  • Example: Random Forest 🌲

2. Boosting

  • Sequential models
  • Mainly reduce bias (can also reduce variance)
  • Example: AdaBoost / Gradient Boosting ⚡

3. Stacking

  • Combine different model types using a meta-model

🔀 5. Can Different Models Be Ensembled?

YES: Heterogeneous Ensemble

You can combine:

  • Linear model + Tree + Neural Net
  • SVM + Random Forest + Logistic
  • Any models with prediction output

Common combination methods:

Averaging (Regression)

\[\hat{y} = \sum_{m=1}^{M} w_m \hat{y}_m, \quad w_m \ge 0, \quad \sum_{m=1}^{M} w_m = 1\]

Setting $w_m = 1/M$ gives the plain average.

Majority Vote (Classification)

\[\hat{y} = \arg\max_k \sum_{m=1}^{M} I(\hat{y}_m = k)\]
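Both combination rules are short enough to write directly. The sketch below is illustrative only: the function names `weighted_average` and `majority_vote` are hypothetical, and the tie-breaking rule (the label seen first wins) is an assumption of this example, not something the formula specifies.

```python
from collections import Counter

def weighted_average(predictions, weights=None):
    """Combine regression outputs; defaults to equal weights 1/M."""
    if weights is None:
        weights = [1.0 / len(predictions)] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per prediction")
    return sum(w * p for w, p in zip(weights, predictions))

def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

print(weighted_average([1.0, 3.0], weights=[0.25, 0.75]))
print(majority_vote(["cat", "dog", "cat"]))
```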

Stacking (Meta‑Learning)

Train a second model $g$ (the meta-model) on the base models' predictions:

\[\hat{y} = g(\hat{y}_1, \hat{y}_2, ..., \hat{y}_M)\]
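As one possible instance of $g$, the sketch below fits a linear meta-model over two base models by least squares. Everything here is an assumption made for illustration: the linear form of $g$, the restriction to two base models, the toy numbers, and the function name `fit_linear_meta_model`. In practice the base predictions fed to the meta-model usually come from held-out data, so that the meta-model does not simply reward base models that overfit.

```python
def fit_linear_meta_model(p1, p2, y):
    """Least-squares weights (w1, w2) such that y is approximated by w1*p1 + w2*p2.

    p1, p2: held-out predictions of two base models; y: true targets.
    Solves the 2x2 normal equations with Cramer's rule.
    """
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * t for a, t in zip(p1, y))
    b2 = sum(b * t for b, t in zip(p2, y))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("base predictions are collinear")
    w1 = (b1 * a22 - a12 * b2) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2

def stacked_predict(weights, y1_hat, y2_hat):
    """Apply the fitted meta-model g to two base predictions."""
    w1, w2 = weights
    return w1 * y1_hat + w2 * y2_hat

# Toy held-out predictions (made-up numbers, for illustration only).
p1 = [1.0, 2.0, 3.0, 4.0]
p2 = [2.0, 1.0, 4.0, 3.0]
y = [0.3 * a + 0.7 * b for a, b in zip(p1, p2)]
weights = fit_linear_meta_model(p1, p2, y)
print(weights, stacked_predict(weights, 5.0, 5.0))
```

Unlike the fixed weights used in averaging, these weights are learned from data, and they are not constrained to be positive or to sum to one.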

📌 6. When Ensemble Works Best

An ensemble improves on its members most when:

  1. Models are individually accurate (better than random guessing)
  2. Models are diverse (their errors are weakly correlated)
  3. Models are not identical copies of one another

Key idea:

\[Var(\text{ensemble}) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2\]

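Where:

  • $\sigma^2$ = variance of each individual model's prediction
  • $\rho$ = pairwise correlation between the models' predictions
  • $B$ = number of models
  • Lower $\rho$ → stronger ensemble; as $B$ grows, the variance approaches the floor $\rho\sigma^2$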
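Plugging numbers into the formula shows how correlation caps the benefit of adding models. The helper name `ensemble_variance` is hypothetical, and the sketch assumes all models share the same variance and the same pairwise correlation.

```python
def ensemble_variance(rho, sigma2, B):
    """Variance of the average of B equally correlated models.

    rho: pairwise correlation, sigma2: variance of one model, B: model count.
    """
    if B < 1:
        raise ValueError("B must be at least 1")
    return rho * sigma2 + (1.0 - rho) / B * sigma2

# With sigma2 = 1, compare an uncorrelated and a correlated ensemble.
for B in (1, 10, 100):
    print(B, ensemble_variance(0.0, 1.0, B), ensemble_variance(0.5, 1.0, B))
```

With $\rho = 0$ the variance keeps falling as $1/B$, while with $\rho = 0.5$ it can never drop below half the single-model variance, however many models are added.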

⚠️ 7. Limitations

  • High computation cost
  • Harder to interpret
  • Little gain if models highly correlated

🚀 8. Summary

  • Bagging reduces variance
  • Ensemble error ≤ average single-model error
  • Independence between models is critical
  • Can combine same or different model types
  • Foundation of Random Forest, Boosting, Stacking
This post is licensed under CC BY 4.0 by the author.