Post

Feature Selection

Feature Selection

📊 Feature Selection and Stepwise Methods


Feature Selection

Motivation

  • Sometimes the inverse matrix does not exist.
  • Sometimes the inverse matrix becomes numerically unstable.

Example:

\[x + y = 3\] \[2x + 2y \approx 6.0000000000000001\]

This leads to a near-singular matrix and unstable coefficient estimates.

In machine learning, instead of manually choosing variables, we can let the model evaluate which variables are useful.


Best Subset Selection

Idea

Compute least squares for all possible subsets of predictors, then choose the best model based on a criterion balancing training error and model size.


Procedure

1. Null Model

  • Let \(M_0\) be the model with no predictors.
  • It predicts the sample mean.

2. For \(k = 1,2,\dots,p\)

  • Fit all models containing exactly \(k\) predictors.

Number of models:

\[\binom{p}{k} = \frac{p!}{k!(p-k)!}\]
  • Choose the best \(M_k\) using:
    • Smallest RSS
    • Largest \(R^2\)

3. Final Model Selection

Choose the best model among \(M_0, M_1, \dots, M_p\) using:

  • Prediction error
  • \[R^2\]
  • Adjusted \(R^2\)
  • AIC / BIC / \(C_p\)

Key Takeaway

  • Searches all possible subsets
  • Finds globally best model
  • Computationally expensive

Issues with Best Subset Selection

1. Computational Limitation

Total number of models:

\[2^p\]

Example: \(p = 40\) → over 1 billion models.


2. Statistical Problems

  • Large search space → overfitting
  • High variance estimates
  • Curse of dimensionality

3. Practical Alternative

Stepwise methods explore fewer models while keeping good performance.


Forward Stepwise Selection

Idea

Start with no predictors and add one variable at a time that gives the largest improvement.


Procedure

1. Null Model

Start with \(M_0\).


2. For \(k = 0,1,\dots,p-1\)

  • Consider \(p-k\) models by adding one new predictor.
  • Choose best model \(M_{k+1}\) using:
    • Smallest RSS
    • Largest \(R^2\)

3. Final Selection

Choose best among \(M_0,\dots,M_p\) using:

  • Prediction error
  • Adjusted \(R^2\)
  • AIC / BIC / \(C_p\)

Key Takeaway

  • Much faster than best subset
  • Searches restricted model space
  • May miss global optimum

Complexity

Stepwise:

\[O(p^2)\]

Best subset:

\[O(2^p)\]

Backward Stepwise Selection

Idea

Start from the full model and remove the least useful predictor iteratively.


Procedure

1. Full Model

Start with \(M_p\) (all predictors).


2. For \(k = p,p-1,\dots,1\)

  • Remove one predictor from \(M_k\)
  • Choose best \(M_{k-1}\) using:
    • Smallest RSS
    • Largest \(R^2\)

3. Final Selection

Choose best among \(M_0,\dots,M_p\) using:

  • Prediction error
  • Adjusted \(R^2\)
  • AIC / BIC / \(C_p\)

Key Takeaway

  • Starts with full model
  • Efficient vs best subset
  • Requires \(n > p\)

Forward Stagewise Selection

Idea

Add predictors gradually without refitting previous coefficients.
A predictor may be selected multiple times.


Procedure

1. Null Model

Start with \(M_0\).


2. For \(k = 1,2,\dots,K\)

  • For each predictor \(j\):
    • Take a small step in coefficient \(\beta_j\)
  • Choose best \(M_k\) using:
    • Smallest RSS
    • Largest \(R^2\)

3. Final Selection

Choose best among \(M_0,\dots,M_K\) using:

  • Prediction error
  • Adjusted \(R^2\)
  • AIC / BIC / \(C_p\)

Key Takeaway

  • Coefficients updated gradually
  • Predictor may be selected multiple times
  • Related to Lasso and boosting
  • More stable than forward stepwise

Comparison

MethodCoefficient UpdateFlexibilityGreediness
StepwiseRefit all coefficients each stepHigherGreedy
StagewiseUpdate only one coefficient stepLowerMore greedy
This post is licensed under CC BY 4.0 by the author.