Feature Selection

Posted Feb 4, 2026 Updated Apr 28, 2026

2 min read

Feature Selection

📊 Feature Selection and Stepwise Methods

Feature Selection

Motivation

Sometimes the inverse matrix does not exist.
Sometimes the inverse matrix becomes numerically unstable.

Example:

\[x + y = 3\] \[2x + 2y \approx 6.0000000000000001\]

This leads to a near-singular matrix and unstable coefficient estimates.

In machine learning, instead of manually choosing variables, we can let the model evaluate which variables are useful.

Best Subset Selection

Idea

Compute least squares for all possible subsets of predictors, then choose the best model based on a criterion balancing training error and model size.

Procedure

1. Null Model

Let \(M_0\) be the model with no predictors.
It predicts the sample mean.

2. For \(k = 1,2,\dots,p\)

Fit all models containing exactly \(k\) predictors.

Number of models:

\[\binom{p}{k} = \frac{p!}{k!(p-k)!}\]

Choose the best \(M_k\) using:
- Smallest RSS
- Largest \(R^2\)

3. Final Model Selection

Choose the best model among \(M_0, M_1, \dots, M_p\) using:

Prediction error
\[R^2\]
Adjusted \(R^2\)
AIC / BIC / \(C_p\)

Key Takeaway

Searches all possible subsets
Finds globally best model
Computationally expensive

Issues with Best Subset Selection

1. Computational Limitation

Total number of models:

\[2^p\]

Example: \(p = 40\) → over 1 billion models.

2. Statistical Problems

Large search space → overfitting
High variance estimates
Curse of dimensionality

3. Practical Alternative

Stepwise methods explore fewer models while keeping good performance.

Forward Stepwise Selection

Idea

Start with no predictors and add one variable at a time that gives the largest improvement.

Procedure

1. Null Model

Start with \(M_0\).

2. For \(k = 0,1,\dots,p-1\)

Consider \(p-k\) models by adding one new predictor.
Choose best model \(M_{k+1}\) using:
- Smallest RSS
- Largest \(R^2\)

3. Final Selection

Choose best among \(M_0,\dots,M_p\) using:

Prediction error
Adjusted \(R^2\)
AIC / BIC / \(C_p\)

Key Takeaway

Much faster than best subset
Searches restricted model space
May miss global optimum

Complexity

Stepwise:

\[O(p^2)\]

Best subset:

\[O(2^p)\]

Backward Stepwise Selection

Idea

Start from the full model and remove the least useful predictor iteratively.

Procedure

1. Full Model

Start with \(M_p\) (all predictors).

2. For \(k = p,p-1,\dots,1\)

Remove one predictor from \(M_k\)
Choose best \(M_{k-1}\) using:
- Smallest RSS
- Largest \(R^2\)

3. Final Selection

Choose best among \(M_0,\dots,M_p\) using:

Prediction error
Adjusted \(R^2\)
AIC / BIC / \(C_p\)

Key Takeaway

Starts with full model
Efficient vs best subset
Requires \(n > p\)

Forward Stagewise Selection

Idea

Add predictors gradually without refitting previous coefficients.
A predictor may be selected multiple times.

Procedure

1. Null Model

Start with \(M_0\).

2. For \(k = 1,2,\dots,K\)

For each predictor \(j\):
- Take a small step in coefficient \(\beta_j\)
Choose best \(M_k\) using:
- Smallest RSS
- Largest \(R^2\)

3. Final Selection

Choose best among \(M_0,\dots,M_K\) using:

Prediction error
Adjusted \(R^2\)
AIC / BIC / \(C_p\)

Key Takeaway

Coefficients updated gradually
Predictor may be selected multiple times
Related to Lasso and boosting
More stable than forward stepwise

Comparison

Method	Coefficient Update	Flexibility	Greediness
Stepwise	Refit all coefficients each step	Higher	Greedy
Stagewise	Update only one coefficient step	Lower	More greedy

Machince Learning, Machince Learning - Foundation

This post is licensed under CC BY 4.0 by the author.

📊 Feature Selection and Stepwise Methods

Feature Selection

Motivation

Best Subset Selection

Idea

Procedure

1. Null Model

2. For \(k = 1,2,\dots,p\)

3. Final Model Selection

Key Takeaway

Issues with Best Subset Selection

1. Computational Limitation

2. Statistical Problems

3. Practical Alternative

Forward Stepwise Selection

Idea

Procedure

1. Null Model

2. For \(k = 0,1,\dots,p-1\)

3. Final Selection

Key Takeaway

Complexity

Backward Stepwise Selection

Idea

Procedure

1. Full Model

2. For \(k = p,p-1,\dots,1\)

3. Final Selection

Key Takeaway

Forward Stagewise Selection

Idea

Procedure

1. Null Model

2. For \(k = 1,2,\dots,K\)

3. Final Selection

Key Takeaway

Comparison

Trending Tags