Bootstrapping

Posted Feb 8, 2026 Updated Mar 25, 2026



🎲 Bootstrapping

📌 1. What is Bootstrapping?

Bootstrapping is a resampling technique used when we cannot sample additional data from the true distribution.

The true distribution is usually unknown
Our goal is often to estimate properties of that distribution

Instead of collecting new independent data, we:

👉 Repeatedly sample from the original dataset with replacement

Each bootstrap dataset:

Has the same size $n$ as the original dataset
Contains some observations more than once
May omit some observations entirely
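As a minimal sketch of the resampling step, the snippet below draws one bootstrap dataset from a toy list. The helper name `bootstrap_sample`, the toy data, and the fixed seed are illustrative assumptions, not part of any particular library.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)   # fixed seed, only so the run is reproducible
data = list(range(10))   # hypothetical toy dataset with 10 distinct items
sample = bootstrap_sample(data, rng)

print(sorted(sample))                   # same size as data, usually with repeats
print(sorted(set(data) - set(sample)))  # items that were never drawn
```

The second print lists the observations left out of this particular resample; rerunning with a different seed gives a different set.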

🔁 2. Bootstrap Procedure

Let the original dataset be:

\[Z = \{(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)\}\]

We generate $B$ bootstrap datasets:

\[Z^{*1}, Z^{*2}, \dots, Z^{*B}\]

Each is created by drawing $n$ observations with replacement from $Z$.

For each bootstrap dataset, compute the estimator:

\[\hat{\alpha}^{*b}, \quad b = 1,2,\dots,B\]

📏 Bootstrap Standard Error

\[SE_B(\hat{\alpha}) = \sqrt{ \frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{\alpha}^{*b} - \bar{\alpha}^*\right)^2 }\]

where

\[\bar{\alpha}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\alpha}^{*b}\]

This is the sample standard deviation of the $B$ bootstrap estimates, and it serves as an estimate of the standard error of $\hat{\alpha}$.
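The procedure and the standard error formula can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the function name `bootstrap_se`, the synthetic Gaussian data, the choice of the sample mean as estimator, and $B = 1000$ are all made up for the example.

```python
import math
import random
import statistics

def bootstrap_se(data, estimator, B, rng):
    """Bootstrap standard error of estimator(data), based on B resamples."""
    n = len(data)
    estimates = []
    for _ in range(B):
        # One bootstrap dataset: n draws with replacement from the original data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    # statistics.stdev divides by B - 1, as in the SE_B formula.
    return statistics.stdev(estimates)

rng = random.Random(42)                           # fixed seed for reproducibility
data = [rng.gauss(0.0, 1.0) for _ in range(200)]  # synthetic data (assumption)

se_boot = bootstrap_se(data, statistics.mean, B=1000, rng=rng)
se_formula = statistics.stdev(data) / math.sqrt(len(data))  # textbook SE of the mean

print(se_boot, se_formula)  # the two values should be close
```

The sample mean is used here only because its standard error has a closed form to compare against. The same function works for estimators such as the median, where no simple formula is available.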

⚠️ 3. Limitations of Bootstrapping

The standard bootstrap relies on one key assumption:

i.i.d. assumption

Samples must be:

\[\text{Independent and Identically Distributed (i.i.d.)}\]

If this does NOT hold (e.g. time series data):

Resampling individual observations breaks the temporal structure
Use a block bootstrap instead

Block Bootstrap

Create blocks of consecutive observations
Sample blocks with replacement
Reconstruct dataset from sampled blocks
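The three steps above can be sketched as follows. This is one simple variant (non-overlapping blocks of fixed size), and it assumes the series length is a multiple of the block size; the function name `block_bootstrap` and the toy series are hypothetical.

```python
import random

def block_bootstrap(series, block_size, rng):
    """Resample a series by drawing non-overlapping blocks with replacement."""
    if len(series) % block_size != 0:
        raise ValueError("this sketch assumes len(series) is a multiple of block_size")
    # 1. Create blocks of consecutive observations.
    blocks = [series[i:i + block_size] for i in range(0, len(series), block_size)]
    # 2. Sample as many blocks as there were originally, with replacement.
    chosen = [blocks[rng.randrange(len(blocks))] for _ in range(len(blocks))]
    # 3. Reconstruct a dataset of the original length from the sampled blocks.
    return [x for block in chosen for x in block]

rng = random.Random(0)     # fixed seed for reproducibility
series = list(range(12))   # toy "time series": 0, 1, ..., 11
resampled = block_bootstrap(series, block_size=3, rng=rng)

print(resampled)  # order inside each block of 3 is preserved
```

Within each block the original ordering survives, so short-range dependence is kept, while dependence across block boundaries is lost. The block size controls that trade-off.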

Used in:

Time series
Session‑based recommendation systems

🔍 4. Bootstrapping vs Cross‑Validation

Can bootstrap estimate prediction error?

Short answer: ❌ Not reliably (in its naive form)

Reason

Cross‑Validation

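No overlap between training and validation sets
Held-out validation data → approximately unbiased estimate

Bootstrapping

Samples are drawn with replacement
Each bootstrap dataset overlaps heavily with the original dataset
A model fit on a bootstrap dataset and evaluated on the original data has already seen most of the evaluation points → optimistically biased estimate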
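To make the overlap argument concrete, the sketch below compares index sets. It assumes a simple contiguous k-fold split (the helper `kfold_indices` is hypothetical) and a naive bootstrap setup where the model is trained on one bootstrap dataset and evaluated on all $n$ original points.

```python
import random

def kfold_indices(n, k):
    """Yield (train, validation) index lists for k contiguous folds."""
    bounds = [(i * n) // k for i in range(k + 1)]
    for f in range(k):
        lo, hi = bounds[f], bounds[f + 1]
        validation = list(range(lo, hi))
        train = list(range(0, lo)) + list(range(hi, n))
        yield train, validation

n, k = 1000, 5
rng = random.Random(1)  # fixed seed for reproducibility

# Cross-validation: largest number of indices shared by train and validation.
cv_overlap = max(len(set(tr) & set(va)) for tr, va in kfold_indices(n, k))

# Naive bootstrap: train on a resample, evaluate on all n original points.
boot_train = [rng.randrange(n) for _ in range(n)]
boot_overlap = len(set(boot_train)) / n  # share of evaluation points seen in training

print(cv_overlap)    # 0: the folds are disjoint by construction
print(boot_overlap)  # roughly 0.63
```

The cross-validation folds never share an index, while the bootstrap training set contains a large share of the very points it is later scored on.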

📊 5. Why does each bootstrap sample contain ~2/3 of the data?

Probability that a given observation is NOT selected in a single draw:

\[1 - \frac{1}{n}\]

Probability that it is never selected in $n$ independent draws:

\[\left(1 - \frac{1}{n}\right)^n\]

Taking the limit as $n \to \infty$:

\[\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = e^{-1} \approx 0.368\]

So, for large $n$:

About 36.8% of the original observations are NOT included
About 63.2% are included

👉 Each bootstrap sample contains ≈ 2/3 of the distinct original observations
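A quick numerical check of this result, as a sketch: the first loop evaluates $(1 - 1/n)^n$ for a few values of $n$, and the second part simulates resampling. The values of `n`, `B`, and the seed are arbitrary choices for illustration.

```python
import math
import random

# Exact probability of being left out, for increasing n.
for size in (5, 10, 100, 1000):
    p_missing = (1 - 1 / size) ** size
    print(size, round(p_missing, 4), round(1 - p_missing, 4))

# Simulation: average share of distinct original items per bootstrap sample.
rng = random.Random(7)  # fixed seed for reproducibility
n, B = 500, 200
fractions = []
for _ in range(B):
    drawn = {rng.randrange(n) for _ in range(n)}  # distinct indices drawn
    fractions.append(len(drawn) / n)

avg_included = sum(fractions) / B
print(avg_included, 1 - math.exp(-1))  # simulated share vs. the limit 1 - 1/e
```

Even for small $n$ the left-out probability is already close to $e^{-1}$, which is why the 63.2% figure is a useful rule of thumb.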

🚨 6. Bias of Bootstrap Error

Because each bootstrap dataset overlaps with the original data:

The naive bootstrap estimate tends to underestimate the true prediction error
The evaluation data is not independent of the training data
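The size of this bias can be seen in a small simulation. The setup is an assumption chosen for illustration: labels are coin flips unrelated to the feature, so any classifier has a true error rate of 0.5, and the model is a 1-nearest-neighbour rule (the helper `one_nn_predict` is hypothetical). The model is trained on each bootstrap dataset and scored on the full original dataset.

```python
import random

def one_nn_predict(train, x):
    """Return the label of the training point closest to x (1-nearest-neighbour)."""
    return min(train, key=lambda pt: abs(pt[0] - x))[1]

rng = random.Random(3)  # fixed seed for reproducibility
n, B = 200, 20

# Random feature, random label: the feature carries no information,
# so the true error rate of any classifier on new data is 0.5.
data = [(rng.random(), rng.randrange(2)) for _ in range(n)]

naive_errors = []
for _ in range(B):
    boot = [data[rng.randrange(n)] for _ in range(n)]  # bootstrap training set
    # Naive evaluation: score on ALL original points, including those in boot.
    wrong = sum(one_nn_predict(boot, x) != y for x, y in data)
    naive_errors.append(wrong / n)

naive_error = sum(naive_errors) / B
print(naive_error)  # well below the true error rate of 0.5
```

Points that appear in the bootstrap dataset are their own nearest neighbour and are always classified correctly. Only the roughly 36.8% of points left out can be misclassified, so the naive estimate lands near $0.5 \times 0.368 \approx 0.18$ instead of 0.5.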

🧠 7. Summary

Bootstrapping = resampling the original dataset with replacement
Used to estimate variance, standard errors, and confidence intervals
Assumes i.i.d. observations (use a block bootstrap otherwise)
Each bootstrap sample contains ~63% of the distinct original observations
The naive bootstrap cannot reliably estimate prediction error
Use cross‑validation instead for model evaluation

Machine Learning, Machine Learning - Optimization

This post is licensed under CC BY 4.0 by the author.
