Data Preprocessing and Augmentation

Posted Feb 10, 2026


📊 Data Preprocessing and Augmentation

1️⃣ Zero-Centering & Normalization

Zero-Centering

Given dataset:

\[X \in \mathbb{R}^{N \times D}\]

Compute the per-feature mean, where $x_i$ is the $i$-th row of $X$ and $\mu \in \mathbb{R}^{D}$:

\[\mu = \frac{1}{N} \sum_{i=1}^{N} x_i\]

Zero-centered data:

\[X_{centered} = X - \mu\]

Normalization (Standardization)

Per-feature standard deviation (the square and square root are applied element-wise):

\[\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}\]

Normalized data (the division is element-wise, feature by feature):

\[X_{norm} = \frac{X - \mu}{\sigma}\]
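The two steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `standardize` and the small constant `eps` (added so that a constant feature does not cause division by zero) are assumptions that do not appear in the formulas.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Zero-center each feature, then divide by its standard deviation."""
    mu = X.mean(axis=0)        # per-feature mean, shape (D,)
    sigma = X.std(axis=0)      # population std (1/N convention), shape (D,)
    X_centered = X - mu        # broadcasting subtracts mu from every row
    # eps is an assumption: it guards against features with zero variance
    X_norm = X_centered / (sigma + eps)
    return X_centered, X_norm, mu, sigma

# Synthetic data: N = 1000 samples, D = 4 features
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
X_centered, X_norm, mu, sigma = standardize(X)
```

In a train/test setting, $\mu$ and $\sigma$ would normally be computed on the training set only and then reused for the test data.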

2️⃣ PCA (Principal Component Analysis)

Covariance matrix:

\[\Sigma = \frac{1}{N} X_{centered}^T X_{centered}\]

Eigendecomposition (the columns of $U$ are the eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues):

\[\Sigma = U \Lambda U^T\]

Projection:

\[X_{PCA} = X_{centered} U\]
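As a rough sketch, the covariance, eigendecomposition, and projection steps might look like this in NumPy. The function name `pca` and the optional argument `k` (keep only the top `k` components) are illustrative assumptions; the formulas above use the full matrix $U$.

```python
import numpy as np

def pca(X, k=None):
    """Project zero-centered data onto the eigenvectors of its covariance."""
    Xc = X - X.mean(axis=0)                 # zero-center
    cov = Xc.T @ Xc / Xc.shape[0]           # covariance with the 1/N convention
    eigvals, U = np.linalg.eigh(cov)        # eigh: for symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]       # reorder to descending variance
    eigvals, U = eigvals[order], U[:, order]
    if k is not None:                       # optional truncation (an assumption)
        eigvals, U = eigvals[:k], U[:, :k]
    return Xc @ U, U, eigvals

# Correlated synthetic data: 500 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))
X_pca, U, eigvals = pca(X)

# After projection the covariance is diagonal: the features are decorrelated
cov_pca = X_pca.T @ X_pca / X_pca.shape[0]
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, which gives real eigenvalues and orthonormal eigenvectors.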

3️⃣ Whitening

Whitened data (each PCA component is rescaled by the inverse square root of its eigenvalue):

\[X_{white} = X_{centered} U \Lambda^{-1/2}\]

After whitening:

\[Cov(X_{white}) = I\]
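A small NumPy sketch of PCA whitening, under stated assumptions: the function name `pca_whiten`, the fixed mixing matrix `M`, and the constant `eps` are all illustrative. `eps` is not part of the formula above; it is commonly added to avoid dividing by near-zero eigenvalues, and it makes the resulting covariance only approximately equal to $I$.

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    """Rotate onto the eigenbasis, then rescale each component to unit variance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, U = np.linalg.eigh(cov)
    # Dividing column j by sqrt(lambda_j) is the same as multiplying by Lambda^{-1/2}
    return (Xc @ U) / np.sqrt(eigvals + eps)

# Correlated synthetic data built from a fixed, well-conditioned mixing matrix
rng = np.random.default_rng(1)
M = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.5]])
X = rng.normal(size=(2000, 3)) @ M
X_white = pca_whiten(X)

# Should be close to the identity matrix
cov_white = X_white.T @ X_white / X_white.shape[0]
```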

4️⃣ Data Augmentation

General transformation:

\[x' = T(x)\]

Where $T$ may represent:

Translation
Rotation
Scaling
Shearing
Cropping
Noise injection
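To make $T$ concrete, here is a minimal NumPy sketch covering three of the transformations listed: translation, rotation (restricted to multiples of 90° to avoid interpolation), and Gaussian noise injection. The function names, the zero-fill policy for translated pixels, and the default parameters are assumptions chosen for illustration; image libraries typically offer richer versions of each.

```python
import numpy as np

def translate(img, dx, dy):
    """Shift an H x W x C image by (dx, dy) pixels; exposed areas are filled with zeros."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    # Source and destination windows, clamped so they stay inside the image
    src_y = slice(max(0, -dy), max(0, min(h, h - dy)))
    dst_y = slice(max(0, dy), max(0, min(h, h + dy)))
    src_x = slice(max(0, -dx), max(0, min(w, w - dx)))
    dst_x = slice(max(0, dx), max(0, min(w, w + dx)))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def add_gaussian_noise(img, std, rng):
    """Add zero-mean Gaussian noise and clip back to the 8-bit range."""
    noisy = img.astype(np.float64) + rng.normal(0.0, std, size=img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def random_augment(img, rng, max_shift=4, noise_std=5.0):
    """Compose a random translation, a random 90-degree rotation, and noise."""
    dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
    out = translate(img, int(dx), int(dy))
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotates the first two axes
    return add_gaussian_noise(out, noise_std, rng)

# Square synthetic image so that 90-degree rotations preserve the shape
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
aug = random_augment(img, rng)
```

Because the label is left unchanged, each transformation should be one that preserves the class of the input.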

5️⃣ Color Jitter (RGB → HSL)

Normalize 8-bit RGB values to $[0, 1]$:

\[R' = \frac{R}{255}, \quad G' = \frac{G}{255}, \quad B' = \frac{B}{255}\]

Define:

\[C_{max} = \max(R', G', B')\]

\[C_{min} = \min(R', G', B')\]

\[\Delta = C_{max} - C_{min}\]

Hue

\[H = \begin{cases} 0 & \Delta = 0 \\ 60^\circ \times \left(\frac{G' - B'}{\Delta} \bmod 6\right) & C_{max} = R' \\ 60^\circ \times \left(\frac{B' - R'}{\Delta} + 2\right) & C_{max} = G' \\ 60^\circ \times \left(\frac{R' - G'}{\Delta} + 4\right) & C_{max} = B' \end{cases}\]

Lightness

\[L = \frac{C_{max} + C_{min}}{2}\]

Saturation

\[S = \begin{cases} 0 & \Delta = 0 \\ \frac{\Delta}{1 - |2L - 1|} & \Delta \ne 0 \end{cases}\]
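The hue, saturation, and lightness formulas above translate fairly directly into plain Python. The function name `rgb_to_hsl` is an illustrative choice, and the comparison against the standard library's `colorsys.rgb_to_hls` (which returns hue as a fraction of a full turn, in H, L, S order) is only a sanity check.

```python
import colorsys

def rgb_to_hsl(r, g, b):
    """Convert 8-bit RGB to (H in degrees, S in [0, 1], L in [0, 1])."""
    rp, gp, bp = r / 255.0, g / 255.0, b / 255.0
    c_max, c_min = max(rp, gp, bp), min(rp, gp, bp)
    delta = c_max - c_min
    light = (c_max + c_min) / 2.0
    if delta == 0:                       # gray: hue and saturation are both 0
        return 0.0, 0.0, light
    if c_max == rp:
        hue = 60.0 * (((gp - bp) / delta) % 6)   # mod 6 keeps the hue non-negative
    elif c_max == gp:
        hue = 60.0 * ((bp - rp) / delta + 2)
    else:
        hue = 60.0 * ((rp - gp) / delta + 4)
    sat = delta / (1 - abs(2 * light - 1))
    return hue, sat, light

# Sanity check against the standard library on an arbitrary color
h, s, l = rgb_to_hsl(30, 144, 255)
ref_h, ref_l, ref_s = colorsys.rgb_to_hls(30 / 255, 144 / 255, 255 / 255)
```

A color jitter step would typically perturb one or more of these components and convert the result back to RGB.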

6️⃣ Random Cropping & Scaling

Training:

Pick a random target length $s \in [256, 480]$ (in pixels)
Resize the image so that its shorter side equals $s$
Sample a random $224 \times 224$ crop from the resized image
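The training-time procedure can be sketched as below. To keep the example dependency-free beyond NumPy, it uses a simple nearest-neighbour resize, which is an assumption made for brevity; a real pipeline would more likely use bilinear or bicubic interpolation from an image library. The function names and the `min_side`, `max_side`, and `crop` parameters are illustrative.

```python
import numpy as np

def resize_shorter_side(img, target):
    """Nearest-neighbour resize so that the shorter side equals `target` pixels."""
    h, w = img.shape[:2]
    scale = target / min(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    # Map each output pixel back to its nearest source row/column
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return img[rows][:, cols]

def random_scale_crop(img, rng, min_side=256, max_side=480, crop=224):
    """Random shorter-side length in [min_side, max_side], then a random crop."""
    side = int(rng.integers(min_side, max_side + 1))   # upper bound is exclusive
    resized = resize_shorter_side(img, side)
    h, w = resized.shape[:2]
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    return resized[top:top + crop, left:left + crop]

# Synthetic 300 x 400 RGB image
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
patch = random_scale_crop(img, rng)
```

Each call yields a different scale and crop position, so the network sees objects at varying sizes and locations across epochs.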

Testing:

Resize the image to several fixed scales
Take multiple crops at each scale
Average the predictions over all crops and scales
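A hedged sketch of the test-time procedure follows. Everything specific here is an assumption for illustration: the three scales, the choice of four corner crops plus a center crop, the nearest-neighbour resize, and `dummy_predict`, which stands in for a trained model and simply returns a softmax over mean channel intensities.

```python
import numpy as np

def resize_shorter_side(img, target):
    """Nearest-neighbour resize so that the shorter side equals `target` pixels."""
    h, w = img.shape[:2]
    scale = target / min(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return img[rows][:, cols]

def corner_and_center_crops(img, crop):
    """Four corner crops plus the center crop (one possible multi-crop scheme)."""
    h, w = img.shape[:2]
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop),
               ((h - crop) // 2, (w - crop) // 2)]
    return [img[t:t + crop, l:l + crop] for t, l in offsets]

def multi_scale_predict(img, predict_fn, scales=(256, 384, 480), crop=224):
    """Average class probabilities over every crop at every scale."""
    probs = [predict_fn(patch)
             for s in scales
             for patch in corner_and_center_crops(resize_shorter_side(img, s), crop)]
    return np.mean(probs, axis=0)

def dummy_predict(patch):
    """Hypothetical stand-in for a model: softmax over mean channel intensities."""
    logits = patch.reshape(-1, patch.shape[-1]).mean(axis=0) / 255.0
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
avg_probs = multi_scale_predict(img, dummy_predict)
```

Averaging probabilities rather than hard labels lets confident crops outweigh uncertain ones.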

🎯 Summary

Zero-centering improves optimization stability.
Normalization equalizes feature scales.
PCA decorrelates features.
Whitening enforces identity covariance.
Data augmentation improves generalization.


This post is licensed under CC BY 4.0 by the author.