Support Vector Machine (SVM) โ Soft Margin with Slack Variables
๐ Support Vector Machine (SVM) โ Soft Margin with Slack Variables
๐ช๏ธ Noisy & NonโSeparable Data
Real-world data are often not perfectly separable due to noise, overlap, or outliers.
The classical hardโmargin SVM fails in this case because it enforces:
for all samples, which may be impossible.
๐ Solution: introduce Slack Variables โ Soft Margin SVM
๐ง Idea of Slack Variables
We allow violations of the margin constraint by introducing:
\[\xi_i \ge 0\]Modified constraint:
\[y_i (w^T x_i + b) \ge 1 - \xi_i\]Interpretation:
| Case | Meaning |
|---|---|
| $\xi_i = 0$ | Correct & outside margin |
| $0 < \xi_i < 1$ | Inside margin |
| $\xi_i = 1$ | On wrong side of boundary |
| $\xi_i > 1$ | Misclassified |
Slack definition:
\[\xi_i = \max(0, 1 - y_i(w^T x_i + b))\]This becomes Hinge Loss.
๐ฏ Soft Margin Primal Optimization
Hard margin objective:
\[\min_w \frac{1}{2} \|w\|^2\]Soft margin adds penalty for violations:
\[\min_{w,b,\xi} \quad \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i\]subject to:
\[y_i(w^T x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0\]โ๏ธ Meaning of Hyperparameter $C$
\[C \uparrow \Rightarrow \text{penalize errors more โ smaller margin}\] \[C \downarrow \Rightarrow \text{allow more errors โ larger margin}\]๐งฎ Lagrangian Formulation
We introduce multipliers:
- $\alpha_i \ge 0$ for margin constraint
- $\mu_i \ge 0$ for $\xi_i \ge 0$
Lagrangian:
\[\mathcal{L}(w,b,\xi,\alpha,\mu) = \frac{1}{2} \|w\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \big(y_i(w^T x_i + b) - 1 + \xi_i\big) - \sum_{i=1}^n \mu_i \xi_i\]๐ KKT Conditions
Derivative w.r.t $w$
\[\frac{\partial \mathcal{L}}{\partial w} = w - \sum_{i=1}^n \alpha_i y_i x_i = 0\] \[\Rightarrow w = \sum_{i=1}^n \alpha_i y_i x_i\]Derivative w.r.t $b$
\[\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^n \alpha_i y_i = 0\] \[\Rightarrow \sum_{i=1}^n \alpha_i y_i = 0\]Derivative w.r.t $\xi_i$
\[\frac{\partial \mathcal{L}}{\partial \xi_i} = C - \alpha_i - \mu_i = 0\]Since $\mu_i \ge 0$:
\[0 \le \alpha_i \le C\]๐ฏ Dual Optimization Problem
Substitute $w$ into Lagrangian:
\[\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j\]subject to:
\[0 \le \alpha_i \le C\] \[\sum_{i=1}^n \alpha_i y_i = 0\]โจ Final Decision Function
After solving $\alpha$:
\[w = \sum_{i=1}^n \alpha_i y_i x_i\]Prediction:
\[f(x) = \text{sign}\left( \sum_{i=1}^n \alpha_i y_i x_i^T x + b \right)\]Only points with $\alpha_i > 0$ โ Support Vectors
๐ Geometric Interpretation
Margin width:
\[\text{Margin} = \frac{2}{\|w\|}\]Slack allows some points inside margin or misclassified while still maximizing margin.
๐ Effect of $C$
| Small $C$ | Large Margin | More tolerant |
|---|---|---|
| Large $C$ | Small Margin | Less tolerant |
๐ฅ Connection to Hinge Loss
Softโmargin SVM equals minimizing:
\[\frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \max(0, 1 - y_i(w^T x_i + b))\]Regularization + Hinge Loss
๐ Kernel Extension (Optional)
Replace inner product:
\[x_i^T x_j \rightarrow K(x_i, x_j)\]Dual becomes:
\[\max_{\alpha} \sum \alpha_i - \frac{1}{2}\sum\sum \alpha_i \alpha_j y_i y_j K(x_i,x_j)\]๐ Summary
- Hard margin โ requires separable data
- Slack variables โ allow violations
- $C$ controls margin vs errors
- Dual โ depends only on inner products
- Support vectors define the boundary