Support Vector Machines with Kernels

📘 Support Vector Machines — Kernels, Logistic Comparison, and Multi‑Class


⚔️ SVM vs Logistic Regression

When classes are well separated

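SVM tends to perform better because, among all separating hyperplanes, it picks the one with the largest margin:

\[\max_{w, b} \frac{2}{\|w\|} \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 \ \text{for all } i\]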

Unregularized logistic regression keeps pushing probabilities toward 0/1. On perfectly separable data it has no finite optimum: the loss keeps decreasing as the weights grow (weights → ∞), so the fit becomes unstable.
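To make the "weights → ∞" remark concrete, here is a minimal sketch in plain Python. The 1-D data set and the model $f(x) = w x$ (no bias) are assumptions chosen for illustration; the point is only that on separable data the log loss keeps shrinking as $w$ grows.

```python
import math

def total_log_loss(w, xs, ys):
    """Sum of log(1 + exp(-y * f(x))) for the 1-D model f(x) = w * x."""
    return sum(math.log1p(math.exp(-y * w * x)) for x, y in zip(xs, ys))

# Hypothetical, perfectly separable toy data: negatives left of 0, positives right.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, -1, 1, 1]

# The loss decreases for every larger w, so no finite w minimizes it.
for w in (1.0, 5.0, 25.0):
    print(f"w = {w:5.1f}  loss = {total_log_loss(w, xs, ys):.3e}")
```

Adding a regularization term on $w$ is the usual way to restore a finite optimum.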


When classes overlap (noisy data)

Logistic regression often performs similarly or better. It minimizes log loss, which is smooth and assigns a nonzero penalty to every point:

\[\sum \log(1 + e^{-y_i f(x_i)})\]

while SVM minimizes hinge loss, which is exactly zero for every point with margin $y_i f(x_i) \ge 1$:

\[\sum \max(0, 1 - y_i f(x_i))\]
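The two losses are easiest to compare as functions of the margin $m = y_i f(x_i)$. The sketch below evaluates both per-sample losses at a few margins picked for illustration; the function names are not from any library.

```python
import math

def log_loss(margin):
    """Per-sample logistic loss log(1 + exp(-m)), where m = y * f(x)."""
    return math.log1p(math.exp(-margin))

def hinge_loss(margin):
    """Per-sample hinge loss max(0, 1 - m), where m = y * f(x)."""
    return max(0.0, 1.0 - margin)

# Negative margin = misclassified, 0 = on the boundary, >= 1 = outside the margin.
for m in (-1.0, 0.0, 0.5, 1.0, 3.0):
    print(f"margin = {m:4.1f}  log loss = {log_loss(m):.4f}  hinge = {hinge_loss(m):.4f}")
```

Hinge loss drops to exactly zero once the margin reaches 1, while log loss stays positive for every finite margin.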

Probabilities

Logistic regression outputs:

\[P(y=1|x) = \frac{1}{1 + e^{-f(x)}}\]

SVM → only a decision score and boundary; no probabilities unless the scores are calibrated (e.g., with Platt scaling).
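A minimal sketch of the two ideas, in plain Python. `sigmoid` is the logistic formula above; `calibrated_probability` is a Platt-style mapping of an SVM score through a sigmoid, where the slope `a` and offset `b` are hypothetical values that would normally be fit on held-out data.

```python
import math

def sigmoid(score):
    """Numerically stable 1 / (1 + exp(-score))."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)

def calibrated_probability(svm_score, a, b):
    """Platt-style calibration: squash a * score + b through a sigmoid.

    a and b are assumptions here; in practice they are learned from data.
    """
    return sigmoid(a * svm_score + b)

print(sigmoid(0.0))                                  # logistic output at f(x) = 0
print(calibrated_probability(1.5, a=2.0, b=-0.5))    # hypothetical a, b
```

Feeding a raw SVM score straight into `sigmoid` does not give a calibrated probability, which is why the extra parameters are needed.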


Nonlinear boundaries

Kernel SVM handles nonlinear boundaries naturally via the kernel trick.
Kernel logistic regression exists but is computationally heavier, since its solution is typically not sparse in the training points.


🌌 Linearly Non‑Separable Data → Kernel Trick

We map the data into a higher‑dimensional feature space:

\[x \rightarrow \phi(x)\]

The linear classifier becomes:

\[f(x) = w^T \phi(x) + b\]

Instead of computing $\phi$ explicitly, use a kernel that returns the inner product in feature space:

\[K(x_i, x_j) = \phi(x_i)^T \phi(x_j)\]

📐 Inner Products & Support Vectors

From the SVM dual solution (linear case):

\[w = \sum_{i=1}^{n} \alpha_i y_i x_i\]

Prediction:

\[f(x) = \text{sign} \left( \sum_{i=1}^{n} \alpha_i y_i x_i^T x + b \right)\]

Only points with $\alpha_i > 0$ contribute to the sum → these are the support vectors.


🧠 Kernelized SVM

Replace inner product:

\[x_i^T x_j \rightarrow K(x_i, x_j)\]

Decision function:

\[f(x) = \text{sign} \left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right)\]
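The decision function can be sketched directly from the formula. The support vectors, labels, $\alpha_i$ values, and $b$ below are made up for illustration (they are not the output of a real training run), and the linear kernel is passed in only as an example of the `kernel` argument.

```python
def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def decision_value(x, support_vectors, labels, alphas, b, kernel):
    """sum_i alpha_i * y_i * K(x_i, x) + b, summed over the support vectors only."""
    return sum(a * y * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b

def predict(x, support_vectors, labels, alphas, b, kernel):
    """Sign of the decision value, returned as a label in {-1, +1}."""
    return 1 if decision_value(x, support_vectors, labels, alphas, b, kernel) >= 0 else -1

# Hypothetical trained model: two support vectors, one per class.
support_vectors = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1]
alphas = [0.5, 0.5]
b = 0.0

print(predict((2.0, 3.0), support_vectors, labels, alphas, b, linear_kernel))
print(predict((-0.5, 1.0), support_vectors, labels, alphas, b, linear_kernel))
```

Swapping `linear_kernel` for any other kernel function changes the boundary without touching the rest of the code.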

📊 Polynomial Kernel

\[K(x_i, x_j) = (1 + x_i^T x_j)^d\]

Implicitly maps the data to a feature space containing all monomials of the input features up to degree $d$.
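For $d = 2$ and two input features, the implicit map can be written out and checked by hand. The sketch below assumes the explicit feature map $\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2)$ and verifies numerically that $\phi(x)^T \phi(z)$ equals $(1 + x^T z)^2$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(u, v, degree):
    """Polynomial kernel (1 + u.v)^degree."""
    return (1.0 + dot(u, v)) ** degree

def phi_degree2(x):
    """Explicit feature map for degree 2 in two dimensions (illustrative only)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

x = (1.0, 2.0)
z = (3.0, -1.0)

# The kernel works in 2 dimensions; the explicit map needs 6.
print(poly_kernel(x, z, 2))
print(dot(phi_degree2(x), phi_degree2(z)))
```

The two printed values agree up to floating-point rounding, and the gap between the input and feature dimensions grows quickly with $d$.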


🌐 Radial Basis Function (RBF / Gaussian Kernel)

\[K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)\]

Properties:

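  • Local similarity measure: equals 1 when $x_i = x_j$ and decays toward 0 with distance
  • Small $\gamma$ → wide kernel, smooth boundary
  • Large $\gamma$ → narrow kernel, complex / wiggly boundary (higher risk of overfitting)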
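The effect of $\gamma$ is visible from the kernel values alone. The points and $\gamma$ values in this sketch are arbitrary choices for illustration.

```python
import math

def rbf_kernel(u, v, gamma):
    """exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

anchor = (0.0, 0.0)
near = (0.5, 0.0)
far = (3.0, 0.0)

# With small gamma even the far point still looks similar;
# with large gamma only the immediate neighbourhood does.
for gamma in (0.1, 1.0, 10.0):
    print(f"gamma = {gamma:4.1f}  K(anchor, near) = {rbf_kernel(anchor, near, gamma):.4f}"
          f"  K(anchor, far) = {rbf_kernel(anchor, far, gamma):.4f}")
```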

🔍 Why Kernel Works (Derivation)

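From the dual optimization problem:

\[\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to} \quad 0 \le \alpha_i \le C, \ \sum_{i=1}^{n} \alpha_i y_i = 0\]

The data appear only through inner products, so each one can be replaced:

\[x_i^T x_j \rightarrow \phi(x_i)^T \phi(x_j) = K(x_i,x_j)\]

Thus SVM works in the high‑dimensional feature space without ever computing the mapping explicitly.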
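Because the dual objective touches the data only through the Gram matrix $K_{ij} = K(x_i, x_j)$, it can be evaluated without ever forming $\phi$. The sketch below uses a hypothetical two-point problem ($x_1 = 1$, $y_1 = +1$; $x_2 = -1$, $y_2 = -1$) with a linear kernel, for which $\alpha_1 = \alpha_2 = 0.5$ maximizes the dual (assuming $C \ge 0.5$).

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_matrix(points, kernel):
    """Matrix of all pairwise kernel values K(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]

def dual_objective(alphas, labels, gram):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j K_ij."""
    n = len(alphas)
    quadratic = sum(alphas[i] * alphas[j] * labels[i] * labels[j] * gram[i][j]
                    for i in range(n) for j in range(n))
    return sum(alphas) - 0.5 * quadratic

# Hypothetical two-point training set.
points = [(1.0,), (-1.0,)]
labels = [1, -1]
alphas = [0.5, 0.5]

gram = gram_matrix(points, linear_kernel)
print(dual_objective(alphas, labels, gram))
```

For this toy problem $w = \sum_i \alpha_i y_i x_i = 1$, so the primal value $\frac{1}{2}\|w\|^2$ is 0.5, the same as the dual value. Passing a different kernel to `gram_matrix` is the only change needed to kernelize the computation.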


👥 SVM for Multi‑Class (K > 2)

SVM is inherently binary → we need a strategy for combining binary classifiers.

One‑vs‑Rest (OvR)

Train K classifiers, each separating class $k$ from all the others:

\[f_k(x) = w_k^T x + b_k\]

Prediction:

\[\hat{y} = \arg\max_k f_k(x)\]
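The argmax rule is short enough to sketch in plain Python. The three weight vectors and biases below are hypothetical stand-ins for already-trained linear classifiers.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ovr_predict(x, weights, biases):
    """Return the index k with the largest score f_k(x) = w_k . x + b_k."""
    scores = [dot(w, x) + b for w, b in zip(weights, biases)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical parameters for K = 3 one-vs-rest classifiers.
weights = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
biases = [0.0, 0.0, 0.0]

for x in [(2.0, 1.0), (0.0, 3.0), (-2.0, -2.0)]:
    print(x, "->", ovr_predict(x, weights, biases))
```

The comparison only makes sense if the K scores are on comparable scales, which is one known weakness of this scheme.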

One‑vs‑One (OvO)

Train one classifier for each pair of classes, for a total of:

\[\binom{K}{2} = \frac{K(K-1)}{2}\]

Final prediction → majority voting.

Preferred when K is small.
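A sketch of the voting step. The pairwise "classifiers" here are hypothetical nearest-center rules on 1-D inputs, standing in for trained binary SVMs; only the vote counting is the point.

```python
from collections import Counter
from itertools import combinations

# Hypothetical class centers on the real line, used to fake pairwise classifiers.
centers = {0: 0.0, 1: 5.0, 2: 10.0}

def make_pair_classifier(a, b):
    """Binary classifier for classes a vs b: pick whichever center is closer."""
    def classify(x):
        return a if abs(x - centers[a]) <= abs(x - centers[b]) else b
    return classify

# One classifier per unordered pair of classes: K(K-1)/2 in total.
pairwise = {(a, b): make_pair_classifier(a, b)
            for a, b in combinations(sorted(centers), 2)}

def ovo_predict(x, pairwise):
    """Each pairwise classifier casts one vote; the most-voted class wins."""
    votes = Counter(classify(x) for classify in pairwise.values())
    return votes.most_common(1)[0][0]

for x in (0.5, 4.0, 9.0):
    print(x, "->", ovo_predict(x, pairwise))
```

Ties are possible in general, so a real implementation needs an explicit tie-breaking rule; this sketch simply takes the first class returned by `most_common`.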


🔬 Geometric Interpretation

The kernel implicitly transforms the space so that a nonlinear boundary in the original space becomes a linear hyperplane in feature space.


📊 Logistic vs SVM — Mathematical Difference

| Method | Objective |
|---|---|
| Logistic | $\sum \log(1 + e^{-y f(x)})$ |
| SVM | $\frac{1}{2}\lVert w \rVert^2 + C\sum \max(0, 1 - y f(x))$ |

Logistic → probabilistic model
SVM → margin maximization


🚀 Final Summary

  • SVM maximizes margin → strong geometric classifier
  • Logistic gives probabilities → probabilistic interpretation
  • Kernel trick enables nonlinear boundaries
  • RBF kernel → local similarity
  • Multi‑class via OvR / OvO
  • Support vectors alone define the decision boundary

End of Notes ✨

This post is licensed under CC BY 4.0 by the author.