Post

Naive Bayes

Naive Bayes

๐Ÿค– Naive Bayes

Complete mathematical reconstruction with intuition and practical fixes
Covers Bayes rule, independence assumption, smoothing, and numerical stability.


1. ๐Ÿง  Bayes Optimal Classifier

Bayes risk:

\[f^* = \arg\min_f \mathbb{E}_{p(x,y)}[\mathcal{L}(y, f(x))]\]

Equivalent decision rule:

\[f^*(x) = \arg\max_y p(y \mid x)\]

Using Bayes rule:

\[p(y \mid x) = \frac{p(x \mid y)p(y)}{p(x)}\]

Since \(p(x)\) is constant across classes:

\[f^*(x) = \arg\max_y p(x \mid y)p(y)\]

2. ๐Ÿ“ฆ Naive Bayes Assumption

For feature vector:

\[x = (x_1,\dots,x_d)\]

Naive Bayes assumes conditional independence:

\[p(x \mid y) = \prod_{j=1}^{d} p(x_j \mid y)\]

Meaning:

  • Features are independent given the class
  • Simplifies high-dimensional probability estimation

3. ๐Ÿค” Why Naive Bayes Works (Even When False)

Independence assumption is usually incorrect, but classification still works well because:

  • True joint distribution $$p(xy)$$ may be wrong
  • But class ranking often preserved
\[p(x|y_1)p(y_1) > p(x|y_2)p(y_2)\]

โœ” Correct ordering โ†’ correct classification

This explains Naive Bayesโ€™ surprisingly strong performance.


4. ๐ŸŒค Weather Dataset Example

Goal:

\[P(\text{PlayGolf}=yes \mid x) \quad vs \quad P(\text{PlayGolf}=no \mid x)\]

Given:

\[x = (\text{Outlook=sunny, Temp=Hot, Humidity=High, Windy=False})\]

4.1 Prior Probabilities

From dataset:

\[P(yes) = \frac{9}{14}, \quad P(no) = \frac{5}{14}\]

4.2 Likelihood via Naive Factorization

For class = yes

\[P(x|yes) = P(\text{sunny}|yes) P(\text{Hot}|yes) P(\text{High}|yes) P(\text{False}|yes)\]

From counts:

\[= \frac{3}{9} \cdot \frac{2}{9} \cdot \frac{3}{9} \cdot \frac{6}{9} \approx 0.016\]

Posterior (unnormalized):

\[P(yes|x) \propto 0.016 \cdot \frac{9}{14}\]

For class = no

\[P(x|no) = P(\text{sunny}|no) P(\text{Hot}|no) P(\text{High}|no) P(\text{False}|no)\]

From counts:

\[= \frac{2}{5} \cdot \frac{2}{5} \cdot \frac{4}{5} \cdot \frac{2}{5} \approx 0.051\]

Posterior:

\[P(no|x) \propto 0.051 \cdot \frac{5}{14}\]

4.3 Prediction

Since:

\[P(no|x) > P(yes|x)\]

Prediction:

\[\hat y = no\]

5. โš ๏ธ Zero-Frequency Problem

If any conditional probability is zero:

\[p(x_j|y) = 0\]

Then:

\[p(x|y) = 0\]

โ†’ catastrophic failure (class impossible).


5.1 Laplace Smoothing (Additive Smoothing)

Add small constant \(\epsilon\) (usually 1):

\[p(x_j|y) = \frac{count(x_j,y) + \epsilon}{count(y) + \epsilon K}\]

Where:

  • \(K\) = number of possible values of feature \(x_j\)

โœ” Prevents zero probabilities
โœ” Stabilizes estimation with small data


6. ๐Ÿ”ข Numerical Stability โ€” Log Trick

Probability product becomes extremely small:

\[p(x|y) = \prod_j p(x_j|y)\]

Instead compute in log space:

\[\log p(x|y) = \sum_j \log p(x_j|y)\]

Decision rule unchanged:

\[\arg\max_y \log p(x|y) + \log p(y)\]

โœ” Avoids floating-point underflow
โœ” Faster computation


7. ๐Ÿš€ Key Insights

  • Naive Bayes assumes conditional independence (rarely true)
  • Still works because class ranking preserved
  • Laplace smoothing fixes zero-probability problem
  • Log-probability prevents numerical underflow
  • Extremely fast & scalable
  • Performs surprisingly well in:
    • Text classification
    • Spam detection
    • Document categorization
    • High-dimensional sparse data

This post is licensed under CC BY 4.0 by the author.