Softmax and Cross Entropy Loss

Posted Feb 9, 2026

🟨 Limitations of Linear Classifier & Probabilistic Interpretation

From raw scores to probabilities, and why loss functions like cross-entropy naturally arise.
Full derivations included, no steps skipped ✍️

1️⃣ What is the Meaning of the Score?

A linear classifier outputs scores: $s_k = w_k^T x + b_k$

$s_k$ is not a probability
Only relative differences between scores matter
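
A minimal sketch of the score computation in plain Python. The weights, biases, and input are made-up values, and `linear_scores` is a hypothetical helper name used only for illustration.

```python
def linear_scores(W, b, x):
    """Return one score per class: s_k = w_k^T x + b_k."""
    return [
        sum(w_j * x_j for w_j, x_j in zip(w_k, x)) + b_k
        for w_k, b_k in zip(W, b)
    ]

# Hypothetical parameters: 3 classes, 2 input features.
W = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
b = [0.0, 1.0, -0.5]
x = [2.0, 1.0]

scores = linear_scores(W, b, x)
print(scores)                 # raw scores: unbounded, not normalized
print(scores[1] - scores[0])  # the gap between two classes is what carries meaning
```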

Binary case (2 classes)

Let: $s = s_1 - s_2$

If $s_1 > s_2$ → class 1 more likely
Larger gap $s_1 - s_2$ → higher confidence

But: ❌ Scores are unbounded
❌ Not normalized
❌ Not probabilities

👉 We need a mapping to [0,1]

2️⃣ Sigmoid Function (Binary Classification)

We want a function such that:

Large positive $s$ → close to 1
Large negative $s$ → close to 0
$s=0$ → 0.5

Sigmoid definition

$\sigma(s) = \frac{1}{1 + e^{-s}}$

Apply to score difference

$p(y=c_1 \mid x) = \frac{1}{1 + e^{-(s_1 - s_2)}}$

\[p(y=c_2 \mid x) = \frac{1}{1 + e^{-(s_2 - s_1)}}\]

✅ Now we have probabilities
❌ But only works for 2 classes
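
As a rough sketch, the two class probabilities can be computed from the score difference as follows. The scores are made-up values, and the branch on the sign of `s` is an added numerical-stability detail, not something the formulas above require.

```python
import math

def sigmoid(s):
    """Logistic function 1 / (1 + exp(-s)), written to avoid overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    z = math.exp(s)  # s < 0 here, so exp(s) stays small
    return z / (1.0 + z)

s1, s2 = 2.0, 0.5  # hypothetical scores for class 1 and class 2

p1 = sigmoid(s1 - s2)  # p(y = c_1 | x)
p2 = sigmoid(s2 - s1)  # p(y = c_2 | x)

print(p1, p2, p1 + p2)  # the two probabilities sum to 1 (up to rounding)
```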

3️⃣ Softmax Classifier (Multi-class)

For $C > 2$ classes, we generalize the sigmoid to the softmax function.

Softmax definition

$p(y = c_k \mid x) = \frac{e^{s_k}}{\sum_{j=1}^{C} e^{s_j}}$
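
To see that this really generalizes the sigmoid, set $C = 2$ and divide the numerator and denominator by $e^{s_1}$:

\[p(y=c_1 \mid x) = \frac{e^{s_1}}{e^{s_1} + e^{s_2}} = \frac{1}{1 + e^{s_2 - s_1}} = \frac{1}{1 + e^{-(s_1 - s_2)}} = \sigma(s_1 - s_2)\]

So for two classes, softmax gives exactly the sigmoid of the score difference.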

Check probability properties

$0 < p(y=c_k\mid x) < 1$ (strict for finite scores, since $e^{s_k} > 0$)
$\sum_{k=1}^C p(y=c_k\mid x) = 1$

✅ Valid probability distribution
✅ Differentiable (important for training)
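
A minimal softmax sketch in plain Python, with made-up scores. Subtracting the maximum score before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result, because only score differences matter.

```python
import math

def softmax(scores):
    """Map raw scores to probabilities e^{s_k} / sum_j e^{s_j}."""
    m = max(scores)                            # shift so the largest exponent is 0
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.5, -3.5])              # hypothetical scores for 3 classes
print(probs, sum(probs))

# Adding the same constant to every score leaves the probabilities unchanged.
shifted = softmax([1001.0, 1002.5, 996.5])
print(shifted)
```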

4️⃣ Probabilistic Setting for Classification

Ground truth labels

Binary

$y \in \{0,1\}$

Multi-class

One-hot vector: $y = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} \in \{0,1\}^C$

with: $\sum_{k=1}^C y_k = 1$
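
A small sketch of building such a one-hot vector. The helper name `one_hot` is hypothetical, and the code uses 0-based class indices, whereas the formulas in this post count classes from 1.

```python
def one_hot(class_index, num_classes):
    """Return a length-C list with a single 1 at class_index (0-based)."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index out of range")
    return [1 if k == class_index else 0 for k in range(num_classes)]

y = one_hot(2, 4)
print(y, sum(y))  # exactly one entry is 1, so the entries sum to 1
```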

Model output

Predicted probabilities: $\hat{y}_k = p(y=c_k \mid x)$

Now we compare:

True distribution $y$
Predicted distribution $\hat{y}$

👉 This naturally leads to distribution distance measures

5️⃣ Cross Entropy Loss (General Form)

From information theory:

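Cross entropy $H(P,Q) = -\sum_i P(i)\log Q(i)$ measures the expected number of bits (with $\log_2$; nats with the natural log) needed to encode events drawn from the true distribution $P$ using a code optimized for the predicted distribution $Q$.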

General multi-class loss

$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{C} y_{ik} \log(\hat{y}_{ik})$
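
The double sum can be sketched directly in Python. The labels and predicted probabilities are made-up values, and skipping terms with $y_{ik} = 0$ is an added convention (treating $0 \cdot \log 0$ as 0), assumed here for safety.

```python
import math

def cross_entropy(Y, Y_hat):
    """Average over N samples of -sum_k y_ik * log(y_hat_ik)."""
    total = 0.0
    for y, y_hat in zip(Y, Y_hat):
        # Only terms with y_k > 0 contribute to the inner sum.
        total += -sum(y_k * math.log(p_k) for y_k, p_k in zip(y, y_hat) if y_k > 0)
    return total / len(Y)

# Hypothetical batch: N = 2 samples, C = 3 classes.
Y = [[1, 0, 0], [0, 1, 0]]
Y_hat = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]

loss = cross_entropy(Y, Y_hat)
print(loss)
```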

6️⃣ Cross Entropy Simplification (Key Insight)

Because one-hot labels satisfy: $y_{ik} = 1 \text{ for only one } k$

All other terms vanish:

\[\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log(\hat{y}_{i,T_i})\]

Where:

$T_i$ = true class index of sample $i$

👉 Interpretation:

Loss = average over the $N$ samples of −log(predicted probability of the true class)
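
As a quick check of the simplification, the sketch below computes the loss from true class indices only and compares it with the full one-hot double sum. All values are made up, and `T` holds 0-based indices (an assumption of this example).

```python
import math

# Hypothetical predictions for N = 3 samples, C = 3 classes.
Y_hat = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
T = [0, 1, 2]  # true class index of each sample (0-based)

# Simplified form: look up only the probability of the true class.
simplified = -sum(math.log(y_hat[t]) for y_hat, t in zip(Y_hat, T)) / len(T)

# Full form: one-hot labels and the complete double sum.
Y = [[1 if k == t else 0 for k in range(3)] for t in T]
full = -sum(
    y_k * math.log(p_k)
    for y, y_hat in zip(Y, Y_hat)
    for y_k, p_k in zip(y, y_hat)
) / len(T)

print(simplified, full)  # the two agree up to rounding
```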

7️⃣ Binary Cross Entropy (Special Case)

For binary classification:

\[\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \Big]\]

Behavior

If $\hat{y} \to 1$ and $y=1$ → loss → 0
If $\hat{y} \to 0$ and $y=1$ → loss → ∞

🚨 Confident wrong predictions are heavily penalized
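
A minimal sketch of binary cross entropy that illustrates this behavior. The `eps` clipping is an added safeguard against `log(0)` and is not part of the formula above; the predictions are made-up values.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean of -[y log(p) + (1 - y) log(1 - p)] over the samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

confident_right = binary_cross_entropy([1], [0.99])
confident_wrong = binary_cross_entropy([1], [0.01])

print(confident_right)  # small loss
print(confident_wrong)  # much larger loss
```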

8️⃣ Why Logarithm?

Consider $-\log(x)$ for $x \in (0,1]$:

$x \to 1$ → $-\log(x) \to 0$
$x \to 0$ → $-\log(x) \to \infty$

This gives:

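Smooth gradients ($-\log(x)$ is differentiable on $(0,1]$, with slope $-1/x$)
Strong penalty for confident mistakes (the slope magnitude $1/x$ grows without bound as $x \to 0$)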
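
To get a feel for the numbers, this short sketch prints the penalty for a few made-up probabilities assigned to the true class. `penalty` is a hypothetical helper name.

```python
import math

def penalty(p):
    """Loss contribution -log(p) for true-class probability p in (0, 1]."""
    return -math.log(p)

for p in (0.99, 0.5, 0.1, 0.01):
    print(f"p = {p:<5} -log(p) = {penalty(p):.3f}")
```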

9️⃣ KL Divergence (Connection)

Definition

For two distributions $P$ (true) and $Q$ (predicted):

\[D_{KL}(P \| Q) = \sum_i P(i)\log\frac{P(i)}{Q(i)}\]

(or continuous form)

\[D_{KL}(P \| Q) = \int P(x)\log\frac{P(x)}{Q(x)}dx\]

Properties

Non-negative: $D_{KL}(P\|Q) \ge 0$, with equality if and only if $P = Q$
Asymmetric: in general $D_{KL}(P\|Q) \ne D_{KL}(Q\|P)$
Not a true metric (no triangle inequality)
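
A small sketch of the discrete form that checks these properties on two made-up distributions. It assumes $Q(i) > 0$ wherever $P(i) > 0$, and terms with $P(i) = 0$ are treated as contributing 0.

```python
import math

def kl_divergence(P, Q):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)) for discrete P, Q."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

# Hypothetical distributions over 3 outcomes.
P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]

print(kl_divergence(P, Q))  # non-negative
print(kl_divergence(Q, P))  # a different value: KL is asymmetric
print(kl_divergence(P, P))  # zero when the distributions match
```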

🔗 Cross Entropy vs KL Divergence

Cross entropy: $H(P,Q) = H(P) + D_{KL}(P\|Q)$
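
This identity follows in one step from the definitions, by adding and subtracting $\sum_i P(i)\log P(i)$:

\[H(P,Q) = -\sum_i P(i)\log Q(i) = -\sum_i P(i)\log P(i) + \sum_i P(i)\log\frac{P(i)}{Q(i)} = H(P) + D_{KL}(P\|Q)\]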

Since $H(P)$ depends only on the ground truth and not on the model (for one-hot labels, $H(P) = 0$):

\[\arg\min_Q H(P,Q) = \arg\min_Q D_{KL}(P\|Q)\]

✅ Minimizing cross entropy = minimizing KL divergence
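
The relationship can be checked numerically. This is a sketch with made-up distributions; the function names are hypothetical, and natural logarithms are used throughout.

```python
import math

def entropy(P):
    """H(P) = -sum_i P(i) log P(i), skipping zero-probability terms."""
    return -sum(p * math.log(p) for p in P if p > 0)

def cross_entropy(P, Q):
    """H(P, Q) = -sum_i P(i) log Q(i)."""
    return -sum(p * math.log(q) for p, q in zip(P, Q) if p > 0)

def kl_divergence(P, Q):
    """D_KL(P || Q) = sum_i P(i) log(P(i) / Q(i))."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.7, 0.2, 0.1]  # hypothetical true distribution
Q = [0.5, 0.3, 0.2]  # hypothetical predicted distribution

lhs = cross_entropy(P, Q)
rhs = entropy(P) + kl_divergence(P, Q)
print(lhs, rhs)  # equal up to rounding

# With a one-hot label, H(P) = 0, so cross entropy and KL coincide.
label = [0, 1, 0]
print(cross_entropy(label, Q), kl_divergence(label, Q))
```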

🔚 Final Takeaways

Linear classifier outputs scores, not probabilities
Sigmoid → binary probabilities
Softmax → multi-class probabilities
Cross entropy naturally measures mismatch between:
- true distribution
- predicted distribution
Training = minimizing KL divergence in disguise ✨

🔥 This forms the backbone of modern deep learning classifiers.

This post is licensed under CC BY 4.0 by the author.