Entry of Classification, Linear Regression vs Logistic Regression

🎯 Classification: Linear Regression vs Logistic Regression


📌 Overview

In classification problems, our goal is to predict a discrete class label rather than a continuous value.

This post explains:

  • Why Linear Regression is not suitable for classification ❌
  • Why Logistic Regression is the correct probabilistic model ✅
  • The mathematical intuition behind both approaches 📐

🧠 Classification Basics

🔎 What is Classification?

Given:

  • Feature vector: \(\mathbf{x}\)
  • Class label: \(y \in C\), where \(C\) is a finite set of classes

We learn a function:

\[f(\mathbf{x}) \in C\]

Often, we prefer probabilities instead of hard labels:

\[P(y = c \mid \mathbf{x})\]

💡 Why Probabilities Matter

Probabilistic outputs enable:

  • 🎯 Risk-based decision making
  • ⚙️ Threshold tuning
  • 💰 Cost-sensitive classification
  • 📊 Confidence estimation

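Example: in fraud detection, a predicted probability is more useful than a bare yes/no decision, because the threshold for flagging a transaction can be set according to the cost of each kind of error.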
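To make the fraud example concrete, here is a minimal sketch of cost-sensitive thresholding. The function names and the cost figures are illustrative assumptions, not part of the original post; the rule comes from comparing the expected cost of flagging a transaction with the expected cost of letting it through.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which flagging has the lower expected cost.

    Flag when p * cost_fn > (1 - p) * cost_fp,
    which rearranges to p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def flag_transaction(p_fraud: float, cost_fp: float, cost_fn: float) -> bool:
    """Turn a predicted fraud probability into a decision."""
    return p_fraud > cost_threshold(cost_fp, cost_fn)


# Hypothetical costs: a missed fraud is 9x as costly as a false alarm.
threshold = cost_threshold(cost_fp=1.0, cost_fn=9.0)
flagged = flag_transaction(0.2, cost_fp=1.0, cost_fn=9.0)      # 0.2 > 0.1
not_flagged = flag_transaction(0.2, cost_fp=1.0, cost_fn=1.0)  # 0.2 < 0.5
```

The same predicted probability of 0.2 leads to different decisions under different costs, which a hard 0/1 label could not support.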


⚠️ Can Linear Regression Be Used for Classification?

Binary Encoding

\[y = \begin{cases} 0 & \text{No} \\ 1 & \text{Yes} \end{cases}\]

One might try fitting a linear regression to \(y\) and thresholding its prediction:

\[\hat{y} > 0.5 \Rightarrow \text{Class 1}\]

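👍 Why It Sometimes Works

Because, for a 0/1-coded label, the conditional mean is the class-1 probability:

\[\mathbb{E}[y \mid \mathbf{x}] = P(y=1 \mid \mathbf{x})\]

Linear regression estimates this conditional mean, so it can approximate probabilities in limited cases.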


❌ Major Problems

1. Predictions outside [0,1]

Linear regression may produce:

  • Negative probabilities ❌
  • Probabilities > 1 ❌

Both are invalid as probabilities, because the fitted line is unbounded in \(\mathbf{x}\).
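The out-of-range problem is easy to reproduce. The sketch below fits ordinary least squares to a 0/1 label using only the standard library; the toy data and the helper name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy data (made up): the label flips from 0 to 1 as x grows.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

intercept, slope = fit_line(xs, ys)
predictions = [intercept + slope * x for x in xs]

# The fitted values at the two ends fall outside [0, 1].
low, high = predictions[0], predictions[-1]   # about -0.14 and 1.14
```

Thresholding these predictions at 0.5 still classifies the toy points correctly, but the values themselves cannot be read as probabilities.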


2. Multiclass Problem

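Numeric coding introduces a fake ordering:

\[1=\text{stroke},\quad 2=\text{overdose},\quad 3=\text{seizure}\]

This implies an ordering and equal spacing between classes that have no meaning (for example, that overdose lies "between" stroke and seizure), so the model is given an incorrect structure ❌.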
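A small sketch of the distance issue, comparing the integer coding above with one-hot (indicator) coding. One-hot coding is not discussed in the original post; it is brought in here as an assumption because it is the target encoding commonly paired with softmax models.

```python
import math

classes = ["stroke", "overdose", "seizure"]

# Integer coding from the text: 1, 2, 3.
integer_code = {name: i + 1 for i, name in enumerate(classes)}

# One-hot coding: one indicator per class, so no ordering is implied.
one_hot = {
    name: [1.0 if j == i else 0.0 for j in range(len(classes))]
    for i, name in enumerate(classes)
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Integer coding: seizure is "twice as far" from stroke as overdose is.
int_gap_near = abs(integer_code["stroke"] - integer_code["overdose"])  # 1
int_gap_far = abs(integer_code["stroke"] - integer_code["seizure"])    # 2

# One-hot coding: every pair of classes is the same distance apart.
hot_gap_near = euclidean(one_hot["stroke"], one_hot["overdose"])
hot_gap_far = euclidean(one_hot["stroke"], one_hot["seizure"])
```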


🚫 Conclusion

Linear regression is not suitable for classification.

Better alternatives:

  • Logistic Regression ✅
  • Softmax / Multinomial Logistic Regression
  • LDA / QDA (linear / quadratic discriminant analysis)
  • Other probabilistic classifiers

🔷 Logistic Regression

🎯 Goal

Model the probability:

\[P(y=1 \mid \mathbf{x})\]

We need a function mapping:

\[(-\infty,+\infty) \rightarrow (0,1)\]

📈 Sigmoid Function

\[\sigma(s) = \frac{1}{1+e^{-s}}\]

Properties:

  • Smooth & monotonic 📈
  • Valid probability output 🎯
  • Basis of Logistic Regression
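The sigmoid can be sketched in a few lines. The version below is one common numerically stable formulation (an implementation choice, not something the post prescribes): it avoids calling `math.exp` on a large positive argument, which would raise `OverflowError`.

```python
import math


def sigmoid(s: float) -> float:
    """Logistic function 1 / (1 + e^{-s}), written to avoid overflow."""
    if s >= 0:
        # e^{-s} <= 1 here, so the direct formula is safe.
        return 1.0 / (1.0 + math.exp(-s))
    # For negative s, use the equivalent form e^{s} / (1 + e^{s}).
    z = math.exp(s)
    return z / (1.0 + z)


midpoint = sigmoid(0.0)          # 0.5
far_left = sigmoid(-1000.0)      # underflows to 0.0 instead of raising
far_right = sigmoid(1000.0)      # saturates at 1.0
symmetric_sum = sigmoid(2.0) + sigmoid(-2.0)   # sigma(s) + sigma(-s) = 1
```

Mathematically the output lies strictly inside (0, 1); in floating point it can saturate to exactly 0.0 or 1.0 for extreme inputs.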

📊 Logistic Model

\[P(y=1 \mid \mathbf{x}) = \frac{1}{1+e^{-\boldsymbol{\beta}^T\mathbf{x}}}\]

\[P(y=0 \mid \mathbf{x}) = 1 - P(y=1 \mid \mathbf{x})\]
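As a minimal sketch of the two formulas above, the function below returns both class probabilities for one input. The name `predict_proba`, the coefficient values, and the convention that `x[0] = 1.0` carries the intercept are assumptions made for illustration.

```python
import math


def predict_proba(beta, x):
    """Return (P(y=0 | x), P(y=1 | x)) for the logistic model."""
    s = sum(b * v for b, v in zip(beta, x))   # linear score beta^T x
    p1 = 1.0 / (1.0 + math.exp(-s))           # adequate for moderate |s|
    return 1.0 - p1, p1


# Hypothetical coefficients; x[0] = 1.0 plays the role of the intercept term.
beta = [-3.0, 1.0]
p0, p1 = predict_proba(beta, [1.0, 3.0])        # score is 0, so p1 = 0.5
p0_hi, p1_hi = predict_proba(beta, [1.0, 5.0])  # score is 2, so p1 is about 0.88
```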

🔍 Interpretation

Log-Odds (Logit)

\[\log \frac{p}{1-p} = \boldsymbol{\beta}^T \mathbf{x}\]

Meaning:

  • Logistic regression is linear in log-odds, not probability.

Coefficient Meaning

If:

\[\hat{\beta}_1 > 0\]

then increasing the corresponding feature \(x_1\), with the other features held fixed, increases the probability of class 1.

Each +1 unit change in \(x_1\) changes the log-odds by \(\hat{\beta}_1\).
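As a worked illustration of the coefficient interpretation: exponentiating the logit gives the odds, so a one-unit increase in \(x_1\) (other features held fixed) multiplies the odds by \(e^{\beta_1}\). The value \(\beta_1 = 0.7\) below is a made-up number used only for illustration.

\[\frac{p}{1-p} = e^{\boldsymbol{\beta}^T \mathbf{x}}\]

\[\frac{\text{odds}(x_1 + 1)}{\text{odds}(x_1)} = \frac{e^{\boldsymbol{\beta}^T \mathbf{x} + \beta_1}}{e^{\boldsymbol{\beta}^T \mathbf{x}}} = e^{\beta_1}\]

\[\beta_1 = 0.7 \;\Rightarrow\; e^{0.7} \approx 2.01\]

So the odds roughly double, while the change in the probability itself depends on where \(p\) starts.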


📐 Maximum Likelihood Estimation (MLE)

Likelihood

For binary outcome:

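\[P(y_i=1 \mid \mathbf{x}_i)=\sigma(\boldsymbol{\beta}^T\mathbf{x}_i)\]

Dataset likelihood, writing \(p_i = \sigma(\boldsymbol{\beta}^T\mathbf{x}_i)\) and treating the observations as independent: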

\[\mathcal{L}(\boldsymbol{\beta})=\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i}\]

Log-Likelihood

\[\ell(\boldsymbol{\beta})=\sum_{i=1}^n \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right]\]

Maximizing \(\ell(\boldsymbol{\beta})\) is equivalent to minimizing the cross-entropy loss \(-\ell(\boldsymbol{\beta})\).
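The log-likelihood and its maximization can be sketched in plain Python. This is a minimal illustration, not a production fitting routine: it uses simple gradient ascent (libraries typically use Newton / IRLS or quasi-Newton solvers instead), and the toy data, learning rate, step count, and helper names are all assumptions.

```python
import math


def sigmoid(s):
    # Numerically stable logistic function.
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    z = math.exp(s)
    return z / (1.0 + z)


def softplus(s):
    # log(1 + e^s), computed without overflow.
    return max(s, 0.0) + math.log1p(math.exp(-abs(s)))


def dot(beta, x):
    return sum(b * v for b, v in zip(beta, x))


def log_likelihood(beta, X, y):
    # Uses log p_i = -softplus(-s_i) and log(1 - p_i) = -softplus(s_i).
    total = 0.0
    for x_i, y_i in zip(X, y):
        s = dot(beta, x_i)
        total += -y_i * softplus(-s) - (1 - y_i) * softplus(s)
    return total


def fit(X, y, lr=0.1, steps=2000):
    # Gradient ascent on the average log-likelihood; the gradient of the
    # log-likelihood is sum_i (y_i - p_i) * x_i.
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for x_i, y_i in zip(X, y):
            residual = y_i - sigmoid(dot(beta, x_i))
            for j in range(d):
                grad[j] += residual * x_i[j]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Toy, non-separable data (made up); the first column of 1.0 is the intercept.
X = [[1.0, float(v)] for v in [1, 2, 3, 4, 5, 6]]
y = [0, 0, 1, 0, 1, 1]

beta_hat = fit(X, y)
ll_start = log_likelihood([0.0, 0.0], X, y)   # 6 * log(0.5)
ll_fit = log_likelihood(beta_hat, X, y)
cross_entropy = -ll_fit / len(y)              # mean cross-entropy loss
```

The toy labels are deliberately not perfectly separable; with separable data the likelihood has no finite maximizer and the coefficients would keep growing.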


🧮 Linear vs Logistic: Key Differences

| Feature | Linear Regression | Logistic Regression |
| --- | --- | --- |
| Output | Real value | Probability in (0, 1) |
| Task | Regression | Classification |
| Valid Probabilities | ❌ | ✅ |
| Decision Boundary | Linear | Linear (in log-odds) |
| Optimization | Least Squares | MLE / Cross-Entropy |
| Multiclass Extension | ❌ | Softmax |

🚀 Key Takeaways

  • Linear regression can approximate classification but does not produce valid probabilities ❌
  • Logistic regression models class probabilities with the sigmoid ✅
  • The model is linear in log-odds 📐
  • Coefficients are estimated by Maximum Likelihood 📊
  • It is a foundation of modern classification methods 🧠

📚 (Optional Extensions)

  • 🔹 Regularization (L1 / L2)
  • 🔹 Multiclass Softmax Regression
  • 🔹 Decision boundary geometry
  • 🔹 Gradient / Hessian derivation
  • 🔹 Newton / IRLS optimization
This post is licensed under CC BY 4.0 by the author.