Semantic Segmentation
Semantic Segmentation, FCN, Deconvolution & U-Net
1. Computer Vision Tasks Overview
Classification
- Input: Image
- Output: Single label
- No spatial information
Object Detection
- Output: Bounding boxes + labels
- Multiple objects possible
Semantic Segmentation
- Output: Pixel-wise class labels
- Every pixel classified
Instance Segmentation
- Pixel-wise + instance separation
2. Semantic Segmentation Problem Definition
Given image:
\[X \in \mathbb{R}^{H \times W \times C}\]
Goal:
\[Y \in \{1,\dots,K\}^{H \times W}\]
Each pixel is assigned a class probability, where \(a_k(i,j)\) is the network score (logit) for class \(k\) at pixel \((i,j)\):
\[p_k(i,j) = \frac{\exp(a_k(i,j))}{\sum_{c=1}^{K}\exp(a_c(i,j))}\]
Loss (pixel-wise cross entropy), where \(y_k(i,j)\) is the one-hot ground truth:
\[\mathcal{L} = -\sum_{i,j}\sum_{k} y_{k}(i,j)\log p_k(i,j)\]
3. Naive Patch-Based Approach
For each pixel:
- Extract patch centered at (i,j)
- Classify independently
Problem:
- Heavy overlap
- Redundant convolution
- Inference cost = O(HW × patch cost)
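To make the redundancy concrete, here is a rough sketch that counts multiply-accumulate operations (MACs) for a single k×k convolution on a single channel. The function names, the 33-pixel patch size, and the one-layer cost model are illustrative assumptions, not taken from any particular paper or library.

```python
def patchwise_macs(H, W, patch, k):
    """MACs when a k x k convolution is run separately on one patch per pixel."""
    per_patch = (patch - k + 1) ** 2 * k * k   # unpadded convolution inside each patch
    return H * W * per_patch

def full_image_macs(H, W, k):
    """MACs when the same k x k convolution is run once over the whole image."""
    return H * W * k * k                       # one k x k window per output pixel

H, W, patch, k = 256, 256, 33, 3
ratio = patchwise_macs(H, W, patch, k) / full_image_macs(H, W, k)
print(ratio)  # 961.0, i.e. (33 - 3 + 1) ** 2
```

Under these assumptions the patch-based route repeats the same work roughly a thousand times, because neighbouring patches share almost all of their pixels.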
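4. Fully Convolutional Networks (FCN)
Instead of classifying each patch separately, apply the CNN once over the full image so that overlapping computation is shared.
Let the feature map be:
\[F \in \mathbb{R}^{H' \times W' \times D}\]
A final 1×1 convolution produces the class score map:
\[S \in \mathbb{R}^{H' \times W' \times K}\]
For pixel-wise prediction we need:
\[H' = H, \quad W' = W\]
But a CNN reduces resolution through strided convolution and pooling, so the score map has to be upsampled back to the input size.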
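A 1×1 convolution is just the same linear map applied independently at every spatial position. The NumPy sketch below illustrates this on random data; the shapes, the `conv1x1` helper, and the `(D, K)` weight layout are assumptions chosen for illustration.

```python
import numpy as np

def conv1x1(F, weight, bias):
    """Apply a 1x1 convolution: one linear map shared by all spatial positions.

    F:      (H', W', D) feature map
    weight: (D, K) matrix, bias: (K,)
    returns (H', W', K) score map
    """
    return F @ weight + bias

rng = np.random.default_rng(0)
Hp, Wp, D, K = 8, 8, 16, 5
F = rng.normal(size=(Hp, Wp, D))
weight = rng.normal(size=(D, K))
bias = np.zeros(K)

S = conv1x1(F, weight, bias)
labels = S.argmax(axis=-1)          # predicted class per feature-map position
print(S.shape, labels.shape)        # (8, 8, 5) (8, 8)
```

The score map has the resolution of the feature map, not of the input image, which is why an upsampling step is still needed.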
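5. Downsampling Mathematics
1D convolution with stride \(s\), input length \(N\), and kernel length \(K\) (in this section \(K\) is the kernel length, not the number of classes):
\[y[n] = \sum_{k=0}^{K-1} x[n\cdot s + k] w[k]\]
Output size (no padding):
\[\left\lfloor \frac{N - K}{s} \right\rfloor + 1\]
Stride > 1 → resolution reduction.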
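The formula can be checked directly with a small NumPy sketch of the strided, unpadded 1D convolution. The helper name `strided_conv1d` and the toy input are assumptions for illustration.

```python
import numpy as np

def strided_conv1d(x, w, s=1):
    """y[n] = sum_k x[n*s + k] * w[k], with no padding."""
    N, K = len(x), len(w)
    out_len = (N - K) // s + 1
    return np.array([np.dot(x[n * s : n * s + K], w) for n in range(out_len)])

x = np.arange(8, dtype=float)      # [0, 1, ..., 7]
w = np.array([1.0, 1.0])           # sums adjacent pairs

y1 = strided_conv1d(x, w, s=1)     # length (8 - 2) // 1 + 1 = 7
y2 = strided_conv1d(x, w, s=2)     # length (8 - 2) // 2 + 1 = 4
print(len(y1), len(y2))            # 7 4
print(y2.tolist())                 # [1.0, 5.0, 9.0, 13.0]
```

With stride 2 the output is roughly half as long as the input, which is the resolution loss the decoder later has to undo.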
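6. Transposed Convolution (Deconvolution)
A convolution can be written as a matrix product:
\[y = C x\]
where \(C\) is the sparse matrix built from the kernel \(w\) and \(x\) is the flattened input.
The transposed convolution applies the transpose of that matrix:
\[\tilde{x} = C^T y\]
Thus a transposed convolution maps from the output shape of the convolution back to its input shape. It restores the shape, not the original values, so it is not a true inverse.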
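1D Example (Stride 2)
Input:
\[[a, b]\]
Kernel:
\[[x, y, z]\]
Output:
\[[ax, ay, az + bx, by, bz]\]
Each input element scales a copy of the kernel, and consecutive copies are shifted by the stride (2). Overlapping contributions are summed.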
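The example can be reproduced numerically in two ways: by scatter-adding scaled kernel copies, and by building the matrix of the forward strided convolution and multiplying by its transpose. This is a minimal sketch; the helper names and the concrete numbers (a = 1, b = 2, kernel [3, 4, 5]) are assumptions for illustration.

```python
import numpy as np

def conv_matrix(w, n_in, s):
    """Matrix C such that C @ x is the strided, unpadded 1D convolution of x with w."""
    k = len(w)
    n_out = (n_in - k) // s + 1
    C = np.zeros((n_out, n_in))
    for n in range(n_out):
        C[n, n * s : n * s + k] = w
    return C

def transposed_conv1d(y, w, s):
    """Scatter-add form: each input element adds a scaled kernel copy at offset n*s."""
    k = len(w)
    out = np.zeros((len(y) - 1) * s + k)
    for n, v in enumerate(y):
        out[n * s : n * s + k] += v * w
    return out

a, b = 1.0, 2.0
w = np.array([3.0, 4.0, 5.0])             # stands in for the kernel [x, y, z]
out = transposed_conv1d(np.array([a, b]), w, s=2)
print(out.tolist())                       # [3.0, 4.0, 11.0, 8.0, 10.0]

C = conv_matrix(w, n_in=5, s=2)           # forward convolution maps length 5 -> length 2
via_transpose = C.T @ np.array([a, b])    # transposed convolution maps length 2 -> length 5
print(np.allclose(out, via_transpose))    # True
```

The middle entry 11 is a·z + b·x = 5 + 6, the one position where the two kernel copies overlap.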
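7. Output Size of Transposed Convolution
Given:
- Input size I
- Kernel size k
- Stride s
- Padding p
Output:
\[O = (I - 1)s - 2p + k\]
With a suitable choice of k, s, and p this restores the spatial resolution lost during downsampling; for example, k = 2, s = 2, p = 0 gives O = 2I.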
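A quick sanity check of the size formula, paired with the forward convolution size so the round trip is visible. The function names are hypothetical, and the padded form of the forward formula is the standard generalization rather than something stated in these notes.

```python
def conv_out_size(n, k, s=1, p=0):
    """Output length of a convolution with padding p on each side (floor division)."""
    return (n + 2 * p - k) // s + 1

def transposed_conv_out_size(i, k, s=1, p=0):
    """O = (I - 1) * s - 2p + k."""
    return (i - 1) * s - 2 * p + k

n = 64
down = conv_out_size(n, k=2, s=2)               # 32
up = transposed_conv_out_size(down, k=2, s=2)   # 64
print(down, up)                                 # 32 64

print(transposed_conv_out_size(2, k=3, s=2))    # 5: a 2-element input with a 3-tap kernel
```

Because of the floor in the forward formula, several input sizes can map to the same downsampled size, so the round trip is exact only when the sizes divide evenly.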
8. Encoder-Decoder Architecture
Encoder (reduces spatial resolution):
- Convolution
- Pooling
- Strided convolution
Decoder (recovers spatial resolution):
- Transposed convolution
- Upsampling
9. U-Net Architecture
Originally proposed for biomedical image segmentation.
Structure:
Encoder (contracting path):
\[H \to H/2 \to H/4 \to H/8\]
Decoder (expanding path):
\[H/8 \to H/4 \to H/2 \to H\]
Skip connections:
\[F_{decoder} = \text{Concat}(F_{encoder}, F_{upsampled})\]
Concatenating encoder features of the same resolution preserves high-resolution boundary information that downsampling would otherwise lose.
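The skip connection is a channel-wise concatenation of two feature maps with the same spatial size. The sketch below uses nearest-neighbour upsampling as a stand-in for the learned transposed convolution, and assumes the spatial sizes already match (no cropping); the helper names and channel counts are illustrative assumptions.

```python
import numpy as np

def upsample_nearest(F, factor=2):
    """Nearest-neighbour upsampling of an (H, W, D) feature map."""
    return F.repeat(factor, axis=0).repeat(factor, axis=1)

def skip_concat(F_encoder, F_lower):
    """Upsample the lower-resolution decoder input and concatenate along channels."""
    F_upsampled = upsample_nearest(F_lower, factor=2)
    assert F_upsampled.shape[:2] == F_encoder.shape[:2], "spatial sizes must match"
    return np.concatenate([F_encoder, F_upsampled], axis=-1)

F_encoder = np.ones((32, 32, 64))    # encoder features at resolution H/2
F_lower = np.zeros((16, 16, 128))    # decoder features at resolution H/4
F_decoder = skip_concat(F_encoder, F_lower)
print(F_decoder.shape)               # (32, 32, 192)
```

Concatenation keeps the encoder features untouched, so the following convolutions can decide how to mix fine detail with coarse context.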
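10. U-Net Loss Function
Pixel-wise softmax:
\[p_k(x) = \frac{\exp(a_k(x))}{\sum_{c=1}^K \exp(a_c(x))}\]
Weighted cross entropy:
\[\mathcal{L} = -\sum_{x \in \Omega} w(x)\log p_{l(x)}(x)\]
Weight map:
\[w(x) = w_c(x) + w_0 \exp\left(-\frac{(d_1(x)+d_2(x))^2}{2\sigma^2}\right)\]
Where:
- \(l(x)\): true label of pixel \(x\)
- \(w_c(x)\): class-balancing weight
- \(d_1(x)\): distance to the border of the nearest object
- \(d_2(x)\): distance to the border of the second-nearest object
- The exponential term is largest in the narrow gaps between touching objects, which encourages the network to learn the separating borders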
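A minimal NumPy sketch of the weight map and the weighted cross entropy, assuming the distance maps d1 and d2 are already available (computing them from a label mask is outside this sketch). The function names are hypothetical; the defaults w0 = 10 and sigma = 5 follow the values reported in the original U-Net paper and should be treated as tunable.

```python
import numpy as np

def weight_map(w_c, d1, d2, w0=10.0, sigma=5.0):
    """w(x) = w_c(x) + w0 * exp(-(d1(x) + d2(x))^2 / (2 sigma^2))."""
    return w_c + w0 * np.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))

def weighted_cross_entropy(logits, labels, w):
    """L = -sum_x w(x) * log p_{l(x)}(x) for logits (H, W, K) and integer labels (H, W)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)       # numerical stability
    log_p = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    log_p_true = np.take_along_axis(log_p, labels[..., None], axis=-1)[..., 0]
    return -(w * log_p_true).sum()

H, W, K = 4, 4, 3
logits = np.zeros((H, W, K))                 # uniform prediction: p_k = 1/K everywhere
labels = np.zeros((H, W), dtype=int)
w_c = np.ones((H, W))
d1 = np.full((H, W), 100.0)                  # far from any border...
d2 = np.full((H, W), 100.0)
d1[1, 1] = d2[1, 1] = 0.0                    # ...except one pixel squeezed between two objects

w = weight_map(w_c, d1, d2)
loss = weighted_cross_entropy(logits, labels, w)
print(w[1, 1], w[0, 0])                      # 11.0 1.0
```

In this toy case the single border pixel carries eleven times the weight of any other pixel, so a mistake there costs eleven times as much. Setting w to all ones recovers the plain pixel-wise cross entropy.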
Why a Weight Map?
Without it:
- Borders underrepresented
- Multiple instances merge
With weighting:
- Strong gradient on boundaries
- Better instance separation
Final Summary
| Model | Key Idea | Strength |
|-------|----------|----------|
| Patch-based | Local classification | Simple |
| FCN | Full image conv | Efficient |
| Deconv | Learnable upsampling | Resolution recovery |
| U-Net | Skip connections | Sharp boundaries |
Key Insights
- Strided convolution and pooling reduce resolution.
- Transposed convolution restores resolution by applying the transpose of the convolution matrix.
- Skip connections recover fine details.
- Weighted loss improves boundary precision.