Instance Segmentation

🎯 Instance Segmentation & Mask R-CNN: Complete Mathematical Notes


1️⃣ Instance Segmentation Problem

Instance Segmentation =

\[\text{Object Detection} + \text{Semantic Segmentation}\]

Goal:

For each object instance:

  • Class label
  • Bounding box
  • Pixel-wise mask

Output:

\[\{ (c_i, b_i, M_i) \}_{i=1}^{N}\]

Where:

  • \(c_i\): class
  • \(b_i \in \mathbb{R}^4\): bounding box
  • \(M_i \in \{0,1\}^{m \times m}\): binary mask
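As a rough illustration of this output format, the sketch below models one instance as a small Python record. The names `Instance`, `label`, `box`, `mask`, and `is_valid` are hypothetical, chosen only for this example, and the box is assumed to be in (x1, y1, x2, y2) form, which the notes do not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    label: int                               # c_i: class index
    box: Tuple[float, float, float, float]   # b_i: assumed (x1, y1, x2, y2)
    mask: List[List[int]]                    # M_i: m x m grid of 0/1 values


def is_valid(inst: Instance) -> bool:
    """Check that the mask is a non-empty square grid of 0/1 values."""
    m = len(inst.mask)
    return m > 0 and all(
        len(row) == m and all(v in (0, 1) for v in row) for row in inst.mask
    )


# One detected instance with a toy 2 x 2 mask.
example = Instance(label=3, box=(10.0, 20.0, 50.0, 80.0), mask=[[0, 1], [1, 1]])
```

A full prediction for an image would then be a list of \(N\) such records.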

2️⃣ Mask R-CNN Overview

📄 Paper: https://arxiv.org/abs/1703.06870

Mask R-CNN extends Faster R-CNN by adding a mask prediction branch.

Architecture:

Image → Backbone → RPN → RoIAlign → Parallel Heads:

  • Classification
  • Bounding box regression
  • Mask prediction

3️⃣ RoIAlign (Critical Component)

Problem with RoIPool:

  • Quantization of coordinates
  • Misalignment

RoIAlign solution:

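For each real-valued sampling location \((x, y)\) (no rounding):

Use bilinear interpolation over the four nearest grid points:

\[f(x,y) = \sum_{i,j} w_{ij} f(x_i, y_j)\]

Where the weights decay linearly with distance:

\[w_{ij} = \max(0, 1-|x-x_i|)\,\max(0, 1-|y-y_j|)\]

This preserves the spatial correspondence between RoI features and input pixels.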
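The interpolation step can be sketched in plain Python as follows. This is a minimal illustration of sampling a single point, not the full RoIAlign operator (the paper samples several points per bin and aggregates them with max or average). The function name `bilinear_sample`, the nested-list feature map, and the clamping at the border are assumptions made for this example.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample a 2D feature map (list of rows) at a real-valued (x, y).

    x indexes columns and y indexes rows; coordinates are clamped to the map.
    """
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    # Each of the four neighbours is weighted by its closeness to (x, y).
    return (feat[y0][x0] * (1 - dx) * (1 - dy)
            + feat[y0][x1] * dx * (1 - dy)
            + feat[y1][x0] * (1 - dx) * dy
            + feat[y1][x1] * dx * dy)


feat = [[0.0, 10.0],
        [20.0, 30.0]]
center = bilinear_sample(feat, 0.5, 0.5)   # blends all four cells equally
quantized = feat[int(0.5)][int(0.5)]       # RoIPool-style truncation to cell (0, 0)
```

On this toy map the interpolated value at the centre is the mean of the four cells, while truncating the coordinate first simply returns the top-left cell, which is the misalignment RoIAlign avoids.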


4️⃣ Mask Head

For each RoI:

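Output: one \(m \times m\) mask per class:

\[\hat{M} \in \mathbb{R}^{K \times m \times m}\]

Where \(K\) is the number of classes and \(m\) is the mask resolution.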

Important:

  • Fully convolutional
  • No FC layers
  • Maintains spatial info

During inference:

Select mask corresponding to predicted class.
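A minimal sketch of this selection step, assuming the mask head returns raw scores (logits) as a \(K \times m \times m\) nested list and that the mask is binarized at 0.5 after a sigmoid. The function name `select_mask` and its parameters are hypothetical, and resizing the mask to the RoI size is left out.

```python
import math


def select_mask(mask_logits, predicted_class, threshold=0.5):
    """Return the binarized m x m mask for the predicted class.

    mask_logits is assumed to be a K x m x m nested list of raw scores.
    """
    channel = mask_logits[predicted_class]   # keep one of the K channels
    return [[1 if 1.0 / (1.0 + math.exp(-z)) >= threshold else 0 for z in row]
            for row in channel]


logits = [
    [[-2.0, 2.0], [2.0, -2.0]],   # class 0
    [[3.0, 3.0], [-3.0, -3.0]],   # class 1
]
mask = select_mask(logits, predicted_class=1)
```

The other \(K-1\) channels are simply discarded, so the classification head alone decides which mask is used.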


5️⃣ Multi-Task Loss

Total loss:

\[\mathcal{L} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{box} + \lambda' \mathcal{L}_{mask}\]

The original paper uses the unweighted sum (\(\lambda = \lambda' = 1\)).

🔷 Classification Loss

Cross-entropy:

\[\mathcal{L}_{cls} = -\sum_i y_i \log \hat{y}_i\]
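With a one-hot target \(y\), the sum reduces to \(-\log \hat{y}_t\) for the true class \(t\). The sketch below computes this for a single RoI, assuming \(\hat{y}\) comes from a softmax over raw class scores. The function name `cross_entropy` is illustrative.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one RoI with an integer class target."""
    peak = max(logits)   # subtract the max score for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]   # equals -log(softmax(logits)[target])


uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)   # log(4) for 4 equal scores
confident = cross_entropy([10.0, 0.0], target=0)          # close to zero
```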

🔷 Bounding Box Loss

Smooth L1:

\[\mathcal{L}_{box} = \sum_i \text{SmoothL1}(b_i - \hat{b}_i)\]

Where:

\[\text{SmoothL1}(x) = \begin{cases} 0.5x^2 & |x|<1 \\ |x|-0.5 & \text{otherwise} \end{cases}\]
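The piecewise definition translates directly into code. This is a small sketch using plain floats; `smooth_l1` and `box_loss` are names chosen for this example, and the four values stand for whatever box parameterization the regression targets use.

```python
def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def box_loss(target, pred):
    """Sum Smooth L1 over the four box coordinates of one RoI."""
    return sum(smooth_l1(t - p) for t, p in zip(target, pred))


small = smooth_l1(0.5)    # quadratic branch: 0.5 * 0.25
large = smooth_l1(-2.0)   # linear branch: 2 - 0.5
total = box_loss([0.0, 0.0, 0.0, 0.0], [0.5, 2.0, 0.0, -1.0])
```

The linear branch keeps gradients bounded for large errors, which makes the loss less sensitive to outlier boxes than a pure squared error.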

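🔷 Mask Loss

Average binary cross-entropy per pixel, applied to the ground-truth class channel only:

\[\mathcal{L}_{mask} = -\sum_{i}\frac{1}{m^2}\sum_{p} \left[ M_i(p)\log \hat{M}_i^{(k_i)}(p) + (1-M_i(p))\log (1-\hat{M}_i^{(k_i)}(p)) \right]\]

Where \(k_i\) is the ground-truth class of RoI \(i\) and \(\hat{M}_i^{(k_i)}(p)\) is the per-pixel sigmoid output of channel \(k_i\). The other \(K-1\) channels do not contribute to the loss.

Only computed for positive RoIs.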
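For a single positive RoI, the mask loss can be sketched as below. It assumes the mask head emits raw scores that pass through a per-pixel sigmoid, and it averages over the \(m \times m\) pixels of the ground-truth class channel, as the paper describes. The function name `mask_loss` and the `eps` clamp are choices made for this example.

```python
import math


def mask_loss(mask_logits, gt_class, gt_mask, eps=1e-12):
    """Average per-pixel BCE on the ground-truth class channel only.

    mask_logits: K x m x m nested list of raw scores (assumed layout).
    gt_mask:     m x m nested list of 0/1 values.
    """
    channel = mask_logits[gt_class]   # the other K - 1 channels are ignored
    total, count = 0.0, 0
    for logit_row, gt_row in zip(channel, gt_mask):
        for z, t in zip(logit_row, gt_row):
            p = 1.0 / (1.0 + math.exp(-z))      # per-pixel sigmoid
            p = min(max(p, eps), 1.0 - eps)     # avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


gt = [[1, 0], [0, 1]]
logits = [
    [[100.0, 100.0], [100.0, 100.0]],   # class 0: ignored when gt_class = 1
    [[0.0, 0.0], [0.0, 0.0]],           # class 1: every pixel predicts 0.5
]
loss = mask_loss(logits, gt_class=1, gt_mask=gt)
```

Because only the ground-truth channel is used, the extreme scores in channel 0 have no effect on the result.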


6️⃣ Why Separate Mask Head?

Mask branch:

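  • Predicts a binary mask for each class independently, so classes do not compete within a mask
  • Leaves the class decision to the classification head
  • Adds per-pixel supervision
  • Improves detection performance through multi-task training

Key insight:

Decoupling mask and class prediction (a per-pixel sigmoid with a binary loss, rather than a per-pixel softmax across classes) improves mask accuracy.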
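To make the decoupling concrete, the toy comparison below scores one pixel across \(K = 3\) class channels in two ways: with independent sigmoids, and with a softmax in which the classes compete. The scores and function names are made up for illustration.

```python
import math


def per_class_sigmoid(logits):
    """Score each class channel on its own."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Score class channels jointly; outputs sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


pixel_logits = [2.0, 2.0, -1.0]                 # one pixel, K = 3 channels
independent = per_class_sigmoid(pixel_logits)   # channels 0 and 1 both near 0.88
competing = softmax(pixel_logits)               # channels 0 and 1 split the mass
```

With the sigmoid, two classes can both claim the pixel with high confidence, and the classification head resolves which mask is kept. With the softmax, neither class reaches 0.5 because they share one probability budget.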


7️⃣ Mathematical Insight

Mask prediction is:

\[\text{Dense classification inside each RoI}\]

Different from semantic segmentation:

  • Works on region proposals
  • Instance-specific

8️⃣ Advantages

✔ High accuracy
✔ Clean multi-task learning
✔ Minimal modification to Faster R-CNN


9️⃣ Limitations

❌ Two-stage → slower
❌ RoI-based → memory heavy
❌ Hard to scale to very dense scenes


🔥 Final Summary

Component    Purpose
-----------  --------------------------
RPN          Generate proposals
RoIAlign     Precise alignment
Class head   Category prediction
Box head     Localization
Mask head    Pixel-level instance mask


🧠 Key Takeaways

  1. RoIAlign fixes quantization error.
  2. Mask head is fully convolutional.
  3. Multi-task loss jointly optimizes detection + mask.
  4. Mask R-CNN is still a strong baseline.
This post is licensed under CC BY 4.0 by the author.