Instance Segmentation
Instance Segmentation & Mask R-CNN: Complete Mathematical Notes
1️⃣ Instance Segmentation Problem
Instance Segmentation =
\[\text{Object Detection} + \text{Semantic Segmentation}\]
Goal:
For each object instance, predict:
- Class label
- Bounding box
- Pixel-wise mask
Output:
\[\{ (c_i, b_i, M_i) \}_{i=1}^{N}\]
Where:
- \( c_i \): class
- \( b_i \in \mathbb{R}^4 \): bounding box
- \( M_i \in \{0,1\}^{m \times m} \): binary mask
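The output triple (c_i, b_i, M_i) can be sketched as a small container. This is only an illustration: the names `Instance` and `make_instance` are hypothetical, and the (x1, y1, x2, y2) box layout is an assumption, since the notes do not fix a box parameterization.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Instance:
    """One predicted instance (c_i, b_i, M_i); field names are illustrative."""
    label: int         # c_i: class index
    box: np.ndarray    # b_i: shape (4,), assumed here to be (x1, y1, x2, y2)
    mask: np.ndarray   # M_i: shape (m, m), values in {0, 1}


def make_instance(label, box, mask):
    """Validate shapes and value ranges before building an Instance."""
    box = np.asarray(box, dtype=np.float32)
    mask = np.asarray(mask, dtype=np.uint8)
    if box.shape != (4,):
        raise ValueError("box must have 4 coordinates")
    if mask.ndim != 2 or mask.shape[0] != mask.shape[1]:
        raise ValueError("mask must be m x m")
    if not np.isin(mask, (0, 1)).all():
        raise ValueError("mask must be binary")
    return Instance(label, box, mask)


# A detection result is then a list of N such instances.
instances = [make_instance(3, [10, 20, 50, 80], np.eye(28))]
```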
2️⃣ Mask R-CNN Overview
Paper: https://arxiv.org/abs/1703.06870
Mask R-CNN extends Faster R-CNN by adding a mask prediction branch in parallel with the existing classification and box regression branches.
Architecture:
Image → Backbone → RPN → RoIAlign → Parallel Heads:
- Classification
- Bounding box regression
- Mask prediction
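3️⃣ RoIAlign (Critical Component)
Problem with RoIPool:
- Quantization: RoI coordinates and bin boundaries are rounded to the feature-map grid
- Misalignment: the pooled features no longer line up with the RoI, which hurts pixel-accurate masks
RoIAlign solution:
Skip the rounding. For each sampling location (x, y) inside an RoI bin, use bilinear interpolation over the four nearest grid points:
\[f(x,y) = \sum_{i,j} w_{ij} f(x_i, y_j)\]
Where the weights depend on distance, \( w_{ij} = \max(0, 1-|x-x_i|)\,\max(0, 1-|y-y_j|) \), and the sampled values in each bin are aggregated by max or average.
Preserves spatial correspondence.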
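The bilinear sampling step can be sketched in NumPy as follows. This is a minimal sketch for a single-channel feature map, not a full RoIAlign layer: the helper name `bilinear_sample` is hypothetical, and out-of-range locations are clamped to the border as a simplifying assumption. The last line snaps the same location to the grid by flooring, as a stand-in for RoIPool-style quantization.

```python
import numpy as np


def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map at a continuous location (x, y).

    The map is indexed as feature[row, col], i.e. feature[y, x].
    """
    h, w = feature.shape
    # Clamp so that the four neighbours stay inside the map (assumption).
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    # Each weight is the product of (1 - distance) along x and along y.
    return ((1 - dx) * (1 - dy) * feature[y0, x0]
            + dx * (1 - dy) * feature[y0, x1]
            + (1 - dx) * dy * feature[y1, x0]
            + dx * dy * feature[y1, x1])


# Toy map with feature[y, x] = 4*y + x, so exact values are easy to check.
feature = np.arange(16, dtype=np.float64).reshape(4, 4)
aligned = bilinear_sample(feature, 1.5, 2.25)                  # 10.5
quantized = feature[int(np.floor(2.25)), int(np.floor(1.5))]   # 9.0
```

On this toy map the interpolated value is 10.5, while the quantized lookup returns 9.0: the gap is the kind of misalignment that RoIAlign removes.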
4️⃣ Mask Head
For each RoI:
Output mask:
\[\hat{M} \in \mathbb{R}^{K \times m \times m}\]
That is, one \( m \times m \) mask for each of the \( K \) classes.
Important:
- Fully convolutional
- No FC layers
- Maintains spatial info
During inference:
Select mask corresponding to predicted class.
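The inference-time selection can be sketched like this, assuming the mask head returns raw logits of shape (K, m, m) for one RoI. The function name `select_instance_mask` is hypothetical; the 0.5 threshold follows the binarization used in the paper.

```python
import numpy as np


def select_instance_mask(mask_logits, predicted_class, threshold=0.5):
    """Pick the predicted class's channel and binarize it.

    mask_logits: array of shape (K, m, m), raw mask-head outputs for one RoI.
    """
    logits = mask_logits[predicted_class]       # (m, m) channel of class k
    probs = 1.0 / (1.0 + np.exp(-logits))       # per-pixel sigmoid
    return (probs >= threshold).astype(np.uint8)


rng = np.random.default_rng(0)
K, m = 3, 4
mask_logits = rng.normal(size=(K, m, m))        # stand-in for real outputs
binary_mask = select_instance_mask(mask_logits, predicted_class=1)
```

The other K-1 channels are simply ignored for this RoI; the resulting m x m mask would still need to be resized to the box before being placed in the image.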
5️⃣ Multi-Task Loss
Total loss:
\[\mathcal{L} = \mathcal{L}_{cls} + \lambda \mathcal{L}_{box} + \lambda' \mathcal{L}_{mask}\]
The paper uses the unweighted sum, i.e. \( \lambda = \lambda' = 1 \).
Classification Loss
Cross-entropy:
\[\mathcal{L}_{cls} = -\sum_i y_i \log \hat{y}_i\]
Bounding Box Loss
Smooth L1:
\[\mathcal{L}_{box} = \sum_i \text{SmoothL1}(b_i - \hat{b}_i)\]
Where SmoothL1 is applied to each of the four box coordinates and the results are summed:
\[\text{SmoothL1}(x) = \begin{cases} 0.5x^2 & |x|<1 \\ |x|-0.5 & \text{otherwise} \end{cases}\]
Mask Loss
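Binary cross-entropy per pixel, averaged over the \( m \times m \) mask:
\[\mathcal{L}_{mask} = -\frac{1}{m^2}\sum_{i}\sum_{p} \left[ M_i(p)\log \hat{M}_i^{k_i}(p) + (1-M_i(p))\log (1-\hat{M}_i^{k_i}(p)) \right]\]
Where \( p \) ranges over pixels, \( k_i \) is the ground-truth class of RoI \( i \), and \( \hat{M}_i^{k_i} \) is the sigmoid output of that class's mask channel; the other \( K-1 \) channels do not contribute.
Only computed for positive RoIs.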
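The three loss terms for a single positive RoI can be sketched in NumPy as below. All function names are hypothetical, and inputs are assumed to be already-normalized probabilities (softmax for classes, sigmoid for masks). The mask term averages over pixels, following the paper's description; replace `.mean()` with `.sum()` for a summed form. The default weights of 1.0 are also an assumption of this sketch.

```python
import numpy as np


def cross_entropy(class_probs, true_class):
    """-sum_i y_i log(y_hat_i) for a one-hot label reduces to one term."""
    return -np.log(class_probs[true_class])


def smooth_l1(x):
    """Elementwise Smooth L1: quadratic near zero, linear elsewhere."""
    x = np.abs(np.asarray(x, dtype=np.float64))
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)


def box_loss(box_target, box_pred):
    """Sum of Smooth L1 over the four box coordinates."""
    diff = np.asarray(box_target, dtype=np.float64) - np.asarray(box_pred)
    return smooth_l1(diff).sum()


def mask_loss(mask_target, mask_probs, true_class, eps=1e-7):
    """Per-pixel binary cross-entropy on the ground-truth class channel only.

    mask_target: (m, m) binary mask; mask_probs: (K, m, m) sigmoid outputs.
    """
    p = np.clip(mask_probs[true_class], eps, 1.0 - eps)  # avoid log(0)
    bce = -(mask_target * np.log(p) + (1 - mask_target) * np.log(1 - p))
    return bce.mean()


def total_loss(l_cls, l_box, l_mask, lam=1.0, lam_mask=1.0):
    """L = L_cls + lam * L_box + lam_mask * L_mask."""
    return l_cls + lam * l_box + lam_mask * l_mask


l_cls = cross_entropy(np.array([0.25, 0.75]), true_class=1)
l_box = box_loss([0.0, 0.0, 0.0, 0.0], [0.5, -2.0, 0.0, 1.0])   # 2.125
l_mask = mask_loss(np.ones((2, 2)), np.full((3, 2, 2), 0.5), true_class=1)
loss = total_loss(l_cls, l_box, l_mask)
```

In the box example the four coordinate errors contribute 0.125, 1.5, 0 and 0.5, which shows both branches of Smooth L1 at work.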
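6️⃣ Why Separate Mask Head?
Mask branch:
- Independent from classification: one binary mask per class with a per-pixel sigmoid, so classes do not compete as they would under a softmax
- Per-pixel supervision
- Improves detection performance when trained jointly with the box and class heads
Key insight:
Decoupling mask and class prediction improves accuracy: the class head picks the category, and the mask head only separates object from background for that category.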
7️⃣ Mathematical Insight
Mask prediction is:
\[\text{Dense classification inside each RoI}\]
Different from semantic segmentation:
- Works on region proposals
- Instance-specific
8️⃣ Advantages
- High accuracy
- Clean multi-task learning
- Minimal modification to Faster R-CNN
9️⃣ Limitations
- Two-stage → slower
- RoI-based → memory heavy
- Hard to scale to very dense scenes
Final Summary
Component   Purpose
----------  -------------------------
RPN         Generate proposals
RoIAlign    Precise alignment
Class head  Category prediction
Box head    Localization
Mask head   Pixel-level instance mask
Key Takeaways
- RoIAlign fixes quantization error.
- Mask head is fully convolutional.
- Multi-task loss jointly optimizes detection + mask.
- Mask R-CNN is still a strong baseline.