Image Segmentation
How many zebras?
From Sandlot Science
Why is context important?
What is this?
Why is this a car?
Because it's on the road!
Why is this a road?
Context is very important!
Same problem in real scenes
From images to objects
What defines an object?
Subjective problem, but has been well-studied
Proximity, similarity, continuation
Extracting objects
How could we do this automatically (or at least semi-automatically)?
Semi-automatic binary segmentation
Simplifying the user interaction
Grabcut [Rother et al., SIGGRAPH 2004]
Auto segmentation: toy example
black pixels
white pixels
These intensities define the three groups.
We could label every pixel in the image according to which of these primary intensities it is.
i.e., segment the image based on the intensity feature. But real images aren't quite so simple.
[Figure: input image and its histogram (pixel count per intensity)]
Now how to determine the three main intensities that define our groups?
We need to cluster.
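One way to find the main intensities is 1D k-means on the pixel values. The sketch below is a minimal illustration, not a method prescribed by the slides: the tiny `image`, the initial centers, and the helper names `kmeans_1d` and `label_pixels` are all assumptions made for the example.

```python
def kmeans_1d(values, centers, iterations=20):
    """Cluster scalar intensities with Lloyd's k-means, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each value joins the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group (unchanged if empty).
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def label_pixels(image, centers):
    """Replace every pixel by the index of its nearest center."""
    return [[min(range(len(centers)), key=lambda k: abs(p - centers[k])) for p in row]
            for row in image]


# Hypothetical noisy image with three intensity groups (dark, gray, bright).
image = [
    [2, 5, 120, 131],
    [0, 8, 125, 128],
    [250, 247, 126, 3],
    [255, 252, 249, 6],
]
pixels = [p for row in image for p in row]
centers = kmeans_1d(pixels, centers=[0, 128, 255])
labels = label_pixels(image, centers)
for row in labels:
    print(row)
```

The label map is the segmentation: pixels with the same label belong to the same intensity group. On real images the result depends on the number of clusters and on the initial centers.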
Semantic Segmentation
Classification + Localization
Object Detection
Instance Segmentation
Deep Learning
Single Object
Multiple Objects: DOG, DOG, CAT
Semantic Segmentation
Label each pixel in the image with a category label
Don't differentiate instances, only care about pixels
Semantic Segmentation Idea: Fully Convolutional
Design a network as a bunch of convolutional layers to make predictions for pixels all at once!
Scores: C x H x W
Predictions: H x W
Each channel is a class: C channels -> C classes
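Going from the C x H x W scores to the H x W predictions is a per-pixel argmax over the class channels. A minimal sketch, assuming scores are stored as nested lists indexed `scores[c][y][x]`; the numbers and the name `predict_labels` are made up for illustration.

```python
def predict_labels(scores):
    """scores[c][y][x] is the score of class c at pixel (y, x); return an H x W label map."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [[max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
            for y in range(height)]


# Hypothetical scores for C = 3 classes on a 2 x 2 image.
scores = [
    [[0.9, 0.1], [0.2, 0.3]],    # channel 0: scores for class 0
    [[0.05, 0.7], [0.1, 0.3]],   # channel 1: scores for class 1
    [[0.05, 0.2], [0.7, 0.4]],   # channel 2: scores for class 2
]
predictions = predict_labels(scores)
print(predictions)
```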
Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W
Long, Shelhamer, and Darrell, Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
Noh et al., Learning Deconvolution Network for Semantic Segmentation, ICCV 2015
Downsampling: pooling, strided convolution
Upsampling: ???
In-Network upsampling: Unpooling
Nearest Neighbor: Input 2 x 2, Output 4 x 4
Bed of Nails: Input 2 x 2, Output 4 x 4
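Both unpooling variants can be sketched in a few lines. Nearest neighbor copies each input value into its whole 2 x 2 output block; bed of nails writes it to one position of the block and fills the rest with zeros. Placing the value in the top-left corner is an assumption of this sketch, as are the function names and the sample input.

```python
def nearest_neighbor_upsample(x, factor=2):
    """Copy every input value into a factor x factor block of the output."""
    return [[x[i // factor][j // factor] for j in range(len(x[0]) * factor)]
            for i in range(len(x) * factor)]


def bed_of_nails_upsample(x, factor=2):
    """Put every input value in the top-left corner of its block; zeros elsewhere."""
    out = [[0] * (len(x[0]) * factor) for _ in range(len(x) * factor)]
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            out[i * factor][j * factor] = value
    return out


x = [[1, 2],
     [3, 4]]
nn = nearest_neighbor_upsample(x)
nails = bed_of_nails_upsample(x)
for row in nn:
    print(row)
for row in nails:
    print(row)
```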
In-Network upsampling: Max Unpooling
Max Pooling: remember which element was max! (Input: 4 x 4, Output: 2 x 2)
Rest of the network
Max Unpooling: use positions from pooling layer (Input: 2 x 2, Output: 4 x 4)
Corresponding pairs of downsampling and upsampling layers
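A small sketch of the pairing between a pooling layer and its unpooling layer: the pooling step records where each maximum came from, and the unpooling step writes its input back to exactly those positions. The 4 x 4 input, the stand-in values `y` for "the rest of the network", and the function names are assumptions made for illustration.

```python
def max_pool_2x2(x):
    """Return the pooled map and, for each output cell, the (row, col) of its max in x."""
    pooled, positions = [], []
    for i in range(0, len(x), 2):
        pooled_row, pos_row = [], []
        for j in range(0, len(x[0]), 2):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in (0, 1) for dj in (0, 1)]
            value, pos = max(window, key=lambda t: t[0])
            pooled_row.append(value)
            pos_row.append(pos)
        pooled.append(pooled_row)
        positions.append(pos_row)
    return pooled, positions


def max_unpool_2x2(y, positions, height, width):
    """Write each value of y to the remembered max position; all other outputs are zero."""
    out = [[0] * width for _ in range(height)]
    for i, row in enumerate(y):
        for j, value in enumerate(row):
            r, c = positions[i][j]
            out[r][c] = value
    return out


x = [[1, 2, 6, 3],
     [3, 5, 2, 1],
     [1, 2, 2, 1],
     [7, 3, 4, 8]]
pooled, positions = max_pool_2x2(x)

# Stand-in for the low-resolution output of the rest of the network.
y = [[1, 2],
     [3, 4]]
unpooled = max_unpool_2x2(y, positions, 4, 4)
for row in unpooled:
    print(row)
```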
3 x 3 transpose convolution, stride 2, pad 1
Input: 2 x 2
Output: 4 x 4
Input gives weight for filter
Sum where output overlaps
Filter moves 2 pixels in the output for every one pixel in the input
Stride gives ratio between movement in output and input
Output contains copies of the filter weighted by the input, summing where they overlap in the output
Transpose Convolution: 1D Example
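A minimal 1D sketch, assuming a 2-element input, a 3-tap filter, stride 2, and no padding (all values are made up so the overlap is easy to see). Each input element scales a copy of the filter, the copies are placed `stride` positions apart in the output, and overlapping positions are summed.

```python
def transpose_conv1d(inputs, kernel, stride=2):
    """1D transpose convolution without padding: output length is (n - 1) * stride + k."""
    out = [0] * ((len(inputs) - 1) * stride + len(kernel))
    for i, a in enumerate(inputs):
        for k, w in enumerate(kernel):
            # Copy of the filter weighted by input a, shifted by i * stride.
            out[i * stride + k] += a * w
    return out


result = transpose_conv1d([2, 3], [1, 10, 100], stride=2)
print(result)
```

The middle output element receives a contribution from both filter copies (200 + 3), which is the "sum where output overlaps" step. Padding, as used by deep learning frameworks, would additionally crop the borders of this output.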
Object Detection as Regression?
CAT: (x, y, w, h) -> 4 numbers
DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h) -> 12 numbers
DUCK: (x, y, w, h), DUCK: (x, y, w, h), ... -> many numbers!
Each image needs a different number of outputs!
Object Detection as Classification: Sliding Window
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
Dog? NO, Cat? NO, Background? YES
Dog? YES, Cat? NO, Background? NO
Dog? NO, Cat? YES, Background? NO
Problem: Need to apply the CNN to a huge number of locations and scales, which is very computationally expensive!
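To get a feel for the cost, the sketch below counts how many crops a sliding window produces. The image size, window sizes, and stride are hypothetical choices for illustration, and only square windows are counted; adding aspect ratios would multiply the total further.

```python
def count_windows(image_h, image_w, window_sizes, stride):
    """Count square sliding-window crops over all given window sizes."""
    total = 0
    for size in window_sizes:
        if size > image_h or size > image_w:
            continue  # window does not fit in the image
        rows = (image_h - size) // stride + 1
        cols = (image_w - size) // stride + 1
        total += rows * cols
    return total


# Hypothetical setup: 480 x 640 image, three window sizes, stride of 8 pixels.
n = count_windows(480, 640, window_sizes=[64, 128, 256], stride=8)
print(n)  # each of these crops would need its own CNN forward pass
```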
Region Proposals
Find image regions that are likely to contain objects
Relatively fast to run; e.g. Selective Search gives 1000 region proposals in a few seconds on CPU
Alexe et al., Measuring the objectness of image windows, TPAMI 2012
Uijlings et al., Selective Search for Object Recognition, IJCV 2013
Cheng et al., BING: Binarized normed gradients for objectness estimation at 300fps, CVPR 2014
Zitnick and Dollar, Edge boxes: Locating object proposals from edges, ECCV 2014
Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014
Detection without Proposals: YOLO
Input image 3 x H x W
Divide image into grid 7 x 7
Imagine a set of base boxes centered at each grid cell; here B = 3
Within each grid cell:
Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
Predict scores for each of C classes (including background as a class)
Output: 7 x 7 x (5 * B + C)
Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
Liu et al., SSD: Single-Shot MultiBox Detector, ECCV 2016
This parameterization fixes the output size
Each cell predicts:
For each bounding box:
4 coordinates (x, y, w, h)
1 confidence value
Some number of class probabilities
For Pascal VOC:
2 bounding boxes per cell, 20 classes
7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs
Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
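The fixed output size follows directly from the grid size, the number of boxes per cell, and the number of classes. A small sketch of the arithmetic; the function name `yolo_output_shape` is only illustrative.

```python
def yolo_output_shape(grid_size, boxes_per_cell, num_classes):
    """Each cell predicts 5 numbers per box (4 coordinates + 1 confidence) plus class scores."""
    return (grid_size, grid_size, 5 * boxes_per_cell + num_classes)


# Pascal VOC setting from the slide: 7 x 7 grid, 2 boxes per cell, 20 classes.
shape = yolo_output_shape(7, 2, 20)
total_outputs = shape[0] * shape[1] * shape[2]
print(shape, total_outputs)
```

The output size depends only on these three settings, not on how many objects the image contains, which is what makes a single regression network workable.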
Split the image into a grid
Each cell predicts boxes and confidences: P(Object)
Each cell also predicts a probability P(Class | Object)
Combine the box and class predictions
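One plausible reading of this step, sketched under the assumption that the two predictions are simply multiplied: the score of class c for a box is the box confidence P(Object) times the cell's P(Class c | Object). The values and the name `class_scores` are made up for illustration.

```python
def class_scores(box_confidences, class_probs):
    """Score of class c for box b = P(Object) of box b * P(Class c | Object) of the cell."""
    return [[confidence * p for p in class_probs] for confidence in box_confidences]


# Hypothetical cell with 2 boxes and 3 classes.
box_confidences = [0.5, 0.25]      # P(Object) for each box in the cell
class_probs = [0.5, 0.25, 0.25]    # P(Class | Object), shared by the boxes of the cell
scores = class_scores(box_confidences, class_probs)
print(scores)
```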
Finally do non-maximum suppression and threshold detections
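A minimal sketch of thresholding followed by greedy non-maximum suppression. Assumptions of this sketch: boxes are given as corners (x1, y1, x2, y2) rather than (x, y, w, h), the thresholds are arbitrary example values, and all boxes belong to the same class (in practice NMS is usually run per class).

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_threshold=0.3, iou_threshold=0.5):
    """Drop low-scoring boxes, then greedily keep the best box and suppress its overlaps."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with an already kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two overlapping boxes, one separate box, one weak box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = nms(boxes, scores)
print(kept)
```

Box 1 is suppressed because it overlaps heavily with the higher-scoring box 0, and box 3 is removed by the score threshold.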
It also generalizes well to new domains
