Image Segmentation
How many zebras?
From Sandlot Science
Why is context important?
What is this?
Why is this a car?
Because it's on the road!
Why is this a road?
Context is very important!
Same problem in real scenes
From images to objects
What defines an object?
Subjective problem, but has been well-studied
Proximity, similarity, continuation
Extracting objects
How could we do this automatically (or at least semi-automatically)?
Semi-automatic binary segmentation
Simplifying the user interaction
GrabCut [Rother et al., SIGGRAPH 2004]
Source: K. Grauman
Auto segmentation: toy example
black pixels
white pixels
These intensities define the three groups.
We could label every pixel in the image according to which of these primary intensities it is.
i.e., segment the image based on the intensity feature. But real images aren't quite so simple.
input image
pixel count
Source: K. Grauman
input image
Now how to determine the three main intensities that define our groups?
We need to cluster.
pixel count
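One way to find the three main intensities is to cluster the pixel values. The sketch below runs a plain 1D k-means (Lloyd's algorithm) and then labels every pixel by its nearest center. The image values, the initial centers, and the function names are made up for illustration; the slides do not prescribe a particular clustering method.

```python
def kmeans_1d(values, centers, iterations=20):
    """Lloyd's algorithm on scalar intensities; `centers` are initial guesses."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def label_pixels(image, centers):
    """Replace every pixel by the index of its nearest center."""
    return [[min(range(len(centers)), key=lambda k: abs(p - centers[k])) for p in row]
            for row in image]


# Hypothetical noisy image with dark, mid-gray, and bright regions.
image = [
    [5, 12, 130, 125],
    [8, 3, 120, 250],
    [10, 128, 245, 255],
]
pixels = [p for row in image for p in row]
centers = kmeans_1d(pixels, centers=[0, 128, 255])
labels = label_pixels(image, centers)
```

The result depends on the initial centers; here they are simply spread over the intensity range.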
Deep Learning
Semantic Segmentation: GRASS, CAT, TREE, SKY (pixel-level)
Classification + Localization: CAT (single object)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)
Segmentation + Classification
Fei-Fei Li & &
Serena 11 13
May 10, 2017
Semantic Segmentation
Label each pixel in the image with a category label
Don't differentiate instances; only care about pixels
Semantic Segmentation Idea: Fully Convolutional
Design a network as a bunch of convolutional layers to make predictions for pixels all at once!
Scores: C x H x W
Predictions: H x W
Convolutions
Each channel is a class: C channels -> C classes
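Going from the C x H x W scores to the H x W predictions amounts to taking, at every pixel, the class whose channel has the highest score. A minimal sketch with nested lists follows; the score values and the function name `predict_labels` are hypothetical.

```python
def predict_labels(scores):
    """scores[c][y][x] is the score of class c at pixel (y, x).

    Returns an H x W grid holding the highest-scoring class index per pixel.
    """
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [[max(range(num_classes), key=lambda c: scores[c][y][x])
             for x in range(width)]
            for y in range(height)]


# Hypothetical scores for C = 3 classes on a 2 x 2 image.
scores = [
    [[0.9, 0.1], [0.2, 0.3]],    # channel for class 0
    [[0.05, 0.7], [0.1, 0.3]],   # channel for class 1
    [[0.05, 0.2], [0.7, 0.4]],   # channel for class 2
]
predictions = predict_labels(scores)
```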
Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W
Long, Shelhamer, and Darrell, Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
Noh et al., Learning Deconvolution Network for Semantic Segmentation, ICCV 2015
Downsampling: Pooling, strided convolution
Upsampling: ???
In-Network upsampling: Unpooling
Nearest Neighbor: Input: 2 x 2, Output: 4 x 4
Bed of Nails: Input: 2 x 2, Output: 4 x 4
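The two fixed unpooling schemes can be sketched as follows. Nearest neighbor copies each input value into a whole block of the output. Bed of nails writes the value into the top-left corner of the block and fills the rest with zeros. The function names and the 2 x 2 input are illustrative assumptions.

```python
def nearest_neighbor_unpool(grid, factor=2):
    """Repeat every value `factor` times along both axes."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def bed_of_nails_unpool(grid, factor=2):
    """Put every value in the top-left corner of its block; zeros elsewhere."""
    out = []
    for row in grid:
        spiked = []
        for v in row:
            spiked.extend([v] + [0] * (factor - 1))
        out.append(spiked)
        out.extend([[0] * len(spiked) for _ in range(factor - 1)])
    return out


x = [[1, 2],
     [3, 4]]
nn = nearest_neighbor_unpool(x)
nails = bed_of_nails_unpool(x)
```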
In-Network upsampling: Max Unpooling
Max Pooling: remember which element was max! Input: 4 x 4, Output: 2 x 2
Rest of the network
Max Unpooling: use positions from pooling layer. Input: 2 x 2, Output: 4 x 4
Corresponding pairs of downsampling and upsampling layers
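A sketch of a paired max pooling and max unpooling step: the pooling layer records where each maximum came from, and the unpooling layer later writes its inputs back to those positions, leaving zeros elsewhere. The input values and function names are assumptions for illustration, and the code assumes the height and width are divisible by the window size.

```python
def max_pool_with_indices(grid, size=2):
    """Non-overlapping max pooling that also returns each max's (row, col)."""
    pooled, positions = [], []
    for top in range(0, len(grid), size):
        pooled_row, pos_row = [], []
        for left in range(0, len(grid[0]), size):
            window = [(grid[y][x], (y, x))
                      for y in range(top, top + size)
                      for x in range(left, left + size)]
            value, where = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            pos_row.append(where)
        pooled.append(pooled_row)
        positions.append(pos_row)
    return pooled, positions


def max_unpool(values, positions, height, width):
    """Write each value back to the position remembered by the pooling layer."""
    out = [[0] * width for _ in range(height)]
    for value_row, pos_row in zip(values, positions):
        for value, (y, x) in zip(value_row, pos_row):
            out[y][x] = value
    return out


x = [[1, 2, 6, 3],
     [3, 5, 2, 1],
     [1, 2, 2, 1],
     [7, 3, 4, 8]]
pooled, positions = max_pool_with_indices(x)

# Stand-in for the low-resolution output of "the rest of the network".
later_features = [[1, 2],
                  [3, 4]]
restored = max_unpool(later_features, positions, 4, 4)
```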
3 x 3 transpose convolution, stride 2 pad 1
Input: 2 x 2
Output: 4 x 4
Input gives weight for filter
Sum where output overlaps
Filter moves 2 pixels in the output for every one pixel in the input
Stride gives ratio between movement in output and input
Output contains copies of the filter weighted by the input, summed where they overlap in the output
Transpose Convolution: 1D Example
Adapted from
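The 1D case can be sketched in a few lines: each input value scales a copy of the filter, the copy is placed `stride` positions further along the output for each step in the input, and overlapping contributions are summed. This sketch uses no padding or cropping, so the output length is (n - 1) * stride + k for n inputs and a filter of length k. The numbers and the function name are illustrative assumptions.

```python
def transpose_conv_1d(inputs, kernel, stride=2):
    """1D transpose convolution without padding."""
    out_len = (len(inputs) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, weight in enumerate(inputs):
        start = i * stride              # filter moves `stride` output steps per input step
        for k, tap in enumerate(kernel):
            out[start + k] += weight * tap   # sum where copies overlap
    return out


# Input (a, b) = (2, 3) and filter (x, y, z) = (1, 10, 100).
out = transpose_conv_1d([2.0, 3.0], [1.0, 10.0, 100.0])
# Layout of the result: [a*x, a*y, a*z + b*x, b*y, b*z]
```

The middle element is the only one that receives contributions from both inputs, which is the overlap the slides describe.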
Object Detection as Regression?
CAT: (x, y, w, h) -> 4 numbers
DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h) -> 12 numbers
DUCK: (x, y, w, h), DUCK: (x, y, w, h), ... -> many numbers!
Each image needs a different number of outputs!
Object Detection as Classification: Sliding Window
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
Dog? NO, Cat? NO, Background? YES
Dog? YES, Cat? NO, Background? NO
Dog? NO, Cat? YES, Background? NO
Problem: Need to apply the CNN to a huge number of locations and scales; very computationally expensive!
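To get a feel for the cost, the sketch below enumerates the crops a sliding-window detector would have to classify. The image size, window sizes, and stride are arbitrary assumptions, and the CNN itself is left out; only the crop coordinates are generated.

```python
def sliding_windows(image_h, image_w, window_sizes, stride):
    """Yield (left, top, width, height) for every crop position and scale."""
    for win_h, win_w in window_sizes:
        for top in range(0, image_h - win_h + 1, stride):
            for left in range(0, image_w - win_w + 1, stride):
                yield (left, top, win_w, win_h)


# Hypothetical setup: a 480 x 640 image, three square window sizes, stride 8.
sizes = [(64, 64), (128, 128), (256, 256)]
windows = list(sliding_windows(480, 640, sizes, stride=8))
count = len(windows)  # one CNN forward pass would be needed per crop
```

Even this coarse setting produces thousands of crops, and the count grows quickly with finer strides, more scales, or extra aspect ratios.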
Region Proposals
Find image regions that are likely to contain objects
Relatively fast to run; e.g., Selective Search gives 1000 region proposals in a few seconds on CPU
Alexe et al., Measuring the objectness of image windows, TPAMI 2012
Uijlings et al., Selective Search for Object Recognition, IJCV 2013
Cheng et al., BING: Binarized normed gradients for objectness estimation at 300fps, CVPR 2014
Zitnick and Dollar, Edge boxes: Locating object proposals from edges, ECCV 2014
Alexe et al., CVPR 2010
Girshick et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014
Detection without Proposals: YOLO
Input image 3 x H x W
Divide image into grid 7 x 7
Imagine a set of base boxes centered at each grid cell. Here B = 3.
Within each grid cell:
Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
Predict scores for each of C classes (including background as a class)
Output: 7 x 7 x (5 * B + C)
Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
Liu et al., SSD: Single-Shot MultiBox Detector, ECCV 2016
This parameterization fixes the output size
Each cell predicts:
For each bounding box:
4 coordinates (x, y, w, h)
1 confidence value
Some number of class probabilities
For Pascal VOC:
2 bounding boxes / cell, 20 classes
7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs
Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016
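The output-size arithmetic can be checked with a short sketch. It also splits one cell's prediction vector into its boxes and class probabilities. The memory layout used here (all boxes first, then the class probabilities) is an assumption made for illustration, not something the slides specify, and the function names are hypothetical.

```python
def yolo_output_shape(grid_size, boxes_per_cell, num_classes):
    """Each cell holds 5 numbers per box plus one score per class."""
    depth = boxes_per_cell * 5 + num_classes
    return (grid_size, grid_size, depth)


def num_outputs(shape):
    total = 1
    for dim in shape:
        total *= dim
    return total


def split_cell(vector, boxes_per_cell, num_classes):
    """Split one cell's vector, ASSUMING boxes come first, then class scores."""
    boxes = [tuple(vector[5 * b: 5 * b + 5]) for b in range(boxes_per_cell)]
    class_probs = list(vector[5 * boxes_per_cell:])
    assert len(class_probs) == num_classes
    return boxes, class_probs


# Pascal VOC setting from the slide: 7 x 7 grid, 2 boxes per cell, 20 classes.
shape = yolo_output_shape(7, 2, 20)
total = num_outputs(shape)

# A dummy cell vector, just to show the split.
boxes, class_probs = split_cell(list(range(30)), boxes_per_cell=2, num_classes=20)
```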
Split the image into a grid
Each cell predicts boxes and confidences: P(Object)
Each cell also predicts a probability P(Class | Object)
Combine the box and class predictions
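Combining the two predictions amounts to multiplying the box confidence P(Object) by each conditional class probability P(Class | Object), which gives one score per class for that box. A minimal sketch with made-up numbers:

```python
def class_scores(box_confidence, class_probs):
    """P(Class) for a box = P(Object) * P(Class | Object), per class."""
    return [box_confidence * p for p in class_probs]


# Hypothetical values: the box is 50% likely to contain an object, and
# given an object, the cell favors class 1 over class 0.
scores = class_scores(0.5, [0.25, 0.75])
best_class = max(range(len(scores)), key=lambda c: scores[c])
```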
Finally do non-maximum suppression and threshold detections
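A sketch of the final step: low-scoring detections are dropped, then greedy non-maximum suppression keeps the best box and discards any remaining box that overlaps a kept one too much. Boxes here use corner format (x1, y1, x2, y2) rather than (x, y, w, h), the thresholds are arbitrary, and the suppression is class-agnostic; all of these are assumptions made to keep the example short.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.3, iou_threshold=0.5):
    """detections: list of (score, box). Returns kept detections, best first."""
    candidates = sorted((d for d in detections if d[0] >= score_threshold),
                        key=lambda d: d[0], reverse=True)
    kept = []
    for score, box in candidates:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(box, kept_box) <= iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [
    (0.9, (10, 10, 110, 110)),
    (0.8, (20, 20, 120, 120)),    # overlaps the first box heavily
    (0.7, (200, 200, 300, 300)),  # separate object
    (0.1, (0, 0, 50, 50)),        # below the score threshold
]
kept = non_max_suppression(detections)
```

In a multi-class detector this would typically be run separately for each class.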
It also generalizes well to new domains