Image Segmentation
How many zebras?
From Sandlot Science

Why is context important?
What is this?

Why is this a car?

Because it's on the road!

Why is this a road?
Context is very important!

Same problem in real scenes

From images to objects
What defines an object?
Subjective problem, but has been well-studied
Proximity, similarity, continuation

Extracting objects
How could we do this automatically (or at least semi-automatically)?

Semi-automatic binary segmentation

Simplifying the user interaction
GrabCut [Rother et al., SIGGRAPH 2004]

Source: K. Grauman
Auto segmentation: toy example
[Figure: input image and its intensity histogram (pixel count), with peaks labeled "black pixels" and "white pixels"]
These intensities define the three groups.
We could label every pixel in the image according to which of these primary intensities it is, i.e., segment the image based on the intensity feature.
But the image isn't always quite so simple.

[Figure: input image and its intensity histogram (pixel count)]
Now how do we determine the three main intensities that define our groups?
We need to cluster.
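
The slides only say that we need to cluster. As a minimal sketch, assuming k-means as the clustering method (one common choice, not necessarily the one intended here), the three main intensities can be estimated and each pixel labeled by its nearest one. The function names `kmeans_1d` and `label_pixels`, the initial centers, and the toy image are hypothetical, chosen only for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on scalar intensities, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each intensity to the nearest current center.
            nearest = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def label_pixels(image, centers):
    """Replace every pixel by the index of its nearest center."""
    return [
        [min(range(len(centers)), key=lambda k: abs(p - centers[k])) for p in row]
        for row in image
    ]


# Toy grayscale image (assumed values): roughly black, gray, and white pixels with noise.
image = [
    [0, 10, 128, 120],
    [5, 250, 255, 130],
    [245, 2, 125, 252],
]
pixels = [p for row in image for p in row]
centers = kmeans_1d(pixels, centers=[0, 128, 255])
labels = label_pixels(image, centers)
print(centers)  # [4.25, 125.75, 250.5]
print(labels)   # [[0, 0, 1, 1], [0, 2, 2, 1], [2, 0, 1, 2]]
```

The resulting label map is a segmentation based on the intensity feature alone.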

Deep Learning
Semantic Segmentation: GRASS, CAT, TREE, SKY (pixel-level)
Classification + Localization: CAT (single object)
Segmentation + Classification
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)
Fei-Fei Li & &
Serena 11 13
May 10, 2017

Semantic Segmentation
Label each pixel in the image with a category label
Don't differentiate instances, only care about pixels

Semantic Segmentation Idea: Fully Convolutional
Design a network as a bunch of convolutional layers to make predictions for pixels all at once!

Convolutions
Scores: C x H x W
Each channel is a class: C channels -> C classes
Predictions: H x W
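
To make the step from scores to predictions concrete, here is a minimal sketch, assuming the scores are stored as nested lists indexed `[class][row][column]`. The function name `predict_labels` and the toy scores are hypothetical. Each pixel gets the index of the channel with the highest score.

```python
def predict_labels(scores):
    """Turn C x H x W class scores into an H x W map of class indices (argmax over C)."""
    num_classes = len(scores)
    height = len(scores[0])
    width = len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Toy scores for C = 3 classes on a 2 x 2 image (assumed values).
scores = [
    [[0.90, 0.10], [0.20, 0.30]],  # channel 0
    [[0.05, 0.80], [0.10, 0.30]],  # channel 1
    [[0.05, 0.10], [0.70, 0.40]],  # channel 2
]
predictions = predict_labels(scores)
print(predictions)  # [[0, 1], [2, 2]]
```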

Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W
Long, Shelhamer, and Darrell, Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
Noh et al, Learning Deconvolution Network for Semantic Segmentation, ICCV 2015

Semantic Segmentation Idea: Fully Convolutional
Downsampling: Pooling, strided convolution
Upsampling: ???

In-Network upsampling: Unpooling
Nearest Neighbor: Input 2 x 2 -> Output 4 x 4
Bed of Nails: Input 2 x 2 -> Output 4 x 4
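
The two unpooling schemes can be sketched in a few lines, assuming a fixed upsampling factor of 2 and that bed of nails places each value in the top-left corner of its block. The function names and the toy input are hypothetical.

```python
def nearest_neighbor_unpool(x, factor=2):
    """Copy each input value into a factor x factor block of the output."""
    return [
        [x[i // factor][j // factor] for j in range(len(x[0]) * factor)]
        for i in range(len(x) * factor)
    ]


def bed_of_nails_unpool(x, factor=2):
    """Put each input value in the top-left corner of its block; fill the rest with zeros."""
    out = [[0] * (len(x[0]) * factor) for _ in range(len(x) * factor)]
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            out[i * factor][j * factor] = value
    return out


x = [[1, 2],
     [3, 4]]
nn = nearest_neighbor_unpool(x)
nails = bed_of_nails_unpool(x)
print(nn)     # [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
print(nails)  # [[1, 0, 2, 0], [0, 0, 0, 0], [3, 0, 4, 0], [0, 0, 0, 0]]
```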

In-Network upsampling: Max Unpooling
Max Pooling: remember which element was max! (Input: 4 x 4 -> Output: 2 x 2)
Rest of the network
Max Unpooling: use positions from pooling layer (Input: 2 x 2 -> Output: 4 x 4)
Corresponding pairs of downsampling and upsampling layers
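
A minimal sketch of such a paired pooling and unpooling step, assuming 2 x 2 windows with stride 2. The function names `max_pool_2x2` and `max_unpool_2x2` and all numeric values are hypothetical. The pooling step records where each maximum came from, and the unpooling step writes later values back to those positions, leaving zeros elsewhere.

```python
def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2 that also records the position of each max."""
    pooled, positions = [], []
    for i in range(0, len(x), 2):
        pooled_row, position_row = [], []
        for j in range(0, len(x[0]), 2):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in (0, 1) for dj in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            position_row.append(position)
        pooled.append(pooled_row)
        positions.append(position_row)
    return pooled, positions


def max_unpool_2x2(values, positions, out_h, out_w):
    """Place each value at the position remembered by the matching pooling layer."""
    output = [[0] * out_w for _ in range(out_h)]
    for value_row, position_row in zip(values, positions):
        for value, (r, c) in zip(value_row, position_row):
            output[r][c] = value
    return output


features = [[1, 2, 6, 3],
            [3, 5, 2, 1],
            [1, 2, 2, 1],
            [7, 3, 4, 8]]
pooled, positions = max_pool_2x2(features)
print(pooled)  # [[5, 6], [7, 8]]

# Values produced later in the network at the pooled resolution (assumed).
later = [[1, 2],
         [3, 4]]
restored = max_unpool_2x2(later, positions, 4, 4)
print(restored)  # [[0, 0, 2, 0], [0, 1, 0, 0], [0, 0, 0, 0], [3, 0, 0, 4]]
```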

3 x 3 transpose convolution, stride 2 pad 1
Input: 2 x 2
Output: 4 x 4
Input gives weight for filter
Sum where output overlaps
Filter moves 2 pixels in the output for every one pixel in the input
Stride gives ratio between movement in output and input
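
The following is a rough sketch of the 2D case, not a reference implementation. Each input value scales a copy of the 3 x 3 filter, copies are placed 2 pixels apart, and overlaps are summed. Without cropping, a 2 x 2 input gives a 5 x 5 result. To arrive at 4 x 4, the sketch assumes that `pad` rows and columns are dropped from the top and left and that the next 4 are kept; that cropping convention, the function name `transpose_conv2d`, and the all-ones filter are assumptions.

```python
def transpose_conv2d(x, w, stride=2, pad=1, out_size=4):
    """Scatter input-weighted copies of the filter into the output, summing overlaps."""
    k = len(w)
    full_h = (len(x) - 1) * stride + k
    full_w = (len(x[0]) - 1) * stride + k
    full = [[0] * full_w for _ in range(full_h)]
    for i, row in enumerate(x):
        for j, value in enumerate(row):
            for di in range(k):
                for dj in range(k):
                    # The input value acts as the weight for this copy of the filter.
                    full[i * stride + di][j * stride + dj] += value * w[di][dj]
    # Assumed cropping convention: skip `pad` rows/columns, keep the next `out_size`.
    return [r[pad:pad + out_size] for r in full[pad:pad + out_size]]


x = [[1, 2],
     [3, 4]]
w = [[1, 1, 1],
     [1, 1, 1],
     [1, 1, 1]]
out = transpose_conv2d(x, w)
print(out)  # [[1, 3, 2, 2], [4, 10, 6, 6], [3, 7, 4, 4], [3, 7, 4, 4]]
```

Entries such as 3, 7, and 10 are the places where neighboring filter copies overlap and their contributions are summed.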

Transpose Convolution: 1D Example
Output contains copies of the filter weighted by the input, summing where they overlap in the output
Adapted from
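
The 1D case is small enough to trace by hand. As a minimal sketch, assuming stride 2, a 3-tap filter, and no cropping (the function name `transpose_conv1d` and the numbers are hypothetical): for input (a, b) and filter (x, y, z), the output is (ax, ay, az + bx, by, bz).

```python
def transpose_conv1d(inputs, kernel, stride=2):
    """Each input value scales a copy of the kernel; copies are summed where they overlap."""
    out_len = (len(inputs) - 1) * stride + len(kernel)
    output = [0] * out_len
    for i, value in enumerate(inputs):
        for k, weight in enumerate(kernel):
            output[i * stride + k] += value * weight
    return output


# a = 1, b = 2 and filter (x, y, z) = (1, 10, 100)
result = transpose_conv1d([1, 2], [1, 10, 100])
print(result)  # [1, 10, 102, 20, 200]
```

The middle entry 102 is az + bx = 100 + 2, the single position where the two filter copies overlap.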

Object Detection as Regression?
CAT: (x, y, w, h)
DOG: (x, y, w, h) DOG: (x, y, w, h) CAT: (x, y, w, h)
DUCK: (x, y, w, h) DUCK: (x, y, w, h) .
16 numbers
Many numbers!
Each image needs a different number of outputs!

Object Detection as Classification: Sliding Window
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
Dog? NO Cat? NO Background? YES
Dog? YES Cat? NO Background? NO
Dog? NO Cat? YES Background? NO
Problem: Need to apply CNN to huge number of locations and scales, very computationally expensive!
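
To see why the cost grows quickly, here is a minimal sketch that only enumerates the crops a sliding-window detector would have to classify. It assumes square windows and a fixed stride; the function name `sliding_windows` and all sizes are hypothetical, and the CNN itself is left out.

```python
def sliding_windows(img_h, img_w, window_sizes, stride):
    """Yield (x, y, w, h) for every square crop at every listed size."""
    for size in window_sizes:
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield (x, y, size, size)


# Tiny example: an 8 x 8 image, window sizes 4 and 8, stride 2.
crops = list(sliding_windows(8, 8, window_sizes=(4, 8), stride=2))
print(len(crops))  # 10
print(crops[0])    # (0, 0, 4, 4)
```

Every crop would need its own forward pass through the CNN, and the count grows with image size, number of scales and aspect ratios, and smaller strides.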

Region Proposals
Find image regions that are likely to contain objects
Relatively fast to run; e.g. Selective Search gives 1000 region proposals in a few seconds on CPU
Alexe et al, Measuring the objectness of image windows, TPAMI 2012
Uijlings et al, Selective Search for Object Recognition, IJCV 2013
Cheng et al, BING: Binarized normed gradients for objectness estimation at 300fps, CVPR 2014
Zitnick and Dollar, Edge boxes: Locating object proposals from edges, ECCV 2014

Alexe et al., CVPR 2010
