RCNN-family

Object Detection

To what extent do [Krizhevsky et al.’s results] generalize to object detection?

Object detection is the task of finding the different objects in an image and classifying them.

mAP (mean average precision)

This blog explains mAP clearly.
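As a minimal sketch of the quantity behind mAP (uninterpolated AP, not the exact 11-point VOC protocol; the helper name and inputs below are illustrative):

```python
import numpy as np

def average_precision(scores, is_positive, n_gt):
    """AP for one class: sort detections by confidence, accumulate
    precision/recall, and take the (uninterpolated) area under the
    precision-recall curve. `is_positive[i]` marks whether detection i
    matched a ground-truth box; `n_gt` is the number of ground-truth
    objects of this class."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    # Each true positive raises recall by 1/n_gt; AP sums precision there.
    return float(np.sum(precision * tp) / n_gt)

# Three detections, two ground-truth objects: AP = (1 + 2/3) / 2
print(average_precision([0.9, 0.8, 0.7], [True, False, True], 2))
```

mAP is then just the mean of this per-class AP over all object classes.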

RCNN

The goal of R-CNN is to take in an image and correctly identify where the main objects in the image are (via bounding boxes).

Inputs: Image
Outputs: Bounding boxes + labels for each object in the image.

region proposals:

At a high level, Selective Search (shown in the image above) looks at the image through windows of different sizes, and for each size tries to group together adjacent pixels by texture, color, or intensity to identify objects.
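Selective Search itself is more elaborate (hierarchical grouping over multiple complementary similarity measures), but the core grouping idea can be illustrated with a toy flood-fill: merge adjacent pixels of similar intensity and report each group's bounding box. `propose_regions` and its tolerance parameter are made up for illustration:

```python
import numpy as np

def propose_regions(gray, tol=10):
    """Toy stand-in for Selective Search: flood-fill groups of adjacent
    pixels whose intensity is within `tol` of the seed pixel, and return
    each group's bounding box as (x1, y1, x2, y2)."""
    h, w = gray.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if seen[sy, sx]:
                continue
            ref, stack, pixels = int(gray[sy, sx]), [(sy, sx)], []
            seen[sy, sx] = True
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny, nx] \
                            and abs(int(gray[ny, nx]) - ref) < tol:
                        seen[ny, nx] = True
                        stack.append((ny, nx))
            ys, xs = zip(*pixels)
            boxes.append((min(xs), min(ys), max(xs) + 1, max(ys) + 1))
    return boxes

img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 3:7] = 200  # one bright "object" on a dark background
print(propose_regions(img))
```

The real algorithm additionally groups by texture and size and runs at multiple scales, which is what produces the ~2000 proposals per image.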

feature extraction:

They extract a 4096-dimensional feature vector from each region proposal using the Caffe implementation of the CNN described by Krizhevsky et al. Features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five convolutional layers and two fully connected layers.

Since the sizes of the boxes produced by the region proposals vary, they first warp the image data in each region into a form compatible with the CNN (its architecture requires inputs of a fixed 227 × 227 pixel size).
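A minimal sketch of this warping, assuming a plain nearest-neighbour resize (the paper additionally pads the box with surrounding image context, omitted here; `warp_region` is a hypothetical helper):

```python
import numpy as np

def warp_region(image, box, out_size=227):
    """Crop `box` = (x1, y1, x2, y2) from `image` (H, W, 3) and resize
    it to out_size x out_size with nearest-neighbour sampling."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # Map each output pixel back to a source pixel index.
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return crop[rows[:, None], cols]

img = np.zeros((400, 600, 3), dtype=np.uint8)
warped = warp_region(img, (50, 30, 350, 180))
print(warped.shape)  # (227, 227, 3)
```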

training:

Supervised pre-training

They discriminatively pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding-box labels are not available for this data).

Domain-specific fine-tuning.

To adapt the CNN to the new task (detection) and the new domain (warped proposal windows), they continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals. Mini-batches are sampled with a biased ratio of positives to negatives, where a proposal counts as positive if its IoU with a ground-truth box is at least 0.5 and as negative otherwise.

Object category classifiers.

Similar to the step above, an IoU threshold splits the proposals into negative and positive samples for training per-class SVM classifiers.
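Both splits hinge on IoU, which is straightforward to compute from corner coordinates; `iou` below is a small illustrative helper assuming (x1, y1, x2, y2) boxes:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Half-overlapping boxes: intersection 50, union 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```

A proposal whose IoU against some ground-truth box clears the chosen threshold is treated as a positive for that box's class.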

Summary

  • Generate a set of proposals for bounding boxes.
  • Run the image inside each bounding box through a pre-trained AlexNet and, finally, an SVM to classify the object in the box.
  • Run the box through a linear regression model to output tighter coordinates for the box once the object has been classified.

Problems:

  1. Training is a multi-stage pipeline. R-CNN first fine-tunes a ConvNet on object proposals using log loss. Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learnt by fine-tuning. In the third training stage, bounding-box regressors are learned.
  2. Training is expensive in space and time. For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the VOC07 trainval set. These features require hundreds of gigabytes of storage.
  3. Object detection is slow. At test-time, features are extracted from each object proposal in each test image. Detection with VGG16 takes 47s / image (on a GPU).

Fast-RCNN

Architecture


The input image, together with a set of region proposals, is forwarded through a conv network to produce a feature map. Each region proposal is projected onto the feature map, and the RoI pooling layer turns its portion of the map into a fixed-size RoI feature vector. Each feature vector is then fed into a sequence of fully connected (FC) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class, and another that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
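A shape-level sketch of those two sibling layers on top of one RoI feature vector (randomly initialized weights; K = 20 is an assumed class count, and the 4096-d feature size follows the R-CNN description above):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

K = 20                                  # assumed number of object classes
feat = np.random.randn(4096)            # one RoI feature vector
W_cls, b_cls = np.random.randn(K + 1, 4096) * 0.01, np.zeros(K + 1)
W_box, b_box = np.random.randn(4 * K, 4096) * 0.01, np.zeros(4 * K)

cls_probs = softmax(W_cls @ feat + b_cls)          # K classes + background
box_deltas = (W_box @ feat + b_box).reshape(K, 4)  # 4 box offsets per class
print(cls_probs.shape, box_deltas.shape)  # (21,) (20, 4)
```

At test time the predicted class picks out its own row of `box_deltas`, which refines that RoI's box.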

RoI (Region of Interest) pooling layer

Similar to the warping in R-CNN, which turns each region proposal into a fixed-size input for the fine-tuned network, RoI max pooling works by dividing the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell.
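A NumPy sketch of RoI max pooling, assuming the RoI is given in feature-map coordinates and spans at least H × W cells (so no sub-window is empty):

```python
import numpy as np

def roi_max_pool(feat, roi, H=7, W=7):
    """Max-pool the RoI window (x1, y1, x2, y2, in feature-map cells)
    of `feat` (C, h_map, w_map) into a fixed (C, H, W) output."""
    x1, y1, x2, y2 = roi
    window = feat[:, y1:y2, x1:x2]
    h, w = window.shape[1:]
    # Bin boundaries: sub-windows of roughly h/H x w/W cells each.
    ys = np.linspace(0, h, H + 1).astype(int)
    xs = np.linspace(0, w, W + 1).astype(int)
    out = np.empty((feat.shape[0], H, W), dtype=feat.dtype)
    for i in range(H):
        for j in range(W):
            out[:, i, j] = window[:, ys[i]:ys[i + 1],
                                  xs[j]:xs[j + 1]].max(axis=(1, 2))
    return out

feat = np.arange(3 * 20 * 30, dtype=float).reshape(3, 20, 30)
pooled = roi_max_pool(feat, (2, 3, 26, 17))
print(pooled.shape)  # (3, 7, 7)
```

Whatever the RoI's size, the output is always H × W per channel, which is what lets arbitrary proposals feed a fixed-size FC layer.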

Training

  • step 1: pre-training:
    Starting from one of three ImageNet pre-trained networks, the last max pooling layer is replaced by a RoI pooling layer (configured by setting H and W to be compatible with the net’s first fully connected layer), and the last fully connected layer and softmax are replaced with the two sibling output layers.

  • step 2: fine-tuning:

    • summary:
      • mini-batch:
        In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image.
      • end-to-end:
        Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages.
    • loss: multi-task loss:
      $L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v)$
      Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression:
      • $L_{cls}(p, u) = -\log p_u$
        classification loss is log loss for true class u
      • $L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i^u - v_i)$
        is a robust L1 loss that is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet. When the regression targets are unbounded, training with L2 loss can require careful tuning of learning rates in order to prevent exploding gradients.
    • mini-batch:

      • sharing weights:
        Each SGD mini-batch is constructed from N = 2 images, chosen uniformly at random (as is common practice, we actually iterate over permutations of the dataset). We use mini-batches of size R = 128, sampling 64 RoIs from each image.
      • ratio: Similar to the biased sampling ratio in R-CNN fine-tuning, 25% of the RoIs are taken from proposals with IoU of at least 0.5 with a ground-truth box; these are the positives and receive ground-truth class labels. Proposals with IoU in [0.1, 0.5) are used as background.
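The multi-task loss above can be sketched per RoI as follows; a NumPy illustration with λ = 1 and u = 0 denoting the background class:

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth-L1: 0.5 x^2 if |x| < 1, else |x| - 0.5."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x * x, np.abs(x) - 0.5)

def multi_task_loss(p, u, t_u, v, lam=1.0):
    """L = L_cls(p, u) + lam * [u >= 1] * L_loc(t^u, v) for one RoI.
    p: predicted class probabilities (index 0 = background),
    u: true class, t_u: predicted box offsets for class u, v: targets."""
    l_cls = -np.log(p[u])                      # log loss for the true class
    l_loc = smooth_l1(np.subtract(t_u, v)).sum() if u >= 1 else 0.0
    return float(l_cls + lam * l_loc)

# Foreground RoI: classification term plus a small localization term.
print(multi_task_loss([0.1, 0.9], 1, [0.5, 0.0, 0.0, 0.0], [0.0] * 4))
```

The indicator [u ≥ 1] switches the localization term off for background RoIs, which have no regression target.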