Tuesday, October 17, 2017. 12:00PM. NSH 3305.
Xiaolong Wang -- Learning Visual Representations for Object Detection
Abstract: Object detection is central to many applications in computer vision. The current pipeline for training object detectors includes ConvNet pre-training and fine-tuning. In this talk, I am going to cover our work on self-supervised/unsupervised ConvNet pre-training as well as optimization strategies for fine-tuning.
For ConvNet pre-training, instead of using millions of labeled images, we explored learning visual representations using supervision from the data itself, without any human labels, i.e., self-supervised learning. Specifically, we exploited different self-supervised approaches to learn representations invariant to (i) inter-instance variations (two objects of the same class should have similar features) and (ii) intra-instance variations (viewpoint, pose, deformation, illumination). Instead of combining the two approaches with multi-task learning, we organized the data with multiple variations in a graph and applied simple transitive rules to generate pairs of images with richer visual invariance for training. This approach brings object detection accuracy on the MS COCO dataset to within 1% of methods that use large amounts of labeled data (e.g., ImageNet).
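To make the transitive rule concrete: suppose the graph has intra-instance edges linking patches of the same object (e.g., tracked through a video, so viewpoint and pose vary) and inter-instance edges linking visually similar patches of different objects. Composing the two edge types yields positive pairs that differ in both instance identity and viewpoint. Below is a minimal Python sketch of this pair generation under those assumptions; the edge lists and the function are illustrative, not the talk's actual code.

```python
# Sketch of transitive positive-pair generation (hypothetical inputs):
#   intra_edges: pairs of patches from the same instance (viewpoint/pose differ)
#   inter_edges: pairs of visually similar patches from different instances
from collections import defaultdict

def transitive_pairs(intra_edges, inter_edges):
    """If A~A' (same instance), A~B (similar instances), and B~B' (same
    instance), then (A', B') is a positive pair with richer invariance."""
    intra = defaultdict(set)
    for a, b in intra_edges:
        intra[a].add(b)
        intra[b].add(a)
    pairs = set()
    for a, b in inter_edges:
        for a2 in intra[a] | {a}:      # A and its intra-instance neighbors
            for b2 in intra[b] | {b}:  # B and its intra-instance neighbors
                if a2 != b2:
                    pairs.add((a2, b2))
    return pairs

# e.g. A/A2 are the same car tracked across frames; A and B look alike:
pairs = transitive_pairs(intra_edges=[("A", "A2"), ("B", "B2")],
                         inter_edges=[("A", "B")])
# -> includes ("A2", "B2"): different instance AND different viewpoint
```

Training a ConvNet to embed such transitively linked pairs close together then encourages invariance to both kinds of variation at once, without any human labels.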
For object detection fine-tuning, we propose to train object detectors that are invariant to occlusions and deformations. The common solution is a data-driven strategy: collect large-scale datasets with object instances under different conditions. However, like categories, occlusions and object deformations follow a long-tail distribution. Some occlusions and deformations are so rare that they hardly ever occur; yet we want to learn a model invariant to such occurrences. To this end, we learn an adversarial network that generates examples with occlusions and deformations; the goal of the adversary is to generate examples that are difficult for the object detector to classify. In our framework, the original detector and the adversary are learned jointly. We show significant improvements on different datasets (VOC, COCO) with different network architectures (AlexNet, VGG16, ResNet101).
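One way to realize the occlusion adversary is as a small network that, given the detector's pooled features for an object, predicts which spatial locations to drop so that classification becomes hard; the detector then trains on those occluded features. The PyTorch sketch below shows this two-player loop under stated assumptions: the mask network, `detector_head`, and `drop_ratio` are illustrative placeholders, not the talk's exact design, and `feats` stands in for ROI-pooled feature maps of shape (N, C, H, W).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OcclusionAdversary(nn.Module):
    """Predicts a per-location drop probability over the feature map;
    high-probability locations are zeroed to hide discriminative parts."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, feats):
        return self.net(feats)  # (N, 1, H, W), values in [0, 1]

def joint_step(feats, labels, adversary, detector_head, opt_det, opt_adv,
               drop_ratio=0.3):
    # Detector update: minimize classification loss on occluded features.
    with torch.no_grad():  # the adversary is fixed while the detector trains
        probs = adversary(feats)
        k = max(1, int(drop_ratio * probs[0].numel()))
        thresh = probs.flatten(1).topk(k, dim=1).values[:, -1]
        hard_mask = (probs >= thresh.view(-1, 1, 1, 1)).float()
    det_loss = F.cross_entropy(detector_head(feats * (1 - hard_mask)), labels)
    opt_det.zero_grad(); det_loss.backward(); opt_det.step()

    # Adversary update: maximize the detector's loss, i.e. learn to propose
    # occlusions the detector finds hard (the soft mask keeps the objective
    # differentiable with respect to the adversary's parameters).
    probs = adversary(feats)
    adv_loss = -F.cross_entropy(detector_head(feats * (1 - probs)), labels)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()
    return det_loss.item()
```

Because the adversary keeps adapting to the detector, it supplies the rare, hard occlusion patterns that a long-tailed dataset alone would almost never provide.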