Author: Zongwei Zhou | 周纵苇
Weibo: @MrGiovanni
Email: zongweiz@asu.edu

Weakly supervised object localization refers to learning object locations in an image relying only on image-level annotation.

Overview

A number of ongoing investigations address weakly supervised object localization relying only on image-level annotation, including models based on Self-Transfer Learning (Hwang et al.), Attention (Kazi et al.), and Multiple Instance Learning (Li et al.). One of the most recognized approaches is the class activation map (CAM), introduced by Zhou et al., which is produced from a Convolutional Neural Network (CNN) trained for classification. This localization ability is generic and has encouraged a number of medical applications in weakly supervised disease localization. For example, Wang et al. calculated CAMs from a multi-class CNN and generated bounding boxes for each pathology candidate in X-rays; Gondal et al. handled multiple diabetic retinopathy lesions in one retinal fundus image by considering multiple binarized region proposals from the CAM; Qi et al. achieved better localization performance by combining the CAMs of the four tissue combinations typically observed in a placental image.

Although CAM is promising for image-level detection, it may not be a proper approach for accurate object localization, which demands both the rough location and the approximate size of the object, because it focuses only on the most discriminative region of the object and ignores the rest. For example, when applied to localize cats, CAM focuses on only one of the most discriminative areas, such as the face, body, or tail, and fails to outline the whole cat. That is, measured by detection metrics CAM behaves appropriately, whereas measured by overlap metrics such as IoU and Dice it performs merely fairly. For this reason, many studies aim to improve CAM towards a more ambitious task: semantic segmentation relying only on image-level annotation.
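To make the gap between detection and overlap metrics concrete, here is a minimal sketch using only the Python standard library. The toy masks and the helper names (`_overlap`, `iou`, `dice`) are my own illustrations, not taken from any of the cited papers. The prediction lies entirely inside the object, so a detection-style check would count it as a hit, yet the overlap scores stay low.

```python
def _overlap(pred, truth):
    """Count pixels set in both binary masks (nested lists of 0/1)."""
    return sum(p & t
               for p_row, t_row in zip(pred, truth)
               for p, t in zip(p_row, t_row))


def iou(pred, truth):
    """Intersection over union of two binary masks."""
    inter = _overlap(pred, truth)
    union = sum(map(sum, pred)) + sum(map(sum, truth)) - inter
    return inter / union if union else 1.0


def dice(pred, truth):
    """Dice coefficient of two binary masks."""
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    return 2 * _overlap(pred, truth) / total if total else 1.0


# Toy 4x6 image: the object covers a 4x4 block,
# the predicted region only a 2x2 part of it (e.g. the "face").
truth = [[0, 1, 1, 1, 1, 0] for _ in range(4)]
pred = [[0] * 6 for _ in range(4)]
for y in (1, 2):
    for x in (1, 2):
        pred[y][x] = 1

print(iou(pred, truth), dice(pred, truth))  # 0.25 0.4
```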

In Teh et al., an attention mechanism was introduced to guide the classifier to learn a more discriminative object region using a traditional region proposal method, which is computationally costly and time-consuming in practice. Later, CAM was used as the region proposal method and regularized as an attention mask to reveal more discriminative regions. Kim et al. proposed two-phase learning: the CAM from a pre-trained CNN serves as a suppression mask, a second CNN is trained with this mask applied to its intermediate feature maps, and the CAMs from both CNNs are merged to obtain more accurate localization. This approach, however, is not suitable for medical images because the object class has to appear in the pre-training task. González-Gonzalo et al. slightly improved the localization accuracy of diabetic retinopathy lesions in an iterative manner by inpainting the input image based on the previously predicted CAM. Wei et al. trained several CNNs independently for adversarial erasing (AE) and recursively generated localization maps until training of the classification CNN failed. Zhang et al. proposed an improved version of AE, named Adversarial Complementary Learning (ACoL), by integrating those independent CNNs into a single network and training it end-to-end. Nevertheless, ACoL only produces maps of about the same quality as CAM, albeit in a more convenient way. Another recent attempt, by Singh et al., encourages the CNN to focus on multiple relevant parts of the object beyond just the most discriminative one by randomly removing patches from the input image.

The success of the aforementioned CAM- and attention-based approaches rests, in essence, on two consequential presumptions. First, the CNN must be accurate enough to make meaningful predictions for the classification task; in other words, if the CNN performs poorly, for reasons such as limited labeled data, the technique will be powerless for object localization. Second, given a well-trained CNN with outstanding classification results, the class activation map is supposed to activate the discriminative regions of the object. Unfortunately, this is only an intuitive assumption: experiments have demonstrated its effectiveness in most scenarios, but no solid theory proves its validity. The discriminative region is not always equivalent to the target object itself. Therefore, training a well-performing CNN on the classification task is a necessary but not sufficient condition for using CAM in weakly supervised object localization.

Novel Technical Approaches in Computer Vision

Is object localization for free?
Oquab, Maxime, et al. "Is object localization for free?-weakly-supervised learning with convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
https://leon.bottou.org/publications/pdf/cvpr-2015.pdf

This is probably the first work to describe weakly supervised object localization using CNNs. However, its localization is limited to a point lying within the boundary of the object rather than the full extent of the object, a consequence of global max pooling. Max pooling rather than average pooling was used because the task was formulated as a multiple instance learning (MIL) problem. Personally, I prefer this work to Zhou et al., and I especially admire the question it poses: is object localization with convolutional neural networks for free? The method is close to CAM and is described as follows:
First, we treat the last fully connected network layers as convolutions to cope with the uncertainty in object localization.
Second, we introduce a max-pooling layer that hypothesizes the possible location of the object in the image.
Third, we modify the cost function to learn from image-level supervision.
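The second step can be illustrated with a small standard-library sketch. Assuming the fully convolutional network has already produced a per-class score map (the nested list `score_map` below is made-up data, and `global_max_pool` is a hypothetical helper name), global max pooling returns both the image-level score and the single location that produced it. This is why the localization is a point rather than an extent.

```python
def global_max_pool(score_map):
    """Return (max score, (row, col)) for one class score map (nested lists)."""
    return max((score, (r, c))
               for r, row in enumerate(score_map)
               for c, score in enumerate(row))


# Made-up 3x3 score map for a single class.
score_map = [
    [0.1, 0.3, 0.2],
    [0.0, 2.5, 0.4],
    [0.1, 0.6, 0.2],
]
image_score, location = global_max_pool(score_map)
print(image_score, location)  # 2.5 (1, 1)
```

Only the winning location receives gradient from the image-level loss, which fits the MIL view: the image is a bag of locations and the highest-scoring one represents it.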

Class Activation Map (CAM)
Zhou, Bolei, et al. "Learning deep features for discriminative localization." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
http://cnnlocalization.csail.mit.edu/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf

The paper's message is very focused: it revisits the global average pooling layer and sheds light on how it explicitly enables a CNN to have remarkable localization ability despite being trained on image-level labels. Though CAM works well for object localization, it has low precision, covering both relevant and non-relevant regions (noise activations and background), and it produces low-resolution maps, which impedes precise object localization when the input image is small.
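As a rough illustration of how the map is computed, the sketch below weights each feature map of the last convolutional layer by the classifier weight of the chosen class and sums the result. It also checks that the class score after global average pooling equals the spatial mean of the CAM. The function names and toy numbers are my own assumptions, and bias terms and upsampling to the input resolution are left out.

```python
def class_activation_map(feature_maps, class_weights):
    """CAM(y, x) = sum_k w_k * f_k(y, x) for one class.

    feature_maps: list of K 2D nested lists (last conv layer outputs).
    class_weights: list of K weights connecting the pooled features to the class.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


def gap_class_score(feature_maps, class_weights):
    """Class score (without bias) after global average pooling."""
    score = 0.0
    for fmap, weight in zip(feature_maps, class_weights):
        n_pixels = len(fmap) * len(fmap[0])
        score += weight * sum(map(sum, fmap)) / n_pixels
    return score


# Toy example with K = 2 feature maps of size 2x2.
features = [[[1.0, 0.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.0, 2.0]]]
weights = [2.0, 0.5]

cam = class_activation_map(features, weights)
cam_mean = sum(map(sum, cam)) / 4
print(cam, cam_mean, gap_class_score(features, weights))
```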

Self-transfer Learning
Hwang, Sangheum, and Hyo-Eun Kim. "Self-transfer learning for weakly supervised lesion localization." International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2016.
https://link.springer.com/chapter/10.1007/978-3-319-46723-8_28

It works like Multi-task Learning (MTL).

Attention Networks
Teh, Eu Wern, Mrigank Rochan, and Yang Wang. "Attention Networks for Weakly Supervised Object Localization." BMVC. 2016.
http://www.cs.umanitoba.ca/~ywang/papers/bmvc16_attention.pdf

The Edge Boxes method extracts proposals (bounding boxes) that are likely to contain an object. Each proposal is passed to a linear layer to obtain its attention score. A softmax is then applied to the attention scores, and each normalized score is multiplied with its corresponding proposal features. This gives a whole-image feature vector that is the weighted average of the proposals. Finally, the whole-image feature is used to classify the image.
This paper introduces proposal attention to implicitly locate the object by learning the contribution of each proposed region to the final classification result. As in R-CNN, however, the region proposal approach (Edge Boxes) is adopted as an external module independent of the network, so the region locations do not adapt to the features learned during model training. Also, it cannot handle multiple objects in an image because only the one proposal with the highest attention score is detected per image. This method is not as flexible and powerful as CAM because:
First, the feature learner is not a deep neural network but simple linear layers, so the features may not be representative and robust.
Second, the region proposals are highly restricted by the conventional approach, so the final localization map only contains a rectangular box that coarsely approximates the object.
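The attention step described above can be sketched in a few lines of standard-library Python. The proposal features, the attention weights, and the function names (`softmax`, `attend`) are illustrative assumptions. In the paper the proposals come from Edge Boxes and the features from a network, which this sketch replaces with made-up vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(proposal_features, attn_weights, attn_bias=0.0):
    """Score each proposal with a linear layer, then average features by attention."""
    scores = [sum(w * f for w, f in zip(attn_weights, feat)) + attn_bias
              for feat in proposal_features]
    alphas = softmax(scores)
    dim = len(proposal_features[0])
    image_feature = [sum(a * feat[d] for a, feat in zip(alphas, proposal_features))
                     for d in range(dim)]
    return alphas, image_feature


# Three made-up proposals with 2-dimensional features.
proposals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, image_feature = attend(proposals, attn_weights=[2.0, 0.0])

# The proposal with the highest attention is reported as the object location.
best_proposal = max(range(len(alphas)), key=alphas.__getitem__)
print(alphas, image_feature, best_proposal)
```

Taking a single argmax over the attention scores is what limits the method to one detected object per image.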

Hide-and-seek
Kumar Singh, Krishna, and Yong Jae Lee. "Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization." Proceedings of the IEEE International Conference on Computer Vision. 2017.
http://openaccess.thecvf.com/content_ICCV_2017/papers/Singh_Hide-And-Seek_Forcing_a_ICCV_2017_paper.pdf

They also used CAM. Rather than modifying the CNN architecture, they modified the input image by hiding random patches from it. The underlying idea is to force the network to focus on multiple relevant parts of an object instead of only the most relevant part. I would rephrase the contribution as data augmentation: injecting noise (blocks) into the input images to learn a more robust deep neural network. In terms of feature maps, the activated region is enlarged from only the most discriminative region to several discriminative regions. I'm not sure whether the proposed method increases the number of false positive detections. Note that this work should be easy to reproduce. However, I don't expect noise-based data augmentation to dramatically improve the results, even if it yields a more generalizable deep neural network.
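Since the idea is essentially an input augmentation, it can be sketched without any deep learning library. In the sketch below the grid layout, the hiding probability `hide_prob`, and the fill value `fill` are assumptions chosen for illustration. The function name `hide_patches` is hypothetical and the image is a single-channel nested list.

```python
import random


def hide_patches(image, patch_size, hide_prob, fill=0.0, rng=None):
    """Return a copy of `image` in which each grid patch is hidden with `hide_prob`.

    image: 2D nested list (single channel). Hidden pixels are set to `fill`.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    hidden = [row[:] for row in image]  # do not modify the caller's image
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            if rng.random() < hide_prob:
                for y in range(top, min(top + patch_size, height)):
                    for x in range(left, min(left + patch_size, width)):
                        hidden[y][x] = fill
    return hidden


# Usage: hide roughly half of the 2x2 patches of a 4x4 image during training.
image = [[1.0] * 4 for _ in range(4)]
augmented = hide_patches(image, patch_size=2, hide_prob=0.5, rng=random.Random(0))
```

A fresh random pattern would be drawn for every training image and epoch, while test images are fed in unmodified.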

Two phase learning
Kim, Dahun, et al. "Two-phase learning for weakly supervised object localization." Proceedings of the IEEE International Conference on Computer Vision. 2017.
https://dgyoo.github.io/papers/iccv17.pdf

The method is limited to natural images whose object labels have been seen by the pre-trained model. Only the second network is trained, and the CAMs from the two networks are merged at inference time. Element-wise multiplication is used to constrain the intermediate feature maps in the second network. The quality of the suppression mask from the first network is important: if it is wrong, the second network will be meaningless. Therefore, this approach is not suitable for medical applications with limited labeled data.
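The masking step can be sketched as follows. The threshold, the binary form of the mask, and the assumption that the first network's CAM already has the same spatial size as the second network's feature maps are mine, as are the function names. How the two CAMs are merged at inference is not covered here.

```python
def suppression_mask(cam, threshold):
    """0 where the first network already responds strongly, 1 elsewhere."""
    return [[0.0 if value >= threshold else 1.0 for value in row] for row in cam]


def suppress(feature_maps, mask):
    """Element-wise multiply every channel (a 2D nested list) by the mask."""
    return [[[f * m for f, m in zip(f_row, m_row)]
             for f_row, m_row in zip(channel, mask)]
            for channel in feature_maps]


# Made-up CAM from the first (pre-trained) network and one feature channel
# from the second network, both 2x2.
first_cam = [[0.9, 0.2], [0.1, 0.0]]
second_features = [[[1.0, 2.0], [3.0, 4.0]]]

mask = suppression_mask(first_cam, threshold=0.5)
masked_features = suppress(second_features, mask)
print(mask, masked_features)
```

Because the most discriminative location is zeroed out, the second network has to rely on the remaining regions to classify the image.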

Adversarial erasing
Wei, Yunchao, et al. "Object region mining with adversarial erasing: A simple classification to semantic segmentation approach." IEEE CVPR. Vol. 1. No. 2. 2017.
http://openaccess.thecvf.com/content_cvpr_2017/papers/Wei_Object_Region_Mining_CVPR_2017_paper.pdf

With adversarial erasing (AE), a classification network first mines the most discriminative region for the image category label “dog”. Then, AE erases the mined region (head) from the image, and the classification network is re-trained to discover a new object region (body) for performing classification without a performance drop. We repeat this adversarial erasing process multiple times and merge the erased regions into an integral foreground segmentation mask. Repeating such adversarial erasing can localize increasingly discriminative regions diagnostic for the image category until no more informative region is left.
How do we recognize the ending point, when "no more informative region is left"? The algorithm loops while training of the classification network succeeds.
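The loop and its stopping rule can be sketched as below. Retraining a classifier is far beyond a short example, so two hypothetical callables stand in for it: `mine_region` returns the currently most discriminative pixels, and `still_classifiable` plays the role of "training of the classification network succeeds". These callables, the fill value, and the `max_rounds` safeguard are all assumptions of this sketch.

```python
def adversarial_erasing(image, mine_region, still_classifiable,
                        fill=0.0, max_rounds=10):
    """Iteratively erase mined regions and return the union of erased pixels."""
    current = [row[:] for row in image]
    erased = set()
    for _ in range(max_rounds):
        if not still_classifiable(current):
            break  # no informative region left: stop mining
        region = mine_region(current) - erased
        if not region:
            break
        for y, x in region:
            current[y][x] = fill
        erased |= region
    return erased


# Toy stand-ins: pixel values act as "evidence" for the class.
def mine_region(img):
    peak = max(max(row) for row in img)
    return {(y, x) for y, row in enumerate(img)
            for x, v in enumerate(row) if v == peak}


def still_classifiable(img):
    return max(max(row) for row in img) >= 0.5


image = [[0.0, 0.9, 0.6],
         [0.0, 0.6, 0.2]]
foreground = adversarial_erasing(image, mine_region, still_classifiable)
print(sorted(foreground))  # [(0, 1), (0, 2), (1, 1)]
```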

Adversarial Complementary Learning (ACoL)
Zhang, Xiaolin, et al. "Adversarial complementary learning for weakly supervised object localization." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
http://openaccess.thecvf.com/content_cvpr_2018/papers/Zhang_Adversarial_Complementary_Learning_CVPR_2018_paper.pdf

ACoL differs from AE in two ways. First, AE trains three networks independently for adversarial erasing, whereas ACoL trains two adversarial branches jointly by integrating them into a single network. Second, AE adopts a recursive method to generate localization maps and has to forward the networks multiple times, which the single end-to-end network of ACoL avoids.
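A heavily simplified sketch of the two-branch idea follows: the first branch produces a localization map, its strongest responses are erased from the shared features, the second branch looks at what remains, and the two maps are fused. The single-channel features, the normalization, the threshold `delta`, the element-wise maximum used for fusion, and the callables `branch_a` and `branch_b` are all assumptions made for illustration and should not be read as the paper's exact architecture.

```python
def normalize(score_map):
    """Scale a non-negative 2D map so that its maximum becomes 1."""
    peak = max(max(row) for row in score_map)
    return [[v / peak if peak > 0 else 0.0 for v in row] for row in score_map]


def acol_map(features, branch_a, branch_b, delta):
    """Fuse the maps of two complementary branches in a single pass."""
    map_a = normalize(branch_a(features))
    # Erase the locations where branch A already responds strongly.
    erased = [[0.0 if a > delta else f for f, a in zip(f_row, a_row)]
              for f_row, a_row in zip(features, map_a)]
    map_b = normalize(branch_b(erased))
    return [[max(a, b) for a, b in zip(a_row, b_row)]
            for a_row, b_row in zip(map_a, map_b)]


def identity(feature_map):
    """Toy branch: uses the features directly as its localization map."""
    return feature_map


fused = acol_map([[1.0, 0.4], [0.2, 0.0]], identity, identity, delta=0.6)
print(fused)
```

In the toy run, branch A highlights only the top-left pixel, while branch B, deprived of it, brings out the weaker neighbors, so the fused map covers more of the object.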

Multiple-instance learning (Maxpooling)
Li, Zhe, et al. "Thoracic disease identification and localization with limited supervision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
http://openaccess.thecvf.com/content_cvpr_2018/papers/Li_Thoracic_Disease_Identification_CVPR_2018_paper.pdf

Fair Applications in Medical Image Analysis

ChestX-ray8
Wang, Xiaosong, et al. "Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases." Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017.
http://openaccess.thecvf.com/content_cvpr_2017/papers/Wang_ChestX-ray8_Hospital-Scale_Chest_CVPR_2017_paper.pdf

Diabetic retinopathy lesions in retinal fundus images
Gondal, Waleed M., et al. "Weakly-supervised localization of diabetic retinopathy lesions in retinal fundus images." 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017.
https://arxiv.org/pdf/1706.09634.pdf

It can detect multiple lesions appearing in one image.
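One plausible way to turn a CAM into several region proposals is to binarize it and take the bounding box of each connected component. The sketch below does this with a breadth-first search. The threshold, the 4-connectivity, and the function name `cam_to_boxes` are my assumptions rather than the exact procedure of Gondal et al. or Wang et al.

```python
from collections import deque


def cam_to_boxes(cam, threshold):
    """Return one (x_min, y_min, x_max, y_max) box per connected region >= threshold."""
    height, width = len(cam), len(cam[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or cam[y][x] < threshold:
                continue
            # Breadth-first search over the 4-connected region starting here.
            queue = deque([(y, x)])
            seen[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cy, cx = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and cam[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Made-up CAM with two separate activations.
cam = [[0.9, 0.8, 0.0, 0.0],
       [0.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.7, 0.6],
       [0.0, 0.0, 0.7, 0.0]]
print(cam_to_boxes(cam, threshold=0.5))  # [(0, 0, 1, 0), (2, 2, 3, 3)]
```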

Placental Ultrasound Images with Residual Networks
Qi, Huan, Sally Collins, and Alison Noble. "Weakly supervised learning of placental ultrasound images with residual networks." Annual Conference on Medical Image Understanding and Analysis. Springer, Cham, 2017.

The CAMs are combined across the four tissue combinations typically observed in a placental image, namely (1) placenta only (PL); (2) placenta and myometrium (PL+MY); (3) placenta and subcutaneous tissue (PL+ST); (4) placenta, myometrium and subcutaneous tissue (PL+MY+ST). Localization is achieved by incorporating a global average pooling (GAP) layer before the fully connected layer.

Iterative saliency map refinement
González-Gonzalo, Cristina, et al. "Improving weakly-supervised lesion localization with iterative saliency map refinement." (MIDL 2018).
https://openreview.net/pdf?id=r15c8gnoG

An interesting approach that reveals additional discriminative image regions by inpainting the input based on the previous CAM. Note that the improvement from the initial to the final accuracy is slight, not significant, and application-specific.

Proximal Femur Fractures
Jiménez-Sánchez, Amelia, et al. "Weakly-Supervised Localization and Classification of Proximal Femur Fractures." arXiv preprint arXiv:1809.10692 (2018).
https://arxiv.org/pdf/1809.10692.pdf

This paper investigated and adapted Spatial Transformers (ST), Self-Transfer Learning (STL), and localization from global pooling layers (CAM), comparing models with and without localization, and with supervised and weakly supervised localization. In the paper's comparison, (a) and (f) are the lower- and upper-bound references, respectively; (b) requires supervised training for the localization network; (c), (d), and (e) are weakly supervised object localization models. Their experimental results show that STL guides feature activations and boosts performance when the dataset has a larger number of labels (6 classes), but lowers performance for binary classification. Different pooling layers were investigated and, as expected, global average pooling was confirmed as the best one. Also, CAM converges faster than the compared methods (Attention and STL) because it consists of a single network.
