Weakly Supervised Object Localization

Author: Zongwei Zhou | 周纵苇
Weibo: @MrGiovanni
Email: zongweiz@asu.edu

Weakly supervised object localization refers to learning object locations in an image while relying only on image-level annotations.

Overview

There have been a number of investigations into weakly supervised object localization that rely only on image-level annotation, covering self-transfer learning based (Hwang et al.), attention based (Kazi et al.), and multiple instance learning based (Li et al.) models. One of the most recognized approaches is the class activation map (CAM), introduced by Zhou et al., which is produced from a Convolutional Neural Network (CNN) trained for classification. This localization ability is generic and has encouraged a number of medical applications in weakly supervised disease localization. For example, Wang et al. calculated CAMs from a multi-class CNN and generated bounding boxes for each pathology candidate in X-rays; Gondal et al. handled multiple diabetic retinopathy lesions in one retinal fundus image by considering multiple binarized region proposals from the CAM; Qi et al. achieved better localization performance by combining the CAMs of four tissue combinations typically observed in a placental image.
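
As a rough illustration of how a CAM can be turned into a bounding box, the sketch below thresholds the map relative to its peak and takes the extent of the surviving pixels. The function name `cam_to_bbox`, the 0.5 threshold, and the single-box simplification are assumptions made for illustration, not the procedure of any paper cited here.

```python
import numpy as np

def cam_to_bbox(cam, rel_threshold=0.5):
    """Return (row_min, col_min, row_max, col_max) of pixels >= rel_threshold * peak.

    Assumption: one box around all above-threshold pixels; returns None if the
    map has no positive activation.
    """
    peak = cam.max()
    if peak <= 0:
        return None
    mask = cam >= rel_threshold * peak
    rows = np.where(mask.any(axis=1))[0]   # rows containing any active pixel
    cols = np.where(mask.any(axis=0))[0]   # columns containing any active pixel
    return int(rows[0]), int(cols[0]), int(rows[-1]), int(cols[-1])

# Toy activation map: one strong blob and one weak, isolated activation.
cam = np.zeros((8, 8))
cam[2:5, 3:7] = 1.0
cam[0, 0] = 0.2          # below the threshold, so it is ignored
bbox = cam_to_bbox(cam)
```

Handling several objects per image would need one box per connected region, which this sketch leaves out.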

Although CAM is promising for image-level detection, it may not be a proper approach for accurate object localization, which demands both the rough location and the approximate size of the object, because it focuses only on the most discriminative region of the object and ignores the rest. For example, when CAM is applied to localize cats, it focuses on only one of the most discriminative areas, such as the face, body, or tail, and fails to outline the whole cat. That is, CAM behaves appropriately when measured by detection metrics, but performs only fairly when measured by overlap metrics such as IoU and Dice. For this reason, many studies aim to improve CAM towards solving a more ambitious task: semantic segmentation relying only on image-level annotation.
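
To make the gap between detection and overlap metrics concrete, here is a small sketch that computes IoU and Dice for a prediction covering only part of an object. The masks are toy data invented for illustration, and `iou_and_dice` is a hypothetical helper name.

```python
import numpy as np

def iou_and_dice(pred_mask, true_mask):
    """Return (IoU, Dice) for two boolean masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    total = pred.sum() + true.sum()
    # Convention (an assumption): two empty masks count as a perfect match.
    iou = intersection / union if union > 0 else 1.0
    dice = 2 * intersection / total if total > 0 else 1.0
    return float(iou), float(dice)

# The object covers a 6x6 region (36 pixels); the prediction covers only a
# 2x2 part of it (4 pixels), like a CAM that fires on the face of a cat.
true_mask = np.zeros((10, 10), dtype=bool)
true_mask[2:8, 2:8] = True
pred_mask = np.zeros((10, 10), dtype=bool)
pred_mask[3:5, 3:5] = True

iou, dice = iou_and_dice(pred_mask, true_mask)
# iou = 4 / 36 (about 0.11), dice = 8 / 40 = 0.2
```

The prediction lies entirely inside the object, so it would count as a correct detection under a point-in-object criterion, yet both overlap scores are low.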

In Teh et al., an attention mechanism was introduced to guide the classifier to learn a more discriminative object region using a traditional region proposal method, which is computationally expensive and time-consuming in practice. Further, CAM has been used as a new region proposal method and regularized as an attention mask to reveal more discriminative regions. Kim et al. proposed two-phase learning: the CAM from a pre-trained CNN serves as a suppression mask, and a second CNN is trained with this suppression mask applied to its intermediate feature maps. The two CAMs from both CNNs are then merged to obtain more accurate localization. This approach, however, is not suitable for medical images because the class of the object has to appear in the pre-training task. González-Gonzalo et al. slightly improved the localization accuracy of diabetic retinopathy lesions in an iterative manner by inpainting the input image based on the previously predicted CAM. Wei et al. trained several CNNs independently for adversarial erasing (AE) and generated the localization map recursively until training of the classification CNN fails. Zhang et al. proposed an improved version of AE, named Adversarial Complementary Learning (ACoL), by integrating those independent CNNs into a single network and training it end-to-end. Nevertheless, ACoL can only produce maps of roughly the same quality as CAM, albeit in a more convenient way. Another recent attempt, by Singh et al., encourages the CNN to focus on multiple relevant parts of the object beyond just the most discriminative one by randomly removing patches from the input image.

The success of the aforementioned CAM-based and attention-based approaches rests, in essence, on two consequential presumptions. First, the CNN should be accurate enough to make meaningful predictions for the classification task; in other words, if the CNN performs poorly, for reasons such as limited labeled data, the technique will be powerless for object localization. Second, given a well-trained CNN with outstanding object classification results, the class activation map is supposed to activate the discriminative regions of the object. Unfortunately, this is only an intuitive assumption: experiments have demonstrated its effectiveness in most scenarios, but no solid theory proves its validity. The discriminative region is not always equivalent to the target object itself. Therefore, training an accurate CNN on the classification task is a necessary but not sufficient condition for using CAM in weakly supervised object localization.


Novel Technical Approaches in Computer Vision

  • Is object localization for free?
    Oquab, Maxime, et al. "Is object localization for free?-weakly-supervised learning with convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
    https://leon.bottou.org/publications/pdf/cvpr-2015.pdf

This is probably the first work to describe weakly supervised object localization using CNNs. However, the localization is limited to a point lying within the boundary of the object rather than determining the full extent of the object, a consequence of global max pooling. Max pooling was used rather than average pooling because the task was formulated as a multiple instance learning (MIL) problem. Personally, I prefer this work to Zhou et al., and I especially admire the question it poses: is object localization with convolutional neural networks for free? The method is close to CAM and is described as follows:
First, we treat the last fully connected network layers as convolutions to cope with the uncertainty in object localization.
Second, we introduce a max-pooling layer that hypothesizes the possible location of the object in the image.
Third, we modify the cost function to learn from image-level supervision.
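
The second step can be sketched as follows, assuming the fully convolutional network has already produced one score map per class. The array shapes and the helper name `global_max_pool_localize` are illustrative assumptions, and the modified cost function of the third step is not shown.

```python
import numpy as np

def global_max_pool_localize(score_maps):
    """Global max pooling over per-class score maps.

    score_maps: array of shape (num_classes, H, W).
    Returns the per-class image-level scores and the (row, col) position of
    each maximum, which acts as the hypothesized object location.
    """
    num_classes, height, width = score_maps.shape
    flat = score_maps.reshape(num_classes, -1)
    flat_idx = flat.argmax(axis=1)
    scores = flat[np.arange(num_classes), flat_idx]
    rows, cols = np.unravel_index(flat_idx, (height, width))
    return scores, np.stack([rows, cols], axis=1)

# Toy score maps for two classes on a 4x5 grid.
score_maps = np.zeros((2, 4, 5))
score_maps[0, 1, 3] = 2.5
score_maps[1, 2, 0] = 1.0
scores, locations = global_max_pool_localize(score_maps)
# scores -> [2.5, 1.0]; locations -> [[1, 3], [2, 0]]
```

Because only the single maximum contributes to each class score, the output is one point per class, which matches the limitation noted above.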

The main message of the CAM paper (Zhou et al.) is focused: it revisits the global average pooling layer and sheds light on how it explicitly enables a CNN to have remarkable localization ability despite being trained only on image-level labels. Although CAM works well for object localization, it has low precision, covering both relevant and non-relevant regions (noise activations and background), and it produces low-resolution maps, which impedes precise object localization when the input image is small.
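
A minimal NumPy sketch of the idea, assuming a network whose last convolutional layer is followed by global average pooling and a single bias-free linear layer. The shapes and random values are placeholders, not trained weights.

```python
import numpy as np

def classify_with_gap(feature_maps, fc_weights):
    """Global average pooling followed by a linear classifier (no bias)."""
    pooled = feature_maps.mean(axis=(1, 2))      # (C,)
    return fc_weights @ pooled                   # (num_classes,)

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Weight each feature map by the classifier weight of the chosen class.

    feature_maps: (C, H, W) output of the last convolutional layer.
    fc_weights:   (num_classes, C) weights of the layer after pooling.
    """
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=([0], [0]))

rng = np.random.default_rng(0)
feature_maps = rng.random((8, 7, 7))             # placeholder activations
fc_weights = rng.standard_normal((3, 8))         # placeholder weights

logits = classify_with_gap(feature_maps, fc_weights)
cam = class_activation_map(feature_maps, fc_weights, class_idx=1)
```

In this bias-free setting the spatial mean of the map equals the class logit, which is why the map can be read as a per-location contribution to the classification score. The map has the resolution of the last feature maps (7x7 here), which illustrates the low-resolution issue.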

It works like Multi-task Learning (MTL).

The Edge Boxes method extracts proposals (bounding boxes) that are likely to contain an object. Each proposal is passed to a linear layer to obtain its attention score. A softmax is then applied to the attention scores, and each resulting weight multiplies its corresponding proposal features. This gives a whole-image feature vector that is the weighted average of the proposal features. Finally, the whole-image feature is used to classify the image.
This paper introduces proposal attention to implicitly locate the object by learning the contribution of each proposed region towards the final classification result. As in R-CNN, however, the region proposal approach (Edge Boxes) is an external module independent of the network, so the region locations do not adapt to the features learned during model training. Also, it cannot handle multiple objects in an image because only the single proposal with the highest attention score is detected per image. This method is not as flexible and powerful as CAM because:
First, the feature learner is not a deep neural network but simple linear layers, so the features may not be representative and robust.
Second, the region proposals are highly restricted by the conventional approach, so the final class activation map contains only a rectangular box that coarsely approximates the object.
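
The attention pooling over proposals can be sketched as below, assuming proposal features have already been extracted. The dimensions, the random weights, and the names `attention_pool`, `w_att`, and `w_cls` are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - x.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def attention_pool(proposal_features, w_att, w_cls):
    """Score proposals, average their features by attention, then classify.

    proposal_features: (P, D), one feature vector per proposal.
    w_att: (D,) linear layer producing one attention score per proposal.
    w_cls: (K, D) linear classifier on the whole-image feature.
    """
    scores = proposal_features @ w_att            # (P,)
    weights = softmax(scores)                     # (P,), sums to 1
    image_feature = weights @ proposal_features   # (D,), weighted average
    logits = w_cls @ image_feature                # (K,)
    return logits, weights

rng = np.random.default_rng(0)
proposal_features = rng.standard_normal((5, 16))  # 5 proposals, 16-D features
w_att = rng.standard_normal(16)
w_cls = rng.standard_normal((3, 16))

logits, weights = attention_pool(proposal_features, w_att, w_cls)
best_proposal = int(weights.argmax())  # the single box reported as the location
```

Taking only `best_proposal` as the localization result is what restricts the output to one rectangle per image.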

They also used CAM. Rather than modifying the CNN architecture, they modified the input image by hiding random patches from it. The underlying idea is to force the network to learn to focus on multiple relevant parts of an object instead of only the most relevant part. I would rephrase the contribution as data augmentation: injecting noise (blocks) into the input images to learn a more robust deep neural network. In terms of feature maps, the activated region is enlarged from only the most discriminative region to several discriminative regions. I'm not sure whether the proposed method increases the number of false positive detections. Note that this work should be easy to reproduce. But I don't expect a dramatic improvement from a deep neural network made more generalizable by noise data augmentation.
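
A minimal sketch of hiding random patches as a training-time augmentation. The grid size, the hiding probability, and the zero fill value are assumptions chosen for illustration, not settings taken from the paper.

```python
import numpy as np

def hide_random_patches(image, patch_size=4, hide_prob=0.5, fill_value=0.0, rng=None):
    """Split the image into a grid and hide each cell with probability hide_prob.

    image: array of shape (H, W) or (H, W, channels); the input is not modified.
    fill_value: value written into hidden cells (zero here, as an assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    out = image.copy()
    height, width = image.shape[:2]
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            if rng.random() < hide_prob:
                out[top:top + patch_size, left:left + patch_size] = fill_value
    return out

image = np.ones((8, 8))
some_hidden = hide_random_patches(image, rng=np.random.default_rng(0))
all_hidden = hide_random_patches(image, hide_prob=1.0)
none_hidden = hide_random_patches(image, hide_prob=0.0)
```

A different random pattern would be drawn each time an image is seen during training, while the full image is used at test time.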

  • Two phase learning
    Kim, Dahun, et al. "Two-phase learning for weakly supervised object localization." Proceedings of the IEEE International Conference on Computer Vision. 2017.
    https://dgyoo.github.io/papers/iccv17.pdf

The approach is limited to natural images and to object labels already seen by the pre-trained model. Only the second network is trained, and the CAMs from the two networks are merged at inference time. Element-wise multiplication is used to constrain the intermediate feature maps in the second network. The quality of the suppression mask from the first network is important: if it is wrong, the second network will be meaningless. Therefore, this approach is not suitable for medical applications with limited labeled data.
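
The masking and merging steps might look like the sketch below. The relative threshold, the element-wise maximum used for merging, and all function names are assumptions; the notes above only state that the mask is applied by element-wise multiplication and that the two CAMs are merged.

```python
import numpy as np

def suppression_mask(cam_first, rel_threshold=0.6):
    """0 where the first network's CAM is strongly activated, 1 elsewhere.

    Assumption: "strongly activated" means >= rel_threshold * peak, with a
    positive peak.
    """
    return (cam_first < rel_threshold * cam_first.max()).astype(cam_first.dtype)

def suppress_features(feature_maps, mask):
    """Element-wise multiplication of (C, H, W) feature maps by an (H, W) mask."""
    return feature_maps * mask[None, :, :]

def merge_cams(cam_first, cam_second):
    """Merge the two maps at inference time (element-wise maximum, an assumption)."""
    return np.maximum(cam_first, cam_second)

# Toy maps: the first network fires at the top-left corner, and the second
# network, trained on suppressed features, fires at the bottom-right corner.
cam_first = np.zeros((4, 4))
cam_first[0, 0] = 1.0
cam_second = np.zeros((4, 4))
cam_second[3, 3] = 0.8

mask = suppression_mask(cam_first)
suppressed = suppress_features(np.ones((2, 4, 4)), mask)
merged = merge_cams(cam_first, cam_second)
```

If `cam_first` highlights the wrong region, the mask hides the wrong features and the second network learns from a misleading input, which is the failure mode described above.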

With adversarial erasing (AE), a classification network first mines the most discriminative region for the image category label “dog”. Then, AE erases the mined region (head) from the image, and the classification network is re-trained to discover a new object region (body) so that it can classify without a performance drop. The adversarial erasing process is repeated multiple times, and the erased regions are merged into an integral foreground segmentation mask. Repeating the erasing localizes increasingly more discriminative regions that are diagnostic for the image category until no informative region is left.
How is the stopping point recognized when "no more informative region" is left? The algorithm simply loops while training of the classification network succeeds.
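
The loop structure can be sketched as follows. Retraining a CNN is replaced here by a toy stand-in, `toy_train_and_mine`, which is entirely hypothetical: it treats the input as an evidence map, reports success while the peak evidence stays above a threshold, and mines the pixels at the peak.

```python
import numpy as np

def adversarial_erasing(image, train_and_mine, max_rounds=10):
    """Mine a region, erase it, and repeat until training no longer succeeds.

    train_and_mine(image) must return (success, region_mask); it stands in
    for retraining the classifier and thresholding its localization map.
    """
    remaining = image.copy()
    foreground = np.zeros(image.shape, dtype=bool)
    for _ in range(max_rounds):
        success, region = train_and_mine(remaining)
        if not success:                 # the "while training succeeds" test
            break
        foreground |= region            # merge the erased regions
        remaining[region] = 0.0         # erase the mined region
    return foreground

def toy_train_and_mine(image, min_evidence=0.3):
    """Hypothetical stand-in: succeed while the peak evidence is high enough."""
    peak = image.max()
    if peak < min_evidence:
        return False, np.zeros(image.shape, dtype=bool)
    return True, image == peak

evidence = np.array([[0.9, 0.9, 0.1],
                     [0.6, 0.6, 0.1],
                     [0.1, 0.1, 0.1]])
mask = adversarial_erasing(evidence, toy_train_and_mine)
# Round 1 mines the 0.9 pixels, round 2 the 0.6 pixels, round 3 stops.
```

In the real method each round requires retraining a network, so the stopping test depends on what counts as successful training.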

ACoL differs from AE in two ways. First, AE trains three networks independently for adversarial erasing, whereas ACoL trains two adversarial branches jointly by integrating them into a single network. Second, AE adopts a recursive method to generate localization maps and has to forward the networks multiple times, whereas ACoL generates them by forwarding the network only once.


Fair Applications in Medical Image Analysis

  • Diabetic retinopathy lesions in retinal fundus images
    Gondal, Waleed M., et al. "Weakly-supervised localization of diabetic retinopathy lesions in retinal fundus images." 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017.
    https://arxiv.org/pdf/1706.09634.pdf

It can handle the detection of multiple objects appearing in one image.

Qi et al. combine the CAMs of four tissue combinations typically observed in a placental image, namely (1) placenta only (PL); (2) placenta and myometrium (PL+MY); (3) placenta and subcutaneous tissue (PL+ST); (4) placenta, myometrium and subcutaneous tissue (PL+MY+ST). This is achieved by incorporating a global average pooling (GAP) layer before the fully connected layer.

  • Iterative saliency map refinement
    González-Gonzalo, Cristina, et al. "Improving weakly-supervised lesion localization with iterative saliency map refinement." (MIDL 2018).
    https://openreview.net/pdf?id=r15c8gnoG

An interesting approach that reveals discriminative image regions by inpainting based on the previous CAM, with a slight improvement from the initial accuracy to the final accuracy. Note that the improvement is not significant and is application-specific.

  • Proximal Femur Fractures
    Jiménez-Sánchez, Amelia, et al. "Weakly-Supervised Localization and Classification of Proximal Femur Fractures." arXiv preprint arXiv:1809.10692 (2018).
    https://arxiv.org/pdf/1809.10692.pdf

This paper investigated and adapted Spatial Transformers (ST), Self-Transfer Learning (STL), and localization from global pooling layers (CAM), covering settings with and without localization and with supervised and weakly supervised localization. (a) and (f) are the lower- and upper-bound references, respectively. (b) requires supervised training of the localization network. (c), (d), and (e) are weakly supervised object localization. Their experimental results show that STL guides feature activations and boosts performance when the dataset has a larger number of labels (6 classes), but lowers performance for binary classification. Different pooling layers are investigated and, as expected, global average pooling is confirmed as the best one. Also, CAM converges faster than the compared methods (Attention and STL) because it consists of a single network.
