Which Is More Robust: CNNs, Transformers, or MLPs?

https://arxiv.org/search/?query=convolution+transformer+robust&searchtype=all&source=header



RobustART: Benchmarking Robustness on Architecture Design and Training Techniques

arXiv:2110.02797

Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs    

★★★★★

Authors: Philipp Benz, Soomin Ham, Chaoning Zhang, Adil Karjauv, In So Kweon

Abstract: Convolutional Neural Networks (CNNs) have become the de facto gold standard in computer vision applications in the past years. Recently, however, new model architectures have been proposed challenging the status quo. The Vision Transformer (ViT) relies solely on attention modules, while the MLP-Mixer architecture substitutes the self-attention modules with Multi-Layer Perceptrons (MLPs). Despite their great success, CNNs have been widely known to be vulnerable to adversarial attacks, causing serious concerns for security-sensitive applications. Thus, it is critical for the community to know whether the newly proposed ViT and MLP-Mixer are also vulnerable to adversarial attacks. To this end, we empirically evaluate their adversarial robustness under several adversarial attack setups and benchmark them against the widely used CNNs. Overall, we find that the two architectures, especially ViT, are more robust than their CNN models. Using a toy example, we also provide empirical evidence that the lower adversarial robustness of CNNs can be partially attributed to their shift-invariant property. Our frequency analysis suggests that the most robust ViT architectures tend to rely more on low-frequency features compared with CNNs. Additionally, we have an intriguing finding that MLP-Mixer is extremely vulnerable to universal adversarial perturbations.

Submitted 11 October, 2021; v1 submitted 6 October, 2021; originally announced October 2021.

Comments: Code: https://github.com/phibenz/robustness_comparison_vit_mlp-mixer_cnn

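As a concrete picture of what such an evaluation looks like, here is a minimal FGSM robustness check in PyTorch with timm models. This is a hedged sketch, not the paper's protocol: the model names, epsilon, and the random dummy batch are placeholders (use pretrained=True and real ImageNet batches for a meaningful comparison).

```python
# Minimal FGSM robustness comparison (a sketch, not the paper's protocol).
import torch
import timm

def fgsm_attack(model, x, y, eps):
    """FGSM: x_adv = x + eps * sign(grad_x loss), clipped to the valid range."""
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def robust_accuracy(model, x, y, eps):
    """Top-1 accuracy on FGSM-perturbed inputs."""
    model.eval()
    x_adv = fgsm_attack(model, x, y, eps)
    with torch.no_grad():
        pred = model(x_adv).argmax(dim=1)
    return (pred == y).float().mean().item()

# Dummy batch in [0, 1]; swap in real (normalized) ImageNet data to reproduce.
x = torch.rand(8, 3, 224, 224)
y = torch.randint(0, 1000, (8,))

for name in ["resnet50", "vit_base_patch16_224"]:
    model = timm.create_model(name, pretrained=False)  # pretrained=True for real weights
    print(f"{name}: robust acc @ eps=8/255 = {robust_accuracy(model, x, y, 8 / 255):.3f}")
```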




arXiv:2106.13122

Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers

★★★★★

Authors: Katelyn Morrison, Benjamin Gilby, Colton Lipchak, Adam Mattioli, Adriana Kovashka

Abstract: Recently, vision transformers and MLP-based models have been developed in order to address some of the prevalent weaknesses in convolutional neural networks. Due to the novelty of transformers being used in this domain along with the self-attention mechanism, it remains unclear to what degree these architectures are robust to corruptions. Despite some works proposing that data augmentation remains essential for a model to be robust against corruptions, we propose to explore the impact that the architecture has on corruption robustness. We find that vision transformer architectures are inherently more robust to corruptions than the ResNet-50 and MLP-Mixers. We also find that vision transformers with 5 times fewer parameters than a ResNet-50 have more shape bias. Our code is available to reproduce.

Submitted 3 July, 2021; v1 submitted 24 June, 2021; originally announced June 2021.

Comments: Under review at the Uncertainty and Robustness in Deep Learning workshop at ICML 2021. Our appendix is attached to the last page of the paper


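As a hedged sketch of what a corruption-robustness probe looks like, the snippet below measures accuracy under Gaussian noise at increasing severity. The real ImageNet-C benchmark covers 15 corruption types at 5 severities each; the sigma values and model here are illustrative only.

```python
# Gaussian-noise corruption probe (one ImageNet-C-style corruption; sigmas illustrative).
import torch
import timm

def gaussian_noise(x, severity):
    """Add Gaussian noise at one of five severities (sigma values are illustrative)."""
    sigma = [0.04, 0.08, 0.12, 0.18, 0.26][severity - 1]
    return (x + sigma * torch.randn_like(x)).clamp(0, 1)

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

x = torch.rand(8, 3, 224, 224)            # stand-in for clean validation images
y = torch.randint(0, 1000, (8,))
model = timm.create_model("vit_small_patch16_224", pretrained=False).eval()

for severity in range(1, 6):
    print(f"severity {severity}: acc = {accuracy(model, gaussian_noise(x, severity), y):.3f}")
```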




arXiv:2105.10497

Intriguing Properties of Vision Transformers

★★★★★

Authors: Muzammal Naseer, Kanchana Ranasinghe, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang

Abstract: Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility in attending image-wide context conditioned on a given patch can facilitate handling nuisances in natural images e.g., severe occlusions, domain shifts, spatial permutations, adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViT: (a) Transformers are highly robust to severe occlusions, perturbations and domain shifts, e.g., retain as high as 60% top-1 accuracy on ImageNet even after randomly occluding 80% of the image content. (b) The robust performance to occlusions is not due to a bias towards local textures, and ViTs are significantly less biased towards textures compared to CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representation leads to an interesting consequence of accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy rates across a range of classification datasets in both traditional and few-shot learning paradigms. We show effective features of ViTs are due to flexible and dynamic receptive fields possible via the self-attention mechanism.

Submitted 8 June, 2021; v1 submitted 21 May, 2021; originally announced May 2021.

Comments: Code: https://git.io/Js15X

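Finding (a), robustness under random occlusion, is straightforward to probe. The sketch below zeroes out a random fraction of 16x16 patches before classification; the model and dummy batch are placeholders for the paper's ImageNet setup.

```python
# Random patch-occlusion probe: drop a fraction of 16x16 patches, measure top-1.
import torch
import timm

def occlude_patches(x, drop_ratio, patch=16):
    """Zero out a random subset of non-overlapping patch locations."""
    b, c, h, w = x.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw) > drop_ratio).float()
    keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * keep

x = torch.rand(8, 3, 224, 224)
y = torch.randint(0, 1000, (8,))
model = timm.create_model("vit_base_patch16_224", pretrained=False).eval()

with torch.no_grad():
    for ratio in [0.0, 0.5, 0.8]:          # the paper reports ~60% top-1 at 80% occlusion
        pred = model(occlude_patches(x, ratio)).argmax(dim=1)
        print(f"occlusion {ratio:.0%}: top-1 = {(pred == y).float().mean():.3f}")
```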


arXiv:2105.07926

Towards Robust Vision Transformer

Authors: Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan, Shaokai Ye, Yuan He, Hui Xue

Abstract: Recent advances on Vision Transformer (ViT) and its improved variants have shown that self-attention-based networks surpass traditional Convolutional Neural Networks (CNNs) in most vision tasks. However, existing ViTs focus on the standard accuracy and computation cost, lacking the investigation of the intrinsic influence on model robustness and generalization. In this work, we conduct systematic evaluation on components of ViTs in terms of their impact on robustness to adversarial examples, common corruptions and distribution shifts. We find some components can be harmful to robustness. By using and combining robust components as building blocks of ViTs, we propose Robust Vision Transformer (RVT), which is a new vision transformer and has superior performance with strong robustness. We further propose two new plug-and-play techniques called position-aware attention scaling and patch-wise augmentation to augment our RVT, which we abbreviate as RVT*. The experimental results on ImageNet and six robustness benchmarks show the advanced robustness and generalization ability of RVT compared with previous ViTs and state-of-the-art CNNs. Furthermore, RVT-S* also achieves Top-1 rank on multiple robustness leaderboards including ImageNet-C and ImageNet-Sketch. The code will be available at https://git.io/Jswdk.

Submitted 26 May, 2021; v1 submitted 17 May, 2021; originally announced May 2021.

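The precise definitions of position-aware attention scaling and patch-wise augmentation live in the linked code. Purely as a loose, hedged illustration of the patch-wise idea, the toy below applies an independent augmentation (brightness jitter) to each patch instead of to the whole image; the patch size and jitter range are guesses, not RVT's settings.

```python
# Toy "patch-wise" augmentation: independent brightness jitter per patch.
# Hypothetical parameters; RVT's actual formulation is in the linked repo.
import torch

def patchwise_brightness(x, patch=32, low=0.8, high=1.2):
    """Scale each patch by its own random brightness factor."""
    b, c, h, w = x.shape
    gh, gw = h // patch, w // patch
    factors = torch.empty(b, 1, gh, gw).uniform_(low, high)
    factors = factors.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return (x * factors).clamp(0, 1)

x = torch.rand(4, 3, 224, 224)
print(patchwise_brightness(x).shape)       # torch.Size([4, 3, 224, 224])
```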


arXiv:2105.07581

Vision Transformers are Robust Learners

★★★★★

Authors: Sayak Paul, Pin-Yu Chen

Abstract: Transformers, composed of multiple self-attention layers, hold strong promises toward a generic learning primitive applicable to different data modalities, including the recent breakthroughs in computer vision achieving state-of-the-art (SOTA) standard accuracy with better parameter efficiency. Since self-attention helps a model systematically align different components present inside the input data, it leaves grounds to investigate its performance under model robustness benchmarks. In this work, we study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples. We use six different diverse ImageNet datasets concerning robust classification to conduct a comprehensive performance comparison of ViT models and SOTA convolutional neural networks (CNNs), Big-Transfer. Through a series of six systematically designed experiments, we then present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners. For example, with fewer parameters and similar dataset and pre-training combinations, ViT gives a top-1 accuracy of 28.10% on ImageNet-A which is 4.3x higher than a comparable variant of BiT. Our analyses on image masking, Fourier spectrum sensitivity, and spread on discrete cosine energy spectrum reveal intriguing properties of ViT attributing to improved robustness. Code for reproducing our experiments is available here: https://git.io/J3VO0.

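The Fourier spectrum sensitivity analysis can be sketched as follows: perturb inputs along a single 2-D Fourier basis direction and count how often predictions flip, sweeping from low to high frequency. This is a simplified stand-in for the paper's analysis; the frequencies, amplitude, and model are illustrative.

```python
# Fourier-sensitivity probe: perturb along one 2-D Fourier basis direction
# and count prediction flips (simplified; frequencies/amplitude illustrative).
import torch
import timm

def fourier_basis(h, w, fi, fj):
    """Real-valued image carrying a single (fi, fj) frequency, unit L2 norm."""
    freq = torch.zeros(h, w, dtype=torch.complex64)
    freq[fi, fj] = 1.0
    basis = torch.fft.ifft2(freq).real
    return basis / basis.norm()

def flip_rate(model, x, basis, amplitude=4.0):
    """Fraction of predictions changed by the Fourier-basis perturbation."""
    with torch.no_grad():
        clean = model(x).argmax(dim=1)
        noisy = model((x + amplitude * basis).clamp(0, 1)).argmax(dim=1)
    return (clean != noisy).float().mean().item()

x = torch.rand(8, 3, 224, 224)
model = timm.create_model("vit_base_patch16_224", pretrained=False).eval()
for fi, fj in [(1, 1), (28, 28), (112, 112)]:   # low -> high spatial frequency
    print(f"freq ({fi},{fj}): flip rate = {flip_rate(model, x, fourier_basis(224, 224, fi, fj)):.3f}")
```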


arXiv:2104.02610

On the Robustness of Vision Transformers to Adversarial Examples

Authors: Kaleel Mahmood, Rigel Mahmood, Marten van Dijk

Abstract: Recent advances in attention-based networks have shown that Vision Transformers can achieve state-of-the-art or near state-of-the-art results on many image classification tasks. This puts transformers in the unique position of being a promising alternative to traditional convolutional neural networks (CNNs). While CNNs have been carefully studied with respect to adversarial attacks, the same cannot be said of Vision Transformers. In this paper, we study the robustness of Vision Transformers to adversarial examples. Our analyses of transformer security is divided into three parts. First, we test the transformer under standard white-box and black-box attacks. Second, we study the transferability of adversarial examples between CNNs and transformers. We show that adversarial examples do not readily transfer between CNNs and transformers. Based on this finding, we analyze the security of a simple ensemble defense of CNNs and transformers. By creating a new attack, the self-attention blended gradient attack, we show that such an ensemble is not secure under a white-box adversary. However, under a black-box adversary, we show that an ensemble can achieve unprecedented robustness without sacrificing clean accuracy. Our analysis for this work is done using six types of white-box attacks and two types of black-box attacks. Our study encompasses multiple Vision Transformers, Big Transfer Models and CNN architectures trained on CIFAR-10, CIFAR-100 and ImageNet. 

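The transferability experiment boils down to crafting adversarial examples on one architecture and testing them on the other. Below is a minimal version with FGSM standing in for the paper's stronger attacks; the models and data are placeholders.

```python
# Transfer-attack sketch: craft on a CNN, evaluate on a ViT (FGSM stands in
# for the paper's stronger attacks; models and data are placeholders).
import torch
import timm

def fgsm(model, x, y, eps=8 / 255):
    x = x.clone().detach().requires_grad_(True)
    torch.nn.functional.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

x = torch.rand(8, 3, 224, 224)
y = torch.randint(0, 1000, (8,))
source = timm.create_model("resnet50", pretrained=False).eval()
target = timm.create_model("vit_base_patch16_224", pretrained=False).eval()

x_adv = fgsm(source, x, y)                 # adversarial examples crafted on the CNN
with torch.no_grad():
    fooled = (target(x_adv).argmax(dim=1) != y).float().mean().item()
print(f"error rate of ViT on CNN-crafted examples: {fooled:.3f}")
```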




arXiv:2103.15670



On the Adversarial Robustness of Vision Transformers

★★★★★

Authors: Rulin Shao, Zhouxing Shi, Jinfeng Yi, Pin-Yu Chen, Cho-Jui Hsieh

Abstract: Following the success in advancing natural language processing and understanding, transformers are expected to bring revolutionary changes to computer vision. This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations. Tested on various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with convolutional neural networks (CNNs). This observation also holds for certified robustness. We summarize the following main observations contributing to the improved robustness of ViTs: 1) Features learned by ViTs contain less low-level information and are more generalizable, which contributes to superior robustness against adversarial perturbations. 2) Introducing convolutional or tokens-to-token blocks for learning low-level features in ViTs can improve classification accuracy but at the cost of adversarial robustness. 3) Increasing the proportion of transformers in the model structure (when the model consists of both transformer and CNN blocks) leads to better robustness. But for a pure transformer model, simply increasing the size or adding layers cannot guarantee a similar effect. 4) Pre-training on larger datasets does not significantly improve adversarial robustness though it is critical for training ViTs. 5) Adversarial training is also applicable to ViT for training robust models. Furthermore, feature visualization and frequency analysis are conducted for explanation. The results show that ViTs are less sensitive to high-frequency perturbations than CNNs and there is a high correlation between how well the model learns low-level features and its robustness against different frequency-based perturbations.

Submitted 14 October, 2021; v1 submitted 29 March, 2021; originally announced March 2021.

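Observation 5, that adversarial training also works for ViTs, is the most actionable. Below is a minimal PGD adversarial-training step; it is the generic recipe, not the paper's exact setup, and the eps/alpha/steps values are conventional defaults rather than theirs.

```python
# Minimal PGD adversarial-training step (generic recipe; hyperparameters are
# conventional defaults, not the paper's).
import torch

def pgd(model, x, y, eps=8 / 255, alpha=2 / 255, steps=5):
    """Inner maximization: iterated FGSM projected back onto the eps-ball."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        torch.nn.functional.cross_entropy(model(x_adv), y).backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # project to the eps-ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

def adversarial_train_step(model, optimizer, x, y):
    """Outer minimization: ordinary training on the adversarial batch."""
    model.train()
    x_adv = pgd(model, x, y)
    optimizer.zero_grad()                  # clear grads accumulated during PGD
    loss = torch.nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a linear classifier standing in for a ViT.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.rand(16, 3, 32, 32), torch.randint(0, 10, (16,))
print(f"adversarial loss: {adversarial_train_step(model, opt, x, y):.3f}")
```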



arXiv:2103.14586

Understanding Robustness of Transformers for Image Classification

★★★★★

Authors: Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li, Thomas Unterthiner, Andreas Veit

Abstract: Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture -- such as the use of non-overlapping patches -- lead one to wonder whether these networks are as robust. In this paper, we perform an extensive study of a variety of different measures of robustness of ViT models and compare the findings to ResNet baselines. We investigate robustness to input perturbations as well as robustness to model perturbations. We find that when pre-trained with a sufficient amount of data, ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations. We also find that Transformers are robust to the removal of almost any single layer, and that while activations from later layers are highly correlated with each other, they nevertheless play an important role in classification.

Submitted 8 October, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

Comments: Accepted for publication at ICCV 2021. Rewrote Section 5 and made other minor changes throughout

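The layer-removal finding can be reproduced in miniature: run an encoder with one block skipped at a time and check how much the predictions change. The toy encoder below stands in for ViT; the actual study measures ImageNet accuracy on pretrained models.

```python
# Miniature layer-ablation ("lesion") study: skip one encoder block at a time
# and compare predictions to the full model. Toy stand-in for the ViT study.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, dim=64, depth=6, heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        )
        self.head = nn.Linear(dim, 10)

    def forward(self, x, skip=None):
        for i, block in enumerate(self.blocks):
            if i != skip:                   # lesion: drop block `skip` entirely
                x = block(x)
        return self.head(x.mean(dim=1))    # mean-pool tokens, then classify

model = TinyEncoder().eval()               # eval() makes dropout deterministic
tokens = torch.randn(4, 16, 64)            # batch of 16-token sequences
with torch.no_grad():
    base = model(tokens).argmax(dim=1)
    for i in range(6):
        pred = model(tokens, skip=i).argmax(dim=1)
        print(f"skip block {i}: {(pred == base).float().mean():.2f} of predictions unchanged")
```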
