A Unified Sequence Interface for Vision Tasks
15 Jun 2022
https://arxiv.org/abs/2206.07669
Authors: Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J. Fleet, Geoffrey Hinton
The first three authors contributed equally
Abstract: While language tasks are naturally expressed in a single, unified modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Despite this diversity, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as a task description, and the sequence output adapts to the prompt so that it produces task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models.
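To make the shared pixel-to-sequence interface concrete, here is a minimal Python sketch of how a detection target could be serialized into a single discrete-token vocabulary, following the coordinate-quantization scheme of the earlier Pix2Seq work. The bin count, token-id layout, and helper names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code) of the shared pixel-to-sequence
# interface: task outputs become discrete tokens in one vocabulary.
# NUM_BINS, the token-id layout, and the prompt/EOS ids are assumptions.

NUM_BINS = 1000          # quantization bins for continuous coordinates
COORD_OFFSET = 0         # token ids [0, NUM_BINS) encode coordinates
CLASS_OFFSET = NUM_BINS  # class-label tokens follow the coordinate tokens
DETECT_PROMPT = 2100     # hypothetical "detect objects" task-prompt token
EOS = 2101               # hypothetical end-of-sequence token

def quantize(value: float, size: float) -> int:
    """Map a continuous coordinate in [0, size] to a discrete token id."""
    bin_idx = int(value / size * (NUM_BINS - 1))
    return COORD_OFFSET + max(0, min(NUM_BINS - 1, bin_idx))

def box_to_tokens(box, class_id, img_h, img_w):
    """Encode one (ymin, xmin, ymax, xmax) box plus its class as 5 tokens."""
    ymin, xmin, ymax, xmax = box
    return [
        quantize(ymin, img_h), quantize(xmin, img_w),
        quantize(ymax, img_h), quantize(xmax, img_w),
        CLASS_OFFSET + class_id,
    ]

def detection_sequence(boxes, class_ids, img_h, img_w):
    """Prompt token + per-instance tokens + EOS: one target sequence."""
    seq = [DETECT_PROMPT]
    for box, cls in zip(boxes, class_ids):
        seq.extend(box_to_tokens(box, cls, img_h, img_w))
    seq.append(EOS)
    return seq

# Example: one 480x640 image with two annotated objects.
tokens = detection_sequence(
    boxes=[(10, 20, 200, 300), (50, 60, 120, 180)],
    class_ids=[3, 7], img_h=480, img_w=640,
)
print(tokens)  # [2100, 20, 31, 416, 468, 1003, 104, 93, 249, 280, 1007, 2101]
```

Under this interface, instance masks, keypoints, and captions would be serialized into the same vocabulary in the same way, so a single autoregressive decoder trained with an ordinary maximum-likelihood (cross-entropy) loss over tokens can cover all four tasks; the leading prompt token is what tells the decoder which output format to produce.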