When people encounter an object of an unknown category, they often recognize it by combining familiar, fine-grained attributes (e.g., color, shape) to make sense of it.
Attributes can serve as bridges that connect unknown categories to our existing knowledge.
Inspired by this principle, can we let soft prompts learn attribute-related representations to improve the model's generalization ability?
Of course! 😎
We present ATPrompt, an attribute-embedded prompt learning method for VLMs.
In this work, we propose to embed multiple fixed attribute tokens into the set of soft tokens, transforming the original form (Tab. 2(a)) into an attribute-class mixed form (Tab. 2(b)) for prompt learning.
By embedding multiple fixed, universal attribute tokens into the learnable soft prompts, our method extends the learning space of soft prompts from the original one-dimensional category level (Tab. 3(a)) to the multi-dimensional attribute level (Tab. 3(b)).
Guided by these attributes, soft prompts acquire not only category-specific but also attribute-related general representations during training, thereby enhancing the alignment between images and unknown categories compared to the original method.
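To make the two forms concrete, below is a minimal PyTorch-style sketch, assuming a CoOp-style setup where learnable soft tokens are concatenated with frozen word embeddings before the CLIP text encoder. Names such as `HybridPrompt`, `n_soft`, and the exact interleaving order are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the official code): building the token sequence fed to
# the CLIP text encoder. `attribute_embeds` are frozen word embeddings of the
# chosen attributes (e.g., "color", "shape"); `class_embeds[c]` are the frozen
# embeddings of the c-th class name.
class HybridPrompt(nn.Module):
    def __init__(self, dim, n_soft, attribute_embeds, class_embeds):
        super().__init__()
        # One group of learnable soft tokens per attribute, plus one for the class.
        self.soft_attr = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(n_soft, dim)) for _ in attribute_embeds]
        )
        self.soft_cls = nn.Parameter(0.02 * torch.randn(n_soft, dim))
        self.attr_embeds = [a.detach() for a in attribute_embeds]  # fixed attribute tokens
        self.class_embeds = class_embeds                           # fixed class-name tokens

    def forward(self, class_idx):
        # Attribute-category hybrid form: [soft][attr_1] ... [soft][attr_K] [soft][CLASS]
        parts = []
        for soft, attr in zip(self.soft_attr, self.attr_embeds):
            parts += [soft, attr]
        parts += [self.soft_cls, self.class_embeds[class_idx]]
        return torch.cat(parts, dim=0)

# The original category-centric form would simply be
#   torch.cat([soft_cls, class_embeds[class_idx]], dim=0)
```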
Step 1: Form an attribute pool. We first query a large language model (LLM) step by step to obtain multiple independent attributes, then combine them into an attribute pool for the subsequent search.
Step 2: Differentiable attribute search. For each attribute base in the pool, we propose an alternating algorithm that jointly optimizes the soft tokens and the path weights. After training, the attribute base with the highest weight (confidence) is selected for targeted model training.
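The snippet below is a hedged sketch of how such an alternating, DARTS-style search could look. `candidate_logits(i, x)`, the data loaders, and the hyperparameters are placeholders introduced for illustration; they are not the paper's actual API.

```python
import torch
import torch.nn.functional as F

# Sketch of the alternating optimization: soft tokens are updated on training
# batches with the path weights frozen, then the path weights are updated on
# validation batches while only `alpha` is stepped. `candidate_logits(i, x)` is
# a hypothetical helper returning classification logits when the prompt uses
# the i-th attribute base from the pool.
alpha = torch.zeros(num_candidates, requires_grad=True)        # path weights
opt_prompt = torch.optim.SGD(soft_prompt_params, lr=2e-3)      # soft tokens
opt_alpha = torch.optim.Adam([alpha], lr=5e-3)

for (x_t, y_t), (x_v, y_v) in zip(train_loader, val_loader):
    # (1) update soft tokens; path weights are detached (frozen)
    w = F.softmax(alpha.detach(), dim=0)
    loss_t = sum(w[i] * F.cross_entropy(candidate_logits(i, x_t), y_t)
                 for i in range(num_candidates))
    opt_prompt.zero_grad(); loss_t.backward(); opt_prompt.step()

    # (2) update path weights; only `alpha` receives an optimizer step
    w = F.softmax(alpha, dim=0)
    loss_v = sum(w[i] * F.cross_entropy(candidate_logits(i, x_v), y_v)
                 for i in range(num_candidates))
    opt_alpha.zero_grad(); loss_v.backward(); opt_alpha.step()

# Keep the attribute base with the highest weight (confidence) for final training.
best = int(F.softmax(alpha, dim=0).argmax())
```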
(1). We introduce an attribute-embedded prompt learning method that expands the learning space of soft prompts from the original one-dimensional category level to the multi-dimensional attribute level by incorporating multiple universal attribute tokens into the soft prompts.
(2). We introduce a differentiable attribute search method that learns to determine the appropriate attribute content and quantity for the dataset.
(3). Both shallow and deep versions of ATPrompt are introduced to achieve compatibility with existing methods.
(4). ATPrompt can be seamlessly integrated into existing textual-based methods and brings general improvements at a negligible computational cost.
Textual-based prompt learning methods primarily employ multiple learnable soft prompts and hard class tokens in a cascading manner as text prompt inputs, aiming to align image and text (category) spaces for downstream tasks. However, current training is restricted to aligning images with predefined known categories, so the learned prompts cannot be associated with unknown categories.
In this work, we propose utilizing universal attributes as a bridge to enhance the alignment between images and unknown categories. Specifically, we introduce an Attribute-embedded Textual Prompt learning method for vision-language models, named ATPrompt. This approach expands the learning space of soft prompts from the original one-dimensional category level into the multi-dimensional attribute level by incorporating multiple universal attribute tokens into the learnable soft prompts. Through this modification, we transform the text prompt from a category-centric form to an attribute-category hybrid form.
To finalize the attributes for downstream tasks, we propose a differentiable attribute search method that learns to identify representative and suitable attributes from a candidate pool summarized by a large language model.
As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing prompt format of textual-based methods, offering general improvements at a negligible computational cost. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.
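As a usage illustration, the sketch below shows how text features built from such prompts plug into a standard CLIP-style zero-shot classifier; on the text side, only the prompt builder changes relative to a method like CoOp. The function and variable names here are placeholders, not ATPrompt's actual interface.

```python
import torch
import torch.nn.functional as F

# Standard CLIP-style matching; swapping the category-centric prompt builder
# for an attribute-embedded one is the only change needed on the text side.
def classify(images, prompt_builder, text_encoder, image_encoder,
             n_classes, logit_scale):
    text_feats = torch.stack([text_encoder(prompt_builder(c))
                              for c in range(n_classes)])     # [n_classes, dim]
    img_feats = image_encoder(images)                         # [batch, dim]
    text_feats = F.normalize(text_feats, dim=-1)
    img_feats = F.normalize(img_feats, dim=-1)
    return logit_scale * img_feats @ text_feats.t()           # cosine-similarity logits
```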
Here we explore the effectiveness of attributes derived through alternative methods, specifically by manually selecting class-irrelevant and common attributes.
The results indicate that manually selected irrelevant attributes exhibit comparable performance during training; however, they perform poorly when applied to new categories. This suggests that incorrect attribute tokens cause the soft tokens to develop biased representations, thereby diminishing their zero-shot generalization ability.
In this study, we do not specifically focus on the order of attributes in ATPrompt, because varying the order usually does not change the meaning in practice. For example, phrases like “a yellow round leaf” and “a round yellow leaf” convey the same meaning.
From this table, we observe that despite variations in order, similar results are consistently produced, and the performance fluctuations across different orders remain within a reasonable range.
In ATPrompt-Deep, we drop only the class soft tokens after each block, while retaining both the hard and soft attribute tokens. In the following table, we compare the performance of the partial drop (i.e., additionally removing the attribute soft tokens while retaining the hard tokens) and the full drop (i.e., removing both the attribute soft and hard tokens).
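For concreteness, below is a heavily hedged sketch of the three variants, assuming a VPT-Deep/MaPLe-style design in which selected token outputs are replaced by fresh layer-wise learnable tokens after each transformer block; all function names and index tensors are illustrative assumptions rather than the actual implementation.

```python
import torch

# hidden: [batch, seq_len, dim] output of one text-encoder block.
# idx_*: positions of the class soft tokens, attribute soft tokens, and
# attribute hard tokens in the sequence; new_*: layer-wise learnable tokens.
def apply_drop(hidden, new_cls_soft, new_attr_soft, attr_hard_embed,
               idx_cls_soft, idx_attr_soft, idx_attr_hard, mode="default"):
    hidden = hidden.clone()
    # ATPrompt-Deep default: only the class soft tokens are dropped and replaced.
    hidden[:, idx_cls_soft] = new_cls_soft
    if mode in ("partial", "full"):
        # Partial drop: additionally drop the attribute soft tokens.
        hidden[:, idx_attr_soft] = new_attr_soft
    if mode == "full":
        # Full drop: also reset the attribute hard tokens to their frozen
        # word embeddings instead of keeping their block outputs.
        hidden[:, idx_attr_hard] = attr_hard_embed
    return hidden
```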
1. If you are interested in prompt learning and want to know more about related work, we also maintain [a curated list of awesome prompt/adapter learning methods for VLMs] for your reference.
2. In October 2024, I was invited by Jiangmen (将门) to give a talk about prompt learning methods. In this video [Link], I introduce the motivation, principles, and related work of prompt learning in detail.
If you understand Chinese, this video may be a good resource to help you quickly get up to speed on the field of prompt learning.
3. Before this work, I published a prompt learning paper at CVPR 2024 called PromptKD. In this [project], I open-sourced the complete code and wrote a detailed paper interpretation in Chinese.
This interpretation is also good learning material for your reference.
4. If you have any questions, please feel free to submit an issue on GitHub, or contact me by email (zhengli97[at]qq.com).
If you find our work helpful for your research, please consider citing our paper.