When people encounter an object of an unknown category, they often recognize it by combining familiar, fine-grained attributes (e.g., color, shape) to make sense of it.
Attributes can serve as bridges that connect unknown categories to our existing knowledge.
Inspired by this principle, can we let soft prompts learn attribute-related representations to improve the model's generalization ability?
Of course! 😎
We present ATPrompt, an attribute-embedded prompt learning method for VLMs.
In this work, we propose to embed multiple fixed attribute tokens into the set of soft tokens, transforming the original form (Tab. 2(a)) into an attribute-class mixed form (Tab. 2(b)) for prompt learning.
By embedding multiple fixed, universal attribute tokens into the learnable soft prompts, our method extends the learning space of soft prompts from the original one-dimensional category level (Tab. 3(a)) to the multi-dimensional attribute level (Tab. 3(b)).
Guided by these attributes, soft prompts acquire not only category-specific but also attribute-related general representations during training, thereby enhancing the alignment between images and unknown categories compared to the original method.
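To make the two forms concrete, below is a minimal PyTorch-style sketch, assuming a CoOp-style setup where learnable soft tokens are concatenated with frozen word embeddings before the CLIP text encoder. Names such as `HybridPrompt`, `n_soft`, and the exact interleaving order are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the official code): building the token sequence fed to
# the CLIP text encoder. `attribute_embeds` are frozen word embeddings of the
# chosen attributes (e.g., "color", "shape"); `class_embeds[c]` are the frozen
# embeddings of the c-th class name.
class HybridPrompt(nn.Module):
    def __init__(self, dim, n_soft, attribute_embeds, class_embeds):
        super().__init__()
        # One group of learnable soft tokens per attribute, plus one for the class.
        self.soft_attr = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(n_soft, dim)) for _ in attribute_embeds]
        )
        self.soft_cls = nn.Parameter(0.02 * torch.randn(n_soft, dim))
        self.attr_embeds = [a.detach() for a in attribute_embeds]  # fixed attribute tokens
        self.class_embeds = class_embeds                           # fixed class-name tokens

    def forward(self, class_idx):
        # Attribute-category hybrid form: [soft][attr_1] ... [soft][attr_K] [soft][CLASS]
        parts = []
        for soft, attr in zip(self.soft_attr, self.attr_embeds):
            parts += [soft, attr]
        parts += [self.soft_cls, self.class_embeds[class_idx]]
        return torch.cat(parts, dim=0)

# The original category-centric form would simply be
#   torch.cat([soft_cls, class_embeds[class_idx]], dim=0)
```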
Step 1: Form an attribute pool. We first query a large language model (LLM) step by step to obtain multiple independent attributes, then combine them into an attribute pool for the subsequent search.
Step 2: Differentiable attribute search. For each attribute base in the pool, we propose an alternating algorithm that jointly optimizes the soft tokens and the path weights. After training, the attribute base with the highest weight (confidence) is selected for targeted model training.
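The snippet below is a hedged sketch of how such an alternating, DARTS-style search could look. `candidate_logits(i, x)`, the data loaders, and the hyperparameters are placeholders introduced for illustration; they are not the paper's actual API.

```python
import torch
import torch.nn.functional as F

# Sketch of the alternating optimization: soft tokens are updated on training
# batches with the path weights frozen, then the path weights are updated on
# validation batches while only `alpha` is stepped. `candidate_logits(i, x)` is
# a hypothetical helper returning classification logits when the prompt uses
# the i-th attribute base from the pool.
alpha = torch.zeros(num_candidates, requires_grad=True)        # path weights
opt_prompt = torch.optim.SGD(soft_prompt_params, lr=2e-3)      # soft tokens
opt_alpha = torch.optim.Adam([alpha], lr=5e-3)

for (x_t, y_t), (x_v, y_v) in zip(train_loader, val_loader):
    # (1) update soft tokens; path weights are detached (frozen)
    w = F.softmax(alpha.detach(), dim=0)
    loss_t = sum(w[i] * F.cross_entropy(candidate_logits(i, x_t), y_t)
                 for i in range(num_candidates))
    opt_prompt.zero_grad(); loss_t.backward(); opt_prompt.step()

    # (2) update path weights; only `alpha` receives an optimizer step
    w = F.softmax(alpha, dim=0)
    loss_v = sum(w[i] * F.cross_entropy(candidate_logits(i, x_v), y_v)
                 for i in range(num_candidates))
    opt_alpha.zero_grad(); loss_v.backward(); opt_alpha.step()

# Keep the attribute base with the highest weight (confidence) for final training.
best = int(F.softmax(alpha, dim=0).argmax())
```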
(1). We introduce an attribute-embedded prompt learning method that expands the learning space of soft prompts from the original one-dimensional category level to the multi-dimensional attribute level by incorporating multiple universal attribute tokens into the soft prompts.
(2). We introduce a differentiable attribute search method that learns to determine the appropriate attribute content and quantity for the dataset.
(3). Both shallow and deep versions of ATPrompt are introduced to achieve compatibility with existing methods.
(4). ATPrompt can be seamlessly integrated into existing textual-based methods and brings general improvements at a negligible computational cost.
Textual-based prompt learning methods primarily employ multiple learnable soft prompts and hard class tokens in a cascading manner as text prompt inputs, aiming to align image and text (category) spaces for downstream tasks. However, current training is restricted to aligning images with predefined known categories, so the learned prompts cannot be associated with unknown categories.
In this work, we propose utilizing universal attributes as a bridge to enhance the alignment between images and unknown categories. Specifically, we introduce an Attribute-embedded Textual Prompt learning method for vision-language models, named ATPrompt. This approach expands the learning space of soft prompts from the original one-dimensional category level into the multi-dimensional attribute level by incorporating multiple universal attribute tokens into the learnable soft prompts. Through this modification, we transform the text prompt from a category-centric form to an attribute-category hybrid form.
To finalize the attributes for downstream tasks, we propose a differentiable attribute search method that learns to identify representative and suitable attributes from a candidate pool summarized by a large language model.
As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing prompt format of textual-based methods, offering general improvements at a negligible computational cost. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.
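As a usage illustration, the sketch below shows how text features built from such prompts plug into a standard CLIP-style zero-shot classifier; on the text side, only the prompt builder changes relative to a method like CoOp. The function and variable names here are placeholders, not ATPrompt's actual interface.

```python
import torch
import torch.nn.functional as F

# Standard CLIP-style matching; swapping the category-centric prompt builder
# for an attribute-embedded one is the only change needed on the text side.
def classify(images, prompt_builder, text_encoder, image_encoder,
             n_classes, logit_scale):
    text_feats = torch.stack([text_encoder(prompt_builder(c))
                              for c in range(n_classes)])     # [n_classes, dim]
    img_feats = image_encoder(images)                         # [batch, dim]
    text_feats = F.normalize(text_feats, dim=-1)
    img_feats = F.normalize(img_feats, dim=-1)
    return logit_scale * img_feats @ text_feats.t()           # cosine-similarity logits
```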
Here we explore the effectiveness of attributes derived through alternative methods, specifically by manually selecting class-irrelevant and common attributes.
The results indicate that manually selected irrelevant attributes exhibit comparable performance during training; however, they perform poorly when applied to new categories. This suggests that incorrect attribute tokens cause the soft tokens to develop biased representations, thereby diminishing their zero-shot generalization ability.
In this study, we do not specifically focus on the order of attributes in ATPrompt, because varying the order usually does not change the meaning in practice. For example, phrases like “a yellow round leaf” and “a round yellow leaf” convey the same meaning.
From this table, we observe that despite variations in order, similar results are consistently produced, and the performance fluctuations across different orders remain within a reasonable range.
In ATPrompt-Deep, we drop only the class soft tokens after each block, while retaining both the hard and soft attribute tokens. In the following table, we compare the performance of the partial drop (i.e., additionally removing the attribute soft tokens while retaining the hard tokens) and the full drop (i.e., removing both the attribute soft and hard tokens).
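For concreteness, below is a heavily hedged sketch of the three variants, assuming a VPT-Deep/MaPLe-style design in which selected token outputs are replaced by fresh layer-wise learnable tokens after each transformer block; all function names and index tensors are illustrative assumptions rather than the actual implementation.

```python
import torch

# hidden: [batch, seq_len, dim] output of one text-encoder block.
# idx_*: positions of the class soft tokens, attribute soft tokens, and
# attribute hard tokens in the sequence; new_*: layer-wise learnable tokens.
def apply_drop(hidden, new_cls_soft, new_attr_soft, attr_hard_embed,
               idx_cls_soft, idx_attr_soft, idx_attr_hard, mode="default"):
    hidden = hidden.clone()
    # ATPrompt-Deep default: only the class soft tokens are dropped and replaced.
    hidden[:, idx_cls_soft] = new_cls_soft
    if mode in ("partial", "full"):
        # Partial drop: additionally drop the attribute soft tokens.
        hidden[:, idx_attr_soft] = new_attr_soft
    if mode == "full":
        # Full drop: also reset the attribute hard tokens to their frozen
        # word embeddings instead of keeping their block outputs.
        hidden[:, idx_attr_hard] = attr_hard_embed
    return hidden
```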
1. If you are interested in prompt learning and want to know more about related work, we also maintain [a curated list of awesome prompt/adapter learning methods for VLMs] for your reference.
2. In October 2024, I was invited by Jiangmen (将门) to give a talk about prompt learning methods. In this video [Link], I introduce the motivation, principles, and related work of prompt learning in detail.
If you understand Chinese, this video may be a good resource to help you quickly get up to speed on the field of prompt learning.
3. Before this work, I published a prompt learning paper at CVPR 2024 called PromptKD. In this [project], I open-sourced the complete code and wrote a detailed paper interpretation in Chinese.
This interpretation is also good learning material for your reference.
4. If you have any questions, please feel free to submit an issue on GitHub, or contact me by email (zhengli97[at]qq.com).
If you find our work helpful for your research, please consider citing our paper.