ATPrompt: Textual Prompt Learning with Embedded Attributes

1 VCIP, College of Computer Science, Nankai University
2 DAMO Academy, Alibaba Group
arXiv:2412.09442
*Indicates Corresponding Author

zhengli97[at]mail.nankai.edu.com

Contributions

(1). We introduce an attribute-templated prompt learning method that expands the learning space of soft prompts from the original one-dimensional category level into the multi-dimensional attribute level by incorporating multiple universal attribute tokens into soft prompts.

(2). We introduce a differentiable attribute search method that learns to determine the appropriate attribute content and quantity for the dataset.

(3). Both shallow and deep versions of ATPrompt are introduced to achieve compatibility with existing methods.

(4). ATPrompt can be seamlessly integrated into existing textual-based methods and brings general improvement at a negligible computational cost.

Abstract

Textual-based prompt learning methods primarily employ multiple learnable soft prompts and hard class tokens in a cascading manner as text prompt inputs, aiming to align image and text (category) spaces for downstream tasks. However, current training is restricted to aligning images with predefined known categories and cannot be associated with unknown categories.

In this work, we propose utilizing universal attributes as a bridge to enhance the alignment between images and unknown categories. Specifically, we introduce an Attribute-embedded Textual Prompt learning method for vision-language models, named ATPrompt. This approach expands the learning space of soft prompts from the original one-dimensional category level into the multi-dimensional attribute level by incorporating multiple universal attribute tokens into the learnable soft prompts. Through this modification, we transform the text prompt from a category-centric form to an attribute-category hybrid form.
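The attribute-category hybrid form can be pictured as a token layout. The sketch below is only an illustration: the helper name `build_atprompt`, the placeholder strings for soft tokens, the number of soft tokens per group, and the exact ordering (attribute groups first, class group last) are assumptions made for this example rather than details taken from the text above.

```python
from typing import List


def build_atprompt(attributes: List[str], class_name: str,
                   n_attr_soft: int = 2, n_class_soft: int = 2) -> List[str]:
    """Return a hypothetical token layout for an attribute-category hybrid prompt."""
    tokens: List[str] = []
    for i, attr in enumerate(attributes, start=1):
        # Placeholders for the learnable soft tokens attached to the i-th attribute.
        tokens += [f"[soft_a{i}_{j}]" for j in range(1, n_attr_soft + 1)]
        # The fixed (hard) attribute token itself.
        tokens.append(attr)
    # Placeholders for the class-level soft tokens, then the hard class token.
    tokens += [f"[soft_c_{j}]" for j in range(1, n_class_soft + 1)]
    tokens.append(class_name)
    return tokens


# Category-centric form: no attributes, only class soft tokens and the class name.
category_only = build_atprompt([], "leaf")
# Hybrid form: two universal attributes embedded in front of the class part.
hybrid = build_atprompt(["color", "shape"], "leaf")
print(" ".join(category_only))
print(" ".join(hybrid))
```

With an empty attribute list the function falls back to the category-centric layout, which is one way to see the hybrid form as an extension of the existing prompt format rather than a replacement for it.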

To finalize the attributes for downstream tasks, we propose a differentiable attribute search method that learns to identify representative and suitable attributes from a candidate pool summarized by a large language model.
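To give a feel for what "differentiable" selection over a candidate pool can look like, here is a toy sketch. It uses a softmax relaxation over one learnable weight per candidate attribute combination, which is a common way to make a discrete choice differentiable; whether this matches the exact formulation in the paper is an assumption. The candidate list, the frozen per-candidate losses, and the function name `search_attributes` are all made up for illustration. In a real search the losses would come from training prompts with each candidate.

```python
import math
from typing import List, Sequence, Tuple


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def search_attributes(candidates: Sequence[Tuple[str, ...]],
                      losses: Sequence[float],
                      steps: int = 200, lr: float = 0.5):
    """Gradient descent on selection weights; returns (best candidate, weights)."""
    alpha = [0.0] * len(candidates)  # one learnable weight per candidate
    for _ in range(steps):
        p = softmax(alpha)
        # Relaxed objective: softmax-weighted sum of the per-candidate losses.
        expected = sum(pi * li for pi, li in zip(p, losses))
        # Analytic gradient of the objective with respect to each alpha_j.
        grad = [pi * (li - expected) for pi, li in zip(p, losses)]
        alpha = [a - lr * g for a, g in zip(alpha, grad)]
    weights = softmax(alpha)
    best = max(range(len(candidates)), key=lambda i: weights[i])
    return candidates[best], weights


# Hypothetical candidate pool and toy losses (not results from the paper).
pool = [("color",), ("shape",), ("color", "shape")]
toy_losses = [1.2, 1.5, 0.9]
best, weights = search_attributes(pool, toy_losses)
print(best)
```

Because candidates may contain one or several attributes, picking the highest-weighted candidate decides both the attribute content and the attribute quantity in a single step.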

As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing prompt format of textual-based methods, offering general improvements at a negligible computational cost. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.

A Quick Overview of Experimental Results

Base-to-Novel Generalization
Table 1. Base-to-novel generalization experiments of five baselines with and without our ATPrompt on 11 recognition datasets. HM: Harmonic Mean. ∆: HM improvement of ATPrompt over previous results. “ATPrompt” is abbreviated as “ATP”. Our method achieves consistent average performance improvement over different baselines.
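For readers unfamiliar with the HM and ∆ columns, the short sketch below shows how they can be computed from base and novel accuracy. The accuracy values are hypothetical numbers chosen for the example and are not taken from Table 1.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy."""
    if base_acc + novel_acc == 0:
        return 0.0  # avoid division by zero when both accuracies are 0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Hypothetical accuracies (in percent) for a baseline with and without ATPrompt.
hm_baseline = harmonic_mean(80.0, 60.0)
hm_with_atp = harmonic_mean(80.0, 64.0)
delta = hm_with_atp - hm_baseline  # the ∆ column: HM improvement
print(round(hm_baseline, 2), round(hm_with_atp, 2), round(delta, 2))
```

The harmonic mean is dominated by the lower of the two accuracies, so a method cannot score well on HM by improving base classes alone.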

Cross Dataset Experiments
Table 2. Cross-dataset generalization experiments of three baselines with and without our ATPrompt on 11 datasets. Our method achieves consistent average performance improvements over three baseline methods.

Domain Generalization
Table 3. Domain generalization experiments of three baselines with and without our ATPrompt on 4 datasets. Our method achieves consistent average performance improvement over three baseline methods.

Comparison to Other Attributes

Here we explore the effectiveness of attributes derived through alternative methods, specifically by manually selecting class-irrelevant and common attributes.

Table 4. Comparison of different attributes on Food101. The attributes obtained by our method achieve the best performance.

The results indicate that manually selected irrelevant attributes perform comparably during training but poorly when applied to new categories. This suggests that incorrect attribute tokens cause the soft tokens to develop biased representations, thereby diminishing their zero-shot generalization ability.


Attribute Order

In this study, we do not specifically focus on the order of attributes in ATPrompt, because reordering attributes usually does not change the meaning of a phrase. For example, “a yellow round leaf” and “a round yellow leaf” convey the same meaning.

Table 5. Comparison of different attribute orders on ImageNet. The order of attributes does not significantly affect the model, and performance fluctuations are within a reasonable range.

From this table, we observe that different attribute orders produce similar results, with performance fluctuations that remain within a reasonable range.


Prompt Operation of Deep Version

In ATPrompt-Deep, we drop only the class soft tokens after they pass through a block, while retaining both the hard and soft attribute tokens. In the following table, we compare this design with two alternatives: partial drop (i.e., removing attribute soft tokens while retaining hard tokens) and full drop (i.e., removing both attribute soft and hard tokens).

Table 6. Comparison of operations on deep soft and hard attribute tokens based on MaPLe+ATPrompt. Preserving hard and soft attribute tokens in deep layers performs better than other operations.
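The three operations compared above differ only in which token groups survive a block. The sketch below tracks that bookkeeping and nothing else. The token tags, the mode names, and the function `split_after_block` are hypothetical. It also assumes that class soft tokens are dropped in all three modes, and it does not model what replaces the dropped tokens in the next layer.

```python
from typing import List, Tuple

# A token is (kind, name); kind is one of:
# "attr_soft", "attr_hard", "class_soft", "class_hard".
Token = Tuple[str, str]


def split_after_block(tokens: List[Token], mode: str = "keep_attributes"):
    """Return (kept, dropped) token lists after one block under a given policy."""
    if mode not in ("keep_attributes", "partial_drop", "full_drop"):
        raise ValueError(f"unknown mode: {mode}")
    drop_kinds = {"class_soft"}  # assumed to be dropped in every mode
    if mode in ("partial_drop", "full_drop"):
        drop_kinds.add("attr_soft")
    if mode == "full_drop":
        drop_kinds.add("attr_hard")
    kept = [t for t in tokens if t[0] not in drop_kinds]
    dropped = [t for t in tokens if t[0] in drop_kinds]
    return kept, dropped


prompt: List[Token] = [
    ("attr_soft", "a1"), ("attr_hard", "color"),
    ("class_soft", "c1"), ("class_hard", "leaf"),
]
for mode in ("keep_attributes", "partial_drop", "full_drop"):
    kept, dropped = split_after_block(prompt, mode)
    print(mode, [name for _, name in kept])
```

Under these assumptions, the default mode is the only one in which the learned attribute soft tokens carry forward into deeper layers.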

Attribute Bases and Searched Results
Table 7. Attribute bases and searched results for each dataset.

Other Useful Materials

1. If you are interested in prompt learning and want to know more about related work, we also maintain [a curated list of awesome prompt/adapter learning methods for VLMs] for your reference.

2. In October 2024, I was invited by Jiangmen (将门) to give a talk about prompt learning methods. In this video [Link], I introduce the motivation, principles, and related work of prompt learning in detail. If you understand Chinese, this video may help you quickly get up to speed on the field of prompt learning.

3. Before this work, I published a paper on prompt learning at CVPR 2024 called PromptKD. In this [project], I open-sourced the complete code and wrote a detailed interpretation of the paper in Chinese, which is also good learning material for your reference.

4. If you have any questions, please feel free to submit an issue on GitHub, or contact me by email (zhengli97[at]qq.com).

BibTeX

If you find our paper helpful for your research, please consider citing it.


@article{li2024atprompt,
  title={ATPrompt: Textual Prompt Learning with Embedded Attributes},
  author={Li, Zheng and Song, Yibing and Zhao, Penghai and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
  journal={arXiv preprint arXiv:2412.09442},
  year={2024}
}