Wang, Zhixiang and Liu, Junzhuo and Zhang, Ye and Chen, Qiaosong and Qian, Linxue (2026) A cross modal fine-grained retrieval method based on LAGC and contrastive learning. EXPERT SYSTEMS WITH APPLICATIONS, 297 (Part C): 129451. ISSN 0957-4174, 1873-6793
Full text not available from this repository.

Abstract
The goal of cross-modal retrieval is to let users submit a sample in any modality as a query and have the system return relevant samples from other modalities. While recent research has begun exploring fine-grained cross-modal retrieval, most existing methods remain constrained to limited modality pairs (e.g., image-text) and do not fully address the challenges of multi-modal fine-grained retrieval: the inherent heterogeneity and semantic gaps between multi-modal data, the scarcity of fine-grained datasets, the difficulty of measuring similarity between fine-grained data, and the minimal differences between sample features. These challenges are amplified in scenarios involving four or more modalities. Moreover, although local-global attention mechanisms have been widely adopted in previous studies, prior methods rarely exploit the complementarity between local and global cues in such multi-modality settings, and their heterogeneous encoder architectures further constrain the efficiency of feature fusion. To address these limitations, this paper first employs a unified encoder architecture for all modalities and proposes the Local and Global Cross (LAGC) attention module for more effective feature extraction from fine-grained samples. LAGC uses local and global cross-attention to fully exploit the local and global features of fine-grained samples, extract fine-grained semantic information in their high-dimensional vector space, and greatly strengthen modality-specific representations during single-modal feature extraction. In addition, to integrate features across modalities and achieve more effective feature interaction, the Multi Modal Cross (MMC) module is specifically designed.
This module links the unique information of each modality with the commonalities shared across modalities, thereby enhancing the model's expressive ability in the common space. During training, the fused features are semantically aligned following the idea of contrastive learning, greatly improving the cross-modal retrieval of fine-grained samples. Finally, extensive experiments and ablation studies demonstrate that the proposed method achieves competitive results on the public datasets PKU FG-XMedia, NUS-WIDE, and MIRFlickr-25K.
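The abstract does not give the LAGC module's equations, but the general pattern it names — local tokens and a pooled global cue refining each other via cross-attention — can be sketched. The following is a minimal, projection-free toy (function names `cross_attention` and `local_global_fuse`, and the averaging fusion, are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    """Single-head, projection-free cross-attention: rows of q attend over rows of kv."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ kv

def local_global_fuse(local_feats, global_feat):
    """Toy local/global cross fusion (illustrative only).

    local_feats: (n, d) region/patch features of one sample.
    global_feat: (d,) pooled global feature of the same sample.
    The global cue queries the local tokens, each local token queries the
    (single) global token, and the two views are averaged into one vector.
    """
    g2l = cross_attention(global_feat[None, :], local_feats)  # (1, d)
    l2g = cross_attention(local_feats, global_feat[None, :])  # (n, d)
    return 0.5 * (g2l + l2g.mean(axis=0, keepdims=True))      # (1, d)
```

With a single global token the local-to-global direction reduces to broadcasting the global feature; a real implementation would add learned query/key/value projections and multiple heads so both directions are informative.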
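The contrastive alignment step described above is not specified in the abstract; a common choice for aligning paired features from two modalities is a symmetric InfoNCE objective, sketched below (the loss form, temperature value, and function names are assumptions, not the paper's actual objective):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def symmetric_info_nce(feats_a, feats_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired features.

    feats_a, feats_b: (batch, dim) arrays; row i of each is a matched pair.
    The loss is low when each matched pair is more similar than all
    mismatched pairs in the batch, in both retrieval directions.
    """
    a = l2_normalize(feats_a)
    b = l2_normalize(feats_b)
    logits = a @ b.T / temperature        # (batch, batch) cosine similarities
    labels = np.arange(len(a))            # diagonal entries are the positives

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()  # diagonal = matched pairs

    # average the a->b and b->a retrieval directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls matched cross-modal features together and pushes mismatched ones apart in the common space, which is the semantic-alignment effect the abstract attributes to its contrastive training stage.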
| Item Type: | Article |
|---|---|
| Uncontrolled Keywords: | REPRESENTATION; Cross-modal retrieval; Fine-grained retrieval; Attention mechanism; Contrastive learning; Semantic complementarity |
| Subjects: | 000 Computer science, information & general works > 004 Computer science |
| Divisions: | Informatics and Data Science |
| Depositing User: | Dr. Gernot Deinzer |
| Date Deposited: | 07 May 2026 06:46 |
| Last Modified: | 07 May 2026 06:46 |
| URI: | https://pred.uni-regensburg.de/id/eprint/67224 |