[2405.14715] Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models


Abstract: Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, once it is pretrained, it avoids using the old model during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model's knowledge by avoiding any modifications. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades when a new VLP model emerges.
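The core idea of the projection module can be illustrated with a minimal sketch: learn a mapping from the new model's embedding space into the old model's, so that projected new-model queries can be matched against an existing (un-backfilled) gallery of old-model embeddings. The sketch below uses synthetic embeddings and a closed-form linear least-squares fit purely for illustration; the paper's module, its architecture, and its text-only pretraining procedure are not reproduced here, and all dimensions and variable names are assumptions.

```python
import numpy as np

# Illustrative sketch (not the paper's method): align a "new" embedding
# space to an "old" one with a linear projection, so old embeddings
# need not be recomputed (backfill-free upgrade).

rng = np.random.default_rng(0)
d_new, d_old, n = 64, 32, 1000  # hypothetical embedding dims / sample count

# Synthetic stand-ins for embeddings of the same items under both models.
Z_new = rng.normal(size=(n, d_new))
W_true = rng.normal(size=(d_new, d_old))
Z_old = Z_new @ W_true + 0.01 * rng.normal(size=(n, d_old))

# Fit projection P minimizing ||Z_new @ P - Z_old||^2 (least squares).
P, *_ = np.linalg.lstsq(Z_new, Z_old, rcond=None)

# Projected new-model embeddings now live in the old model's space and
# can be compared directly against the existing old-model gallery.
Z_proj = Z_new @ P
rel_err = np.linalg.norm(Z_proj - Z_old) / np.linalg.norm(Z_old)
print(f"relative alignment error: {rel_err:.3f}")
```

In practice the paper trains such a module with text data alone and freezes the off-the-shelf new model via parameter-efficient training; the linear fit above only conveys the compatibility objective.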

Submission history

From: Young Kyun Jang [view email]
[v1]
Thu, 23 May 2024 15:46:35 UTC (2,961 KB)
[v2]
Sun, 29 Jun 2025 23:48:38 UTC (2,914 KB)
[v3]
Mon, 6 Oct 2025 06:02:41 UTC (2,914 KB)

