Vision-Language Models (VLMs) demonstrate remarkable capabilities in visual understanding and reasoning, such as in Visual Question Answering (VQA), where the model is asked a question related to a visual input. Still, these models can make distinctly unnatural errors, for example, providing (wrong) answers to unanswerable VQA questions, such as questions asking about objects that do not appear in the image.
To address this issue, we propose CLIP-UP: CLIP-based Unanswerable Problem detection, a novel lightweight method for equipping VLMs with the ability to withhold answers to unanswerable questions. CLIP-UP leverages CLIP-based similarity measures to extract question-image alignment information for detecting unanswerability, requiring efficient training of only a few additional layers while keeping the original VLMs' weights unchanged.
Tested across several models, CLIP-UP achieves significant improvements on benchmarks assessing unanswerability in both multiple-choice and open-ended VQA, surpassing other methods, while preserving original performance on other tasks.
CLIP-UP enhances pre-trained VLMs with the ability to detect and withhold answers to unanswerable multiple-choice and open-ended questions, while preserving the models' original capabilities on standard answerable questions.
In the example below, LLaVA-1.5-7B correctly answers a standard answerable multiple-choice question but fails to withhold an answer for three types of unanswerable multiple-choice questions. Similarly, for open-ended questions, LLaVA-1.5-7B correctly answers an answerable question but fails to abstain from answering an unanswerable one.
CLIP-UP enhances LLaVA-1.5-7B by making it withhold answers to unanswerable questions, while still correctly answering standard ones.
The core idea of CLIP-UP is to enhance a VLM's abstention ability by injecting an answerability prior into its decision process. CLIP-UP does so by leveraging correlation vectors derived from CLIP embeddings of the input image and question, which encode image-question alignment information. These vectors are projected into the VLM's intermediate feature space, producing a new embedding vector that serves as an answerability prior and is seamlessly integrated into the model.
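To make this concrete, the snippet below is a minimal sketch of how such a correlation vector can be computed: the CLIP image and text embeddings are multiplied element-wise. It uses the standard Hugging Face CLIP model as a stand-in for the Structure-CLIP encoder used in the paper, and the helper name, checkpoint, and normalization step are our own illustrative choices rather than the paper's exact recipe.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Standard CLIP as a stand-in for Structure-CLIP (an assumption for this sketch).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def correlation_vector(image: Image.Image, text: str) -> torch.Tensor:
    """Element-wise product of (normalized) CLIP image and text embeddings,
    encoding how well the text aligns with the image (illustrative helper)."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return img_emb * txt_emb  # shape (1, clip_dim), e.g. (1, 512)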
The figure below illustrates CLIP-UP applied to multiple-choice questions: given an image and a VQA prompt, the prompt is transformed into text segments merging the question with each answer option. These segments and the image are encoded by CLIP (specifically by Structure-CLIP) to produce embeddings, from which correlation vectors are formed via element-wise multiplication. A learnable projection module maps these vectors into the VLM's intermediate feature space. The resulting new embedding vector is integrated into the LLM component of the VLM alongside the standard inputs. The process does not alter the VLM's original weights.
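Continuing the sketch above, and under our own assumptions about the projection module (a small MLP, a hypothetical CorrelationProjector class, four answer options, and a 4096-dimensional LLM hidden size), the multiple-choice flow could look roughly as follows; the paper's actual module may differ.

import torch
import torch.nn as nn

class CorrelationProjector(nn.Module):
    """Hypothetical projection module: maps the stacked per-option correlation
    vectors to a single embedding in the VLM's hidden space. Only this module
    would be trained; the VLM's original weights stay frozen."""
    def __init__(self, clip_dim: int = 512, num_options: int = 4, vlm_hidden: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim * num_options, vlm_hidden),
            nn.GELU(),
            nn.Linear(vlm_hidden, vlm_hidden),
        )

    def forward(self, corr_vectors: torch.Tensor) -> torch.Tensor:
        # corr_vectors: (batch, num_options, clip_dim)
        flat = corr_vectors.flatten(start_dim=1)  # (batch, num_options * clip_dim)
        return self.mlp(flat).unsqueeze(1)        # (batch, 1, vlm_hidden)

# Usage sketch: one correlation vector per "question + answer option" segment.
question = "What animal is in the image?"
options = ["A. a dog", "B. a cat", "C. a bird", "D. a horse"]
# image = Image.open("example.jpg")  # any test image
# corr = torch.stack([correlation_vector(image, question + " " + o) for o in options], dim=1)
# answerability_embedding = CorrelationProjector()(corr)  # fed to the LLM alongside its standard inputs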
The figure below shows the process for open-ended questions: the flow is similar but involves a single correlation vector computed from the image and question.
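For open-ended questions, the same illustrative pieces apply with a single correlation vector (again, names and dimensions are our assumptions):

# Open-ended sketch: one image-question correlation vector is projected
# into the VLM's hidden space (the projector from above with num_options=1).
# corr = correlation_vector(image, "What color is the umbrella?").unsqueeze(1)  # (1, 1, clip_dim)
# answerability_embedding = CorrelationProjector(num_options=1)(corr)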
Beyond generating an embedding vector that is fed into the VLM, we propose an additional injection approach: Injected LoRA, a novel method for injecting priors, in our case the CLIP-UP correlation vectors, into LoRA layers. We show that this use of the correlation vectors improves results over standard LoRA. Please see the paper for more details.
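Purely as a rough illustration, the sketch below shows one plausible way a prior vector (here, a correlation vector) could be fed into a LoRA-style adapter: the low-rank update is conditioned on the prior while the frozen base weight stays untouched. This is our own guess at such a mechanism, not necessarily the paper's Injected LoRA design.

import torch
import torch.nn as nn

class PriorInjectedLoRALinear(nn.Module):
    """Illustrative LoRA-style adapter conditioned on a prior vector.
    NOT the paper's exact Injected LoRA: a sketch of the general idea of
    feeding a prior (e.g. a projected correlation vector) into the
    low-rank update while the original weight remains frozen."""
    def __init__(self, base: nn.Linear, prior_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # original VLM weights stay frozen
        in_dim, out_dim = base.in_features, base.out_features
        self.A = nn.Linear(in_dim, rank, bias=False)               # standard LoRA "down" projection
        self.B = nn.Linear(rank, out_dim, bias=False)              # standard LoRA "up" projection
        self.prior_proj = nn.Linear(prior_dim, rank, bias=False)   # injects the prior into the low-rank space
        nn.init.zeros_(self.B.weight)               # start as a no-op, as in standard LoRA
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_dim); prior: (batch, prior_dim), e.g. a correlation vector
        delta = self.B(self.A(x) + self.prior_proj(prior).unsqueeze(1))
        return self.base(x) + self.scaling * delta

# e.g. wrapping a frozen linear layer of the LLM (hypothetical usage):
# layer = PriorInjectedLoRALinear(nn.Linear(4096, 4096), prior_dim=512)
# out = layer(hidden_states, prior=corr.squeeze(1))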
@misc{vardi2025clipupclipbasedunanswerableproblem,
title={CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering},
author={Ben Vardi and Oron Nir and Ariel Shamir},
year={2025},
eprint={2501.01371},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.01371},
}