Vision-Language Models (VLMs) demonstrate remarkable capabilities in visual understanding and reasoning, such as in Visual Question Answering (VQA), where the model is asked a question related to a visual input. Still, these models can make distinctly unnatural errors, for example, providing (wrong) answers to unanswerable VQA questions, such as questions asking about objects that do not appear in the image.
To address this issue, we propose CLIP-UP: CLIP-based Unanswerable Problem detection, a novel lightweight method for equipping VLMs with the ability to withhold answers to unanswerable questions. CLIP-UP leverages CLIP-based similarity measures to extract question-image alignment information for detecting unanswerability, requiring efficient training of only a few additional layers while keeping the original VLMs' weights unchanged.
Tested across several models, CLIP-UP achieves significant improvements on benchmarks assessing unanswerability in both multiple-choice and open-ended VQA, surpassing other methods, while preserving original performance on other tasks.
CLIP-UP enhances pre-trained VLMs with the ability to detect and withhold answers to unanswerable multiple-choice and open-ended questions, while preserving the models' original capabilities on standard answerable questions.
In the example below, LLaVA-1.5-7B correctly answers a standard answerable multiple-choice question but fails to withhold an answer for three types of unanswerable multiple-choice questions. Similarly, for open-ended questions, LLaVA-1.5-7B correctly answers an answerable question but fails to abstain from answering an unanswerable one.
CLIP-UP enhances LLaVA-1.5-7B by making it withhold answers to unanswerable questions, while still correctly answering standard ones.
The core idea of CLIP-UP is to enhance a VLM's abstention ability by injecting an answerability prior into its decision process. CLIP-UP does so by leveraging correlation vectors derived from CLIP embeddings of the input image and question, which encode image-question alignment information. These vectors are projected into the VLM's intermediate feature space, producing a new embedding vector that serves as an answerability prior and is seamlessly integrated into the model.
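To make this concrete, the snippet below is a minimal sketch of how such a correlation vector can be computed: the CLIP image and text embeddings are multiplied element-wise. It uses the standard Hugging Face CLIP model as a stand-in for the Structure-CLIP encoder used in the paper, and the helper name, checkpoint, and normalization step are our own illustrative choices rather than the paper's exact recipe.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Standard CLIP as a stand-in for Structure-CLIP (an assumption for this sketch).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def correlation_vector(image: Image.Image, text: str) -> torch.Tensor:
    """Element-wise product of (normalized) CLIP image and text embeddings,
    encoding how well the text aligns with the image (illustrative helper)."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return img_emb * txt_emb  # shape (1, clip_dim), e.g. (1, 512)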
The figure below illustrates CLIP-UP applied to multiple-choice questions: given an image and a VQA prompt, the prompt is transformed into text segments merging the question with each answer option. These segments and the image are encoded by CLIP (specifically by Structure-CLIP) to produce embeddings, from which correlation vectors are formed via element-wise multiplication. A learnable projection module maps these vectors into the VLM's intermediate feature space. The resulting new embedding vector is integrated into the LLM component of the VLM alongside the standard inputs. The process does not alter the VLM's original weights.
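Continuing the sketch above, and under our own assumptions about the projection module (a small MLP, a hypothetical CorrelationProjector class, four answer options, and a 4096-dimensional LLM hidden size), the multiple-choice flow could look roughly as follows; the paper's actual module may differ.

import torch
import torch.nn as nn

class CorrelationProjector(nn.Module):
    """Hypothetical projection module: maps the stacked per-option correlation
    vectors to a single embedding in the VLM's hidden space. Only this module
    would be trained; the VLM's original weights stay frozen."""
    def __init__(self, clip_dim: int = 512, num_options: int = 4, vlm_hidden: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim * num_options, vlm_hidden),
            nn.GELU(),
            nn.Linear(vlm_hidden, vlm_hidden),
        )

    def forward(self, corr_vectors: torch.Tensor) -> torch.Tensor:
        # corr_vectors: (batch, num_options, clip_dim)
        flat = corr_vectors.flatten(start_dim=1)  # (batch, num_options * clip_dim)
        return self.mlp(flat).unsqueeze(1)        # (batch, 1, vlm_hidden)

# Usage sketch: one correlation vector per "question + answer option" segment.
question = "What animal is in the image?"
options = ["A. a dog", "B. a cat", "C. a bird", "D. a horse"]
# image = Image.open("example.jpg")  # any test image
# corr = torch.stack([correlation_vector(image, question + " " + o) for o in options], dim=1)
# answerability_embedding = CorrelationProjector()(corr)  # fed to the LLM alongside its standard inputs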
The figure below shows the process for open-ended questions: the flow is similar but involves a single correlation vector computed from the image and question.
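For open-ended questions, the same illustrative pieces apply with a single correlation vector (again, names and dimensions are our assumptions):

# Open-ended sketch: one image-question correlation vector is projected
# into the VLM's hidden space (the projector from above with num_options=1).
# corr = correlation_vector(image, "What color is the umbrella?").unsqueeze(1)  # (1, 1, clip_dim)
# answerability_embedding = CorrelationProjector(num_options=1)(corr)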
Beyond generating an embedding vector that is fed into the VLM, we propose an additional injection approach: Injected LoRA, a novel method for injecting priors, in our case the CLIP-UP correlation vectors, into LoRA layers. We show that this use of the correlation vectors improves results over standard LoRA. Please see the paper for more details.
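Purely as a rough illustration, the sketch below shows one plausible way a prior vector (here, a correlation vector) could be fed into a LoRA-style adapter: the low-rank update is conditioned on the prior while the frozen base weight stays untouched. This is our own guess at such a mechanism, not necessarily the paper's Injected LoRA design.

import torch
import torch.nn as nn

class PriorInjectedLoRALinear(nn.Module):
    """Illustrative LoRA-style adapter conditioned on a prior vector.
    NOT the paper's exact Injected LoRA: a sketch of the general idea of
    feeding a prior (e.g. a projected correlation vector) into the
    low-rank update while the original weight remains frozen."""
    def __init__(self, base: nn.Linear, prior_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # original VLM weights stay frozen
        in_dim, out_dim = base.in_features, base.out_features
        self.A = nn.Linear(in_dim, rank, bias=False)               # standard LoRA "down" projection
        self.B = nn.Linear(rank, out_dim, bias=False)              # standard LoRA "up" projection
        self.prior_proj = nn.Linear(prior_dim, rank, bias=False)   # injects the prior into the low-rank space
        nn.init.zeros_(self.B.weight)               # start as a no-op, as in standard LoRA
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_dim); prior: (batch, prior_dim), e.g. a correlation vector
        delta = self.B(self.A(x) + self.prior_proj(prior).unsqueeze(1))
        return self.base(x) + self.scaling * delta

# e.g. wrapping a frozen linear layer of the LLM (hypothetical usage):
# layer = PriorInjectedLoRALinear(nn.Linear(4096, 4096), prior_dim=512)
# out = layer(hidden_states, prior=corr.squeeze(1))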
@misc{vardi2025clipupclipbasedunanswerableproblem,
title={CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering},
author={Ben Vardi and Oron Nir and Ariel Shamir},
year={2025},
eprint={2501.01371},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.01371},
}