MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

ReLER Lab, CCAI, Zhejiang University
In Submission

Multi-context visual grounding is a new task that aims to localize instances based on open-ended text prompts in multi-image scenarios. A new dataset, MC-Bench, is constructed to benchmark MLLMs and foundation models with potential multi-context visual grounding capabilities.
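As a rough illustration of the task interface, the sketch below shows one way a multi-context sample could be represented in Python. The field names and schema here are hypothetical and chosen for clarity; they are not the released dataset's actual format.

from dataclasses import dataclass, field

@dataclass
class MCSample:
    """A hypothetical multi-context grounding sample: an image pair, an
    open-ended text prompt, and per-image target boxes. This sketch assumes
    a target may appear in either image or in both."""
    images: tuple[str, str]  # paths to the paired images
    prompt: str              # open-ended text prompt indicating the targets
    # image index -> list of (x1, y1, x2, y2) instance boxes
    boxes: dict[int, list[tuple[float, float, float, float]]] = field(default_factory=dict)

# Illustrative values only, not a real MC-Bench record:
sample = MCSample(
    images=("img_a.jpg", "img_b.jpg"),
    prompt="the animal that appears in both photos",
    boxes={0: [(120.0, 40.0, 260.0, 300.0)], 1: [(35.0, 60.0, 180.0, 240.0)]},
)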

Abstract

While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities, their ability to solve instance-level vision-language problems beyond a single image warrants further exploration. To assess these largely unproven abilities, this paper proposes a new visual grounding task called multi-context visual grounding, which aims to localize instances of interest across multiple images based on open-ended text prompts. To facilitate this research, we construct a new dataset, MC-Bench, featuring 2K high-quality, manually annotated samples. Each sample consists of an instance-level labeled image pair and a corresponding text prompt that indicates the target instances in the images. These text prompts are highly open-ended and grouped into three distinct styles, covering 20 practical skills. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities, as well as a simple yet effective stepwise baseline and a baseline finetuned via multi-context instruction tuning. Our evaluation reveals a non-trivial performance gap between existing MLLMs and humans, along with several interesting observations that suggest potential future directions. We hope MC-Bench and our empirical findings encourage the research community to further explore and enhance the untapped potential of MLLMs in instance-level tasks, particularly in multi-image contexts.
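The abstract mentions a stepwise baseline, but its actual design is not detailed on this page. One plausible reading, stated purely as an assumption, is a two-step decomposition: an MLLM first rewrites the multi-image prompt into a per-image referring expression (or decides the target is absent from an image), and an off-the-shelf single-image grounding model then localizes each expression. The sketch below illustrates that assumed pipeline; mllm_describe_targets and ground_in_image are hypothetical stubs, not real APIs or the paper's implementation.

from typing import Callable, Optional

Box = tuple[float, float, float, float]

def stepwise_ground(
    images: tuple[str, str],
    prompt: str,
    mllm_describe_targets: Callable[[tuple[str, str], str], list[Optional[str]]],
    ground_in_image: Callable[[str, str], list[Box]],
) -> dict[int, list[Box]]:
    # Step 1 (assumed): the MLLM reads both images plus the open-ended prompt
    # and emits one referring expression per image, or None if it judges the
    # target absent from that image.
    per_image_refs = mllm_describe_targets(images, prompt)

    # Step 2 (assumed): a single-image grounding model localizes each
    # expression independently in its own image.
    predictions: dict[int, list[Box]] = {}
    for idx, (image, ref) in enumerate(zip(images, per_image_refs)):
        if ref is None:
            continue  # target judged absent from this image
        predictions[idx] = ground_in_image(image, ref)
    return predictions

If the baseline indeed works this way, the appeal would be that any existing single-image grounding model can be reused unchanged, with the MLLM handling only the cross-image reasoning.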

MC-Bench

Diverse samples covering 20 practical skills

Statistical analysis

Experiments

Benchmark results

More analysis

BibTeX

@article{xu2024mc,
    title={MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs},
    author={Xu, Yunqiu and Zhu, Linchao and Yang, Yi},
    journal={arXiv preprint arXiv:2410.12332},
    year={2024}
}