While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities and shown potential to serve as general-purpose assistants, their ability to solve multi-image, instance-level vision-language problems warrants further exploration. To assess these unproven abilities of MLLMs, this paper introduces a practical open-ended visual grounding task termed multi-context visual grounding, which aims to localize instances of interest across multiple images according to flexible text input. To facilitate this research, we meticulously construct a new dataset, MC-Bench, and benchmark the open-ended visual grounding capability of MLLM-based methods. MC-Bench features 2,000 high-quality, manually annotated samples, each consisting of an instance-level labeled image pair and a corresponding text prompt indicating the instances of interest within, or absent from, the images. The text prompts fall into three broad categories (i.e., referring, comparison, and reasoning) and cover over 10 practical skills (e.g., multi-hop reasoning, common-sense reasoning, multi-view reasoning, and temporal understanding). Our evaluation on MC-Bench reveals a significant performance gap between humans and existing MLLM-based approaches, especially for end-to-end models. We hope MC-Bench encourages the research community to delve deeper into the untapped potential of MLLMs for instance-level tasks, particularly in multi-image scenarios.