Evaluation

Benchmarks for evaluating unified multimodal models — understanding, image generation, and interleaved tasks.

Comprehensive benchmarks for evaluating unified multimodal models, curated from Awesome-Unified-Multimodal-Models. The list covers understanding tasks, image generation quality, and interleaved generation evaluation.

Benchmarks on Understanding Tasks (19 benchmarks)

Benchmark	Paper Title	Venue	Year	Code
GENIUS	GENIUS: Generative Fluid Intelligence Evaluation Suite	arXiv	2026	—
PaperBanana	PaperBanana: Automating Academic Illustration for AI Scientists	arXiv	2026	—
VisGym	VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents	arXiv	2026	—
MMDeepResearch-Bench	MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents	arXiv	2026	—
Think-Clip-Sample	Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding	arXiv	2026	—
VPBench	Visually Prompted Benchmarks Are Surprisingly Fragile	arXiv	2026	—
General-Bench	On Path to Multimodal Generalist: General-Level and General-Bench	ICML	2025	✅
MM-Vet v2	A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities	arXiv	2024	✅
OwlEval	mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality	arXiv	2024	✅
oVQA	Open-ended VQA benchmarking of Vision-Language models	ICLR	2024	✅
SEED-Bench-2	Benchmarking Multimodal Large Language Models	arXiv	2023	✅
MMMU	A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark	CVPR	2023	✅
MM-Vet	Evaluating Large Multimodal Models for Integrated Capabilities	ICML	2023	✅
SEED-Bench	Benchmarking Multimodal LLMs with Generative Comprehension	CVPR	2023	✅
MMBench	Is Your Multi-modal Model an All-around Player?	ECCV	2023	✅
LAMM	Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark	NeurIPS	2023	✅
HaluEval	A Large-Scale Hallucination Evaluation Benchmark for Large Language Models	EMNLP	2023	✅
GQA	A New Dataset for Real-World Visual Reasoning and Compositional Question Answering	CVPR	2019	✅
VQA	Visual Question Answering	ICCV	2015	—

Benchmarks on Image Generation Tasks (15 benchmarks)

Benchmark	Paper Title	Venue	Year	Code
GenExam	A Multidisciplinary Text-to-Image Exam	arXiv	2025	✅
OneIG-Bench	Omni-dimensional Nuanced Evaluation for Image Generation	arXiv	2025	✅
WISE	A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation	arXiv	2025	✅
DreamBench++	A Human-Aligned Benchmark for Personalized Image Generation	ICLR	2025	✅
T2I-CompBench++	Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation	TPAMI	2025	✅
GenAI-Bench	Evaluating and Improving Compositional Text-to-Visual Generation	CVPR	2024	✅
ConceptMix	A Compositional Image Generation Benchmark with Controllable Difficulty	NeurIPS	2024	✅
VQAScore	Evaluating Text-to-Visual Generation with Image-to-Text Generation	ECCV	2024	✅
GenEval	An Object-Focused Framework for Evaluating Text-to-Image Alignment	NeurIPS	2023	✅
T2I-CompBench	Comprehensive Benchmark for Open-world Compositional Text-to-image Generation	NeurIPS	2023	✅
DPG-Bench	ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment	arXiv	2024	✅
HEIM	Holistic Evaluation of Text-To-Image Models	NeurIPS	2023	✅
TIFA	Accurate and Interpretable Text-to-Image Faithfulness Evaluation with QA	ICCV	2023	✅
PartiPrompts	Scaling Autoregressive Models for Content-Rich Text-to-Image Generation	TMLR	2022	✅
DrawBench	Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding	NeurIPS	2022	—

Benchmarks on Interleaved / Compositional Tasks (7 benchmarks)

Benchmark	Paper Title	Venue	Year	Code
VTBench	Evaluating Visual Tokenizers for Autoregressive Image Generation	arXiv	2025	✅
UniBench	Unified Holistic Evaluation for Unified Multimodal Understanding and Generation	arXiv	2025	✅
OpenING	Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation	CVPR	2024	✅
ISG	Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment	ICLR	2024	✅
MMIE	Massive Multimodal Interleaved Comprehension Benchmark	ICLR	2024	✅
InterleavedBench	Holistic Evaluation for Interleaved Text-and-Image Generation	EMNLP	2024	—
OpenLEAF	Open-Domain Interleaved Image-Text Generation and Evaluation	MM	2023	—

Unified Models Survey (1 survey)

Title	Domain	Venue	Year
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities	Unified	arXiv	2025