Evaluation

Benchmarks for evaluating unified multimodal models — understanding, image generation, and interleaved tasks.

Comprehensive benchmarks for evaluating unified multimodal models — curated from Awesome-Unified-Multimodal-Models. Covers understanding tasks, image generation quality, and interleaved generation evaluation.

Benchmarks on Understanding Tasks (13 benchmarks)

| Benchmark | Paper Title | Venue | Year |
| --- | --- | --- | --- |
| General-Bench | On Path to Multimodal Generalist: General-Level and General-Bench | ICML | 2025 |
| MM-Vet v2 | A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities | arXiv | 2024 |
| OwlEval | mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2024 |
| oVQA | Open-ended VQA benchmarking of Vision-Language models | ICLR | 2024 |
| SEED-Bench-2 | Benchmarking Multimodal Large Language Models | arXiv | 2023 |
| MMMU | A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark | CVPR | 2023 |
| MM-Vet | Evaluating Large Multimodal Models for Integrated Capabilities | ICML | 2023 |
| SEED-Bench | Benchmarking Multimodal LLMs with Generative Comprehension | CVPR | 2023 |
| MMBench | Is Your Multi-modal Model an All-around Player? | ECCV | 2023 |
| LAMM | Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | NeurIPS | 2023 |
| HaluEval | A Large-Scale Hallucination Evaluation Benchmark for Large Language Models | EMNLP | 2023 |
| GQA | A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | CVPR | 2019 |
| VQA | Visual Question Answering | ICCV | 2015 |

Benchmarks on Image Generation Tasks (15 benchmarks)

| Benchmark | Paper Title | Venue | Year |
| --- | --- | --- | --- |
| GenExam | A Multidisciplinary Text-to-Image Exam | arXiv | 2025 |
| OneIG-Bench | Omni-dimensional Nuanced Evaluation for Image Generation | arXiv | 2025 |
| WISE | A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation | arXiv | 2025 |
| DreamBench++ | A Human-Aligned Benchmark for Personalized Image Generation | ICLR | 2025 |
| T2I-CompBench++ | Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation | TPAMI | 2025 |
| GenAI-Bench | Evaluating and Improving Compositional Text-to-Visual Generation | CVPR | 2024 |
| ConceptMix | A Compositional Image Generation Benchmark with Controllable Difficulty | NeurIPS | 2024 |
| VQAScore | Evaluating Text-to-Visual Generation with Image-to-Text Generation | ECCV | 2024 |
| GenEval | An Object-Focused Framework for Evaluating Text-to-Image Alignment | NeurIPS | 2023 |
| T2I-CompBench | Comprehensive Benchmark for Open-world Compositional Text-to-image Generation | NeurIPS | 2023 |
| DPG-Bench | ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment | arXiv | 2024 |
| HEIM | Holistic Evaluation of Text-To-Image Models | NeurIPS | 2023 |
| TIFA | Accurate and Interpretable Text-to-Image Faithfulness Evaluation with QA | ICCV | 2023 |
| PartiPrompts | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation | TMLR | 2022 |
| DrawBench | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding | NeurIPS | 2022 |

Benchmarks on Interleaved / Compositional Tasks (7 benchmarks)

| Benchmark | Paper Title | Venue | Year |
| --- | --- | --- | --- |
| VTBench | Evaluating Visual Tokenizers for Autoregressive Image Generation | arXiv | 2025 |
| UniBench | Unified Holistic Evaluation for Unified Multimodal Understanding and Generation | arXiv | 2025 |
| OpenING | Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation | CVPR | 2024 |
| ISG | Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment | ICLR | 2024 |
| MMIE | Massive Multimodal Interleaved Comprehension Benchmark | ICLR | 2024 |
| InterleavedBench | Holistic Evaluation for Interleaved Text-and-Image Generation | EMNLP | 2024 |
| OpenLEAF | Open-Domain Interleaved Image-Text Generation and Evaluation | MM | 2023 |

Unified Models Survey (1 survey)

| Title | Domain | Venue | Year |
| --- | --- | --- | --- |
| Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities | Unified | arXiv | 2025 |