Unified Multimodal Understanding & Generation
Transcending the boundary between perception and generation — architectures that concurrently comprehend and synthesize visual content within a unified parametric framework.
The pursuit of unified multimodal models represents a fundamental paradigm shift: rather than maintaining separate systems for visual understanding and visual generation, these architectures learn both capabilities within a shared representational space. This section comprehensively catalogues models, benchmarks, and datasets following the taxonomy from Awesome-Unified-Multimodal-Models.
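To make the "shared representational space" idea concrete, here is a minimal, hypothetical PyTorch sketch (module names, vocabulary sizes, and layer choices are illustrative placeholders, not taken from any model catalogued below): one transformer trunk ingests both text tokens and image patch features, and two lightweight heads read out from the same hidden states, one for understanding (text prediction) and one for generation (discrete image tokens).

```python
# Illustrative sketch of a unified backbone (hypothetical names and sizes).
# A single shared trunk produces hidden states consumed by BOTH a text head
# (understanding) and an image-token head (generation).
import torch
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    def __init__(self, text_vocab=32000, image_vocab=8192, dim=512,
                 depth=6, heads=8, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(text_vocab, dim)   # text tokens -> shared space
        self.image_proj = nn.Linear(patch_dim, dim)       # image patch features -> shared space
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, depth)  # shared parameters for both tasks
        self.text_head = nn.Linear(dim, text_vocab)       # understanding: predict text
        self.image_head = nn.Linear(dim, image_vocab)     # generation: predict image tokens

    def forward(self, text_ids, image_patches):
        # Concatenate both modalities into one sequence and run the shared trunk.
        tokens = torch.cat([self.text_embed(text_ids),
                            self.image_proj(image_patches)], dim=1)
        hidden = self.trunk(tokens)
        return self.text_head(hidden), self.image_head(hidden)

model = UnifiedBackbone()
text_ids = torch.randint(0, 32000, (1, 16))     # a short text prompt
image_patches = torch.randn(1, 64, 768)         # 64 pre-extracted patch features
text_logits, image_logits = model(text_ids, image_patches)
print(text_logits.shape, image_logits.shape)    # (1, 80, 32000) and (1, 80, 8192)
```

Real systems in the list differ mainly in how the image side enters this shared space (raw pixels, semantic features, or learned queries) and in how the generation head is realized, which is exactly the axis the architecture taxonomy below is organized around.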
Model Architectures
Diffusion-based, autoregressive (pixel, semantic & query encoding), hybrid AR-diffusion, any-to-any, and applied architectures — the full spectrum of unified model designs.
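These design families differ chiefly in how the image side is decoded. The hedged sketch below contrasts the two dominant choices, autoregressive sampling of discrete image tokens versus iterative denoising with a diffusion head; all function and variable names are illustrative stand-ins, not APIs of any listed model.

```python
# Illustrative only: two ways a unified model can realize the generation side.
import torch

def sample_image_tokens_ar(backbone_step, num_tokens=64, vocab=8192):
    """Autoregressive: draw discrete image tokens one at a time."""
    tokens = []
    for _ in range(num_tokens):
        logits = backbone_step(tokens)              # hypothetical: logits over the image vocab
        probs = torch.softmax(logits, dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())
    return tokens

def sample_image_latent_diffusion(denoiser, steps=50, latent_shape=(4, 32, 32)):
    """Diffusion: start from noise and iteratively denoise a continuous latent."""
    x = torch.randn(latent_shape)
    for t in reversed(range(steps)):
        noise_pred = denoiser(x, t)                 # hypothetical noise-prediction network
        x = x - noise_pred / steps                  # crude update standing in for a real sampler
    return x

# Toy stand-ins so the sketch runs end to end:
dummy_step = lambda toks: torch.randn(8192)
dummy_denoiser = lambda x, t: 0.1 * torch.randn_like(x)
print(len(sample_image_tokens_ar(dummy_step)),
      sample_image_latent_diffusion(dummy_denoiser).shape)
```

Hybrid AR-diffusion designs combine the two (for example, autoregressive text with a diffusion-based image decoder), while any-to-any models extend the same shared trunk to additional input and output modalities.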
Evaluation
Understanding benchmarks, image generation benchmarks, interleaved/compositional evaluation, and comprehensive surveys.
Datasets
Multimodal understanding corpora, text-to-image datasets, image editing datasets, and interleaved image-text training data.