Unified Multimodal Understanding & Generation
Transcending the boundary between perception and generation — architectures that concurrently comprehend and synthesize visual content within a unified parametric framework.
The pursuit of unified multimodal models represents a fundamental paradigm shift: rather than maintaining separate systems for visual understanding and visual generation, these architectures learn both capabilities within a shared representational space. This section comprehensively catalogues models, benchmarks, and datasets following the taxonomy from Awesome-Unified-Multimodal-Models.
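To make the "shared representational space" idea concrete, here is a minimal, hypothetical PyTorch sketch (module names, vocabulary sizes, and layer choices are illustrative placeholders, not taken from any model catalogued below): one transformer trunk ingests both text tokens and image patch features, and two lightweight heads read out from the same hidden states, one for understanding (text prediction) and one for generation (discrete image tokens).

```python
# Illustrative sketch of a unified backbone (hypothetical names and sizes).
# A single shared trunk produces hidden states consumed by BOTH a text head
# (understanding) and an image-token head (generation).
import torch
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    def __init__(self, text_vocab=32000, image_vocab=8192, dim=512,
                 depth=6, heads=8, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(text_vocab, dim)   # text tokens -> shared space
        self.image_proj = nn.Linear(patch_dim, dim)       # image patch features -> shared space
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, depth)  # shared parameters for both tasks
        self.text_head = nn.Linear(dim, text_vocab)       # understanding: predict text
        self.image_head = nn.Linear(dim, image_vocab)     # generation: predict image tokens

    def forward(self, text_ids, image_patches):
        # Concatenate both modalities into one sequence and run the shared trunk.
        tokens = torch.cat([self.text_embed(text_ids),
                            self.image_proj(image_patches)], dim=1)
        hidden = self.trunk(tokens)
        return self.text_head(hidden), self.image_head(hidden)

model = UnifiedBackbone()
text_ids = torch.randint(0, 32000, (1, 16))     # a short text prompt
image_patches = torch.randn(1, 64, 768)         # 64 pre-extracted patch features
text_logits, image_logits = model(text_ids, image_patches)
print(text_logits.shape, image_logits.shape)    # (1, 80, 32000) and (1, 80, 8192)
```

Real systems in the list differ mainly in how the image side enters this shared space (raw pixels, semantic features, or learned queries) and in how the generation head is realized, which is exactly the axis the architecture taxonomy below is organized around.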
Model Architectures
Diffusion-based, autoregressive (pixel, semantic & query encoding), hybrid AR-diffusion, any-to-any, and applied architectures — the full spectrum of unified model designs.
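These design families differ chiefly in how the image side is decoded. The hedged sketch below contrasts the two dominant choices, autoregressive sampling of discrete image tokens versus iterative denoising with a diffusion head; all function and variable names are illustrative stand-ins, not APIs of any listed model.

```python
# Illustrative only: two ways a unified model can realize the generation side.
import torch

def sample_image_tokens_ar(backbone_step, num_tokens=64, vocab=8192):
    """Autoregressive: draw discrete image tokens one at a time."""
    tokens = []
    for _ in range(num_tokens):
        logits = backbone_step(tokens)              # hypothetical: logits over the image vocab
        probs = torch.softmax(logits, dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())
    return tokens

def sample_image_latent_diffusion(denoiser, steps=50, latent_shape=(4, 32, 32)):
    """Diffusion: start from noise and iteratively denoise a continuous latent."""
    x = torch.randn(latent_shape)
    for t in reversed(range(steps)):
        noise_pred = denoiser(x, t)                 # hypothetical noise-prediction network
        x = x - noise_pred / steps                  # crude update standing in for a real sampler
    return x

# Toy stand-ins so the sketch runs end to end:
dummy_step = lambda toks: torch.randn(8192)
dummy_denoiser = lambda x, t: 0.1 * torch.randn_like(x)
print(len(sample_image_tokens_ar(dummy_step)),
      sample_image_latent_diffusion(dummy_denoiser).shape)
```

Hybrid AR-diffusion designs combine the two (for example, autoregressive text with a diffusion-based image decoder), while any-to-any models extend the same shared trunk to additional input and output modalities.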
Evaluation
Understanding benchmarks, image generation benchmarks, interleaved/compositional evaluation, and comprehensive surveys.
Datasets
Multimodal understanding corpora, text-to-image datasets, image editing datasets, and interleaved image-text training data.