The full spectrum of unified multimodal model designs — diffusion-based, autoregressive, hybrid, and any-to-any architectures.
Approaches are categorized by backbone architecture, following the taxonomy from Awesome-Unified-Multimodal-Models: diffusion-based, autoregressive (MLLM), hybrid AR-diffusion, and any-to-any architectures.
Diffusion-Based Unified Models (8 papers)
| Model | Full Title | Venue | Date | Code |
|---|---|---|---|---|
| UniDFlow | UniDFlow: Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching | arXiv | 2026/02 | — |
| UniModel | A Visual-Only Framework for Unified Multimodal Understanding and Generation | arXiv | 2025/11 | — |
| Lavida-O | Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation | arXiv | 2025/09 | ✅ |
| Muddit | Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model | arXiv | 2025/05 | ✅ |
| FUDOKI | Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities | arXiv | 2025/05 | — |
| MMaDA | Multimodal Large Diffusion Language Models | arXiv | 2025/05 | ✅ |
| UniDisc | Unified Multimodal Discrete Diffusion | arXiv | 2025/03 | ✅ |
| Dual Diffusion | Dual Diffusion for Unified Image Generation and Understanding | arXiv | 2024/12 | ✅ |
MLLM Autoregressive — Pixel Encoding (14 papers)
| Model | Full Title | Venue | Date | Code |
|---|---|---|---|---|
| Emu3.5 | Native Multimodal Models are World Learners | arXiv | 2025/10 | ✅ |
| OneCat | Decoder-Only Auto-Regressive Model for Unified Understanding and Generation | arXiv | 2025/09 | ✅ |
| Selftok | Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning | arXiv | 2025/05 | ✅ |
| TokLIP | Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation | arXiv | 2025/05 | ✅ |
| Harmon | Harmonizing Visual Representations for Unified Multimodal Understanding and Generation | arXiv | 2025/03 | ✅ |
| UGen | Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning | arXiv | 2025/03 | — |
| SynerGen-VL | Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding | arXiv | 2024/12 | — |
| Liquid | Language Models are Scalable and Unified Multi-modal Generators | arXiv | 2024/12 | ✅ |
| Orthus | Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | arXiv | 2024/11 | ✅ |
| ANOLE | An Open, Autoregressive, Native Large Multimodal Model for Interleaved Image-Text Generation | arXiv | 2024/07 | ✅ |
| Chameleon | Mixed-Modal Early-Fusion Foundation Models | arXiv | 2024/05 | ✅ |
LWM
World Model on Million-Length Video And Language With Blockwise RingAttention