Model Architectures

The full spectrum of unified multimodal model designs — diffusion-based, autoregressive, hybrid, and any-to-any architectures.


Approaches are categorized by backbone architecture, following the taxonomy from Awesome-Unified-Multimodal-Models: diffusion-based, autoregressive (MLLM), hybrid AR-diffusion, and any-to-any architectures.

Diffusion-Based Unified Models (7 papers)

Model	Full Title	Venue	Date	Code
UniModel	A Visual-Only Framework for Unified Multimodal Understanding and Generation	arXiv	2025/11	—
Lavida-O	Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation	arXiv	2025/09	✅
Muddit	Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model	arXiv	2025/05	✅
FUDOKI	Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities	arXiv	2025/05	—
MMaDA	Multimodal Large Diffusion Language Models	arXiv	2025/05	✅
UniDisc	Unified Multimodal Discrete Diffusion	arXiv	2025/03	✅
Dual Diffusion	Dual Diffusion for Unified Image Generation and Understanding	arXiv	2024/12	✅

MLLM Autoregressive: Pixel Encoding (14 papers)

Model	Full Title	Venue	Date	Code
Emu3.5	Native Multimodal Models are World Learners	arXiv	2025/10	✅
OneCat	Decoder-Only Auto-Regressive Model for Unified Understanding and Generation	arXiv	2025/09	✅
Selftok	Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning	arXiv	2025/05	✅
TokLIP	Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation	arXiv	2025/05	✅
Harmon	Harmonizing Visual Representations for Unified Multimodal Understanding and Generation	arXiv	2025/03	✅
UGen	Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning	arXiv	2025/03	—
SynerGen-VL	Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding	arXiv	2024/12	—
Liquid	Language Models are Scalable and Unified Multi-modal Generators	arXiv	2024/12	✅
Orthus	Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads	arXiv	2024/11	✅
ANOLE	An Open, Autoregressive, Native Large Multimodal Model for Interleaved Image-Text Generation	arXiv	2024/07	✅
Chameleon	Mixed-Modal Early-Fusion Foundation Models	arXiv	2024/05	✅
Emu3	Next-Token Prediction is All You Need	arXiv	2024/09	✅
LWM	World Model on Million-Length Video And Language With Blockwise RingAttention	ICLR	2024/02	✅
Emu	Generative Pretraining in Multimodality	ICLR	2023/07	✅

MLLM Autoregressive: Semantic Encoding (22 papers)

Model	Full Title	Venue	Date	Code
UniCorn	Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision	arXiv	2026/01	—
ORION	Decoupling and Alignment for Unified Autoregressive Understanding and Generation	ICLR	2026	—
Ming-UniVision	Joint Image Understanding and Generation with a Unified Continuous Tokenizer	arXiv	2025/10	✅
Bifrost-1	Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents	arXiv	2025/08	✅
Qwen-Image	Qwen-Image Technical Report	arXiv	2025/08	✅
X-Omni	Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again	arXiv	2025/07	✅
Ovis-U1	Ovis-U1 Technical Report	arXiv	2025/06	✅
UniCode²	Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation	arXiv	2025/06	—
OmniGen2	Exploration to Advanced Multimodal Generation	arXiv	2025/06	✅
Tar	Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations	arXiv	2025/06	✅
UniFork	Exploring Modality Alignment for Unified Multimodal Understanding and Generation	arXiv	2025/06	✅
UniWorld	High-Resolution Semantic Encoders for Unified Visual Understanding and Generation	arXiv	2025/06	✅
Pisces	An Auto-regressive Foundation Model for Image Understanding and Generation	arXiv	2025/06	✅
DualToken	Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies	arXiv	2025/03	✅
UniTok	A Unified Tokenizer for Visual Generation and Understanding	arXiv	2025/02	✅
QLIP	Text-Aligned Visual Tokenization Unifies AR Multimodal Understanding and Generation	arXiv	2025/02	✅
MetaMorph	Multimodal Understanding and Generation via Instruction Tuning	arXiv	2024/12	✅
ILLUME	Illuminating Your LLMs to See, Draw, and Self-Enhance	arXiv	2024/12	—
PUMA	Empowering Unified MLLM with Multi-granular Visual Generation	arXiv	2024/10	✅
VILA-U	Unified Foundation Model Integrating Visual Understanding and Generation	ICLR	2024/09	✅
Mini-Gemini	Mining the Potential of Multi-modality Vision Language Models	arXiv	2024/03	✅
Emu2	Generative Multimodal Models are In-Context Learners	CVPR	2023/12	✅

MLLM Autoregressive: Learnable Query Encoding (11 papers)

Model	Full Title	Venue	Date	Code
UniPic 2.0	Building Kontext Model with Online RL for Unified Multimodal Model	arXiv	2025/09	✅
TBAC-UniImage	Unified Understanding and Generation by Ladder-Side Diffusion Tuning	arXiv	2025/08	✅
UniLIP	Adapting CLIP for Unified Multimodal Understanding, Generation and Editing	arXiv	2025/07	—
OpenUni	A Simple Baseline for Unified Multimodal Understanding and Generation	arXiv	2025/05	✅
BLIP3-o	A Family of Fully Open Unified Multimodal Models	arXiv	2025/05	✅
Ming-Lite-Uni	Advancements in Unified Architecture for Natural Multimodal Interaction	arXiv	2025/05	✅
Nexus-Gen	A Unified Model for Image Understanding, Generation, and Editing	arXiv	2025/04	✅
MetaQueries	Transfer between Modalities with MetaQueries	arXiv	2025/04	—
SEED-X	Multimodal Models with Unified Multi-granularity Comprehension and Generation	arXiv	2024/04	✅
SEED-LLaMA	Making LLaMA SEE and Draw with SEED Tokenizer	ICLR	2023/10	✅
SEED	Planting a SEED of Vision in Large Language Model	arXiv	2023/07	✅

Hybrid Encoding: Pseudo & Joint (14 papers)

Model	Full Title	Architecture	Venue	Date
Skywork UniPic	Unified Autoregressive Modeling for Visual Understanding and Generation	Pseudo	arXiv	2025/08
MindOmni	Unleashing Reasoning Generation in Vision Language Models with RGPO	Pseudo	arXiv	2025/05
UniFluid	Unified Autoregressive Visual Generation and Understanding with Continuous Tokens	Pseudo	arXiv	2025/03
OmniMamba	Efficient and Unified Multimodal Understanding and Generation via State Space Models	Pseudo	arXiv	2025/03
Janus-Pro	Unified Multimodal Understanding and Generation with Data and Model Scaling	Pseudo	arXiv	2025/01
Janus	Decoupling Visual Encoding for Unified Multimodal Understanding and Generation	Pseudo	arXiv	2024/10
Show-o2	Improved Native Unified Multimodal Models	Joint	arXiv	2025/06
UniToken	Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding	Joint	CVPRW	2025/04
VARGPT-v1.1	Improve Visual AR Large Unified Model via Iterative Instruction Tuning and RL	Joint	arXiv	2025/04
ILLUME+	Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement	Joint	arXiv	2025/04
SemHiTok	Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation	Joint	arXiv	2025/03
VARGPT	Unified Understanding and Generation in a Visual Autoregressive MLLM	Joint	arXiv	2025/01
TokenFlow	Unified Image Tokenizer for Multimodal Understanding and Generation	Joint	CVPR	2024/12
MUSE-VL	Modeling Unified VLM through Semantic Discrete Encoding	Joint	arXiv	2024/11

MLLM AR-Diffusion Architectures (13 papers)

Model	Full Title	Venue	Date	Code
EMMA	Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture	arXiv	2025/12	✅
TUNA	Taming Unified Visual Representations for Native Unified Multimodal Models	arXiv	2025/12	✅
MammothModa2	A Unified AR-Diffusion Framework for Multimodal Understanding and Generation	arXiv	2025/11	✅
HBridge	H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation	arXiv	2025/11	—
LightFusion	A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation	arXiv	2025/10	✅
BAGEL	Emerging Properties in Unified Multimodal Pretraining	arXiv	2025/05	✅
Mogao	An Omni Foundation Model for Interleaved Multi-Modal Generation	arXiv	2025/05	—
LMFusion	Adapting Pretrained Language Models for Multimodal Generation	arXiv	2024/12	—
JanusFlow	Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation	arXiv	2024/11	✅
MonoFormer	One Transformer for Both Diffusion and Autoregression	arXiv	2024/09	✅
Show-o	One Single Transformer to Unify Multimodal Understanding and Generation	ICLR	2024/08	✅
Transfusion	Predict the Next Token and Diffuse Images with One Multi-Modal Model	ICLR	2024/08	✅
UniReal	Universal Image Generation and Editing via Learning Real-World Dynamics	arXiv	2025	—

Any-to-Any Multimodal Models (14 papers)

Model	Full Title	Venue	Date	Code
LongCat-Flash-Omni	LongCat-Flash-Omni Technical Report	arXiv	2025/11	✅
Ming-Flash-Omni	A Sparse, Unified Architecture for Multimodal Perception and Generation	arXiv	2025/10	✅
Qwen3-Omni	Qwen3-Omni Technical Report	arXiv	2025/09	✅
Ming-Omni	A Unified Multimodal Model for Perception and Generation	arXiv	2025/06	✅
M2-omni	Advancing Omni-MLLM for Comprehensive Modality Support	arXiv	2025/02	—
OmniFlow	Any-to-Any Generation with Multi-Modal Rectified Flows	CVPR	2024/12	✅
Spider	Any-to-Many Multimodal LLM	arXiv	2024/11	✅
MIO	A Foundation Model on Multimodal Tokens	arXiv	2024/09	✅
X-VILA	Cross-Modality Alignment for Large Language Model	arXiv	2024/05	—
AnyGPT	Unified Multimodal LLM with Discrete Sequence Modeling	arXiv	2024/02	✅
Video-LaVIT	Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization	ICML	2024/02	✅
Unified-IO 2	Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action	CVPR	2023/12	✅
NExT-GPT	Any-to-Any Multimodal LLM	ICML	2023/09	✅
ImageBind	One Embedding Space To Bind Them All	CVPR	2023	✅

Applications & Opportunities (4 papers)

Model	Full Title	Venue	Date	Code
UniGame	Turning a Unified Multimodal Model Into Its Own Adversary	arXiv	2025/11	✅
UniCTokens	Boosting Personalized Understanding and Generation via Unified Concept Tokens	arXiv	2025/05	✅
Fair-UMLLM	On Fairness of Unified Multimodal Large Language Model for Image Generation	arXiv	2025/02	—
T2I-R1	Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT	arXiv	2025/01	✅