Model Architectures

The full spectrum of unified multimodal model designs — diffusion-based, autoregressive, hybrid, and any-to-any architectures.

Approaches are categorized by their backbone architecture, following the taxonomy from Awesome-Unified-Multimodal-Models: diffusion-based, autoregressive (MLLM), hybrid AR-diffusion, and any-to-any architectures.
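The taxonomy above is just a categorized index, which can be represented programmatically for filtering. A minimal sketch (the category keys and the `Paper` record are my own naming, not from the source; the single sample entry is taken from the diffusion table below):

```python
from dataclasses import dataclass


@dataclass
class Paper:
    model: str
    title: str
    venue: str
    date: str  # "YYYY/MM"


# One bucket per backbone-architecture category on this page.
TAXONOMY: dict[str, list[Paper]] = {
    "diffusion": [],
    "ar_pixel_encoding": [],
    "ar_semantic_encoding": [],
    "ar_learnable_query": [],
    "hybrid_encoding": [],
    "ar_diffusion": [],
    "any_to_any": [],
}

# Sample entry from the diffusion-based table below.
TAXONOMY["diffusion"].append(
    Paper("MMaDA", "Multimodal Large Diffusion Language Models", "arXiv", "2025/05")
)


def papers_since(category: str, cutoff: str) -> list[Paper]:
    """Papers in a category dated on or after cutoff (YYYY/MM sorts lexicographically)."""
    return [p for p in TAXONOMY[category] if p.date >= cutoff]
```

Because the dates use a fixed `YYYY/MM` format, plain string comparison gives correct chronological ordering without any date parsing.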

Diffusion-Based Unified Models (7 papers)

| Model | Full Title | Venue | Date |
| --- | --- | --- | --- |
| UniModel | A Visual-Only Framework for Unified Multimodal Understanding and Generation | arXiv | 2025/11 |
| Lavida-O | Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation | arXiv | 2025/09 |
| Muddit | Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model | arXiv | 2025/05 |
| FUDOKI | Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities | arXiv | 2025/05 |
| MMaDA | Multimodal Large Diffusion Language Models | arXiv | 2025/05 |
| UniDisc | Unified Multimodal Discrete Diffusion | arXiv | 2025/03 |
| Dual Diffusion | Dual Diffusion for Unified Image Generation and Understanding | arXiv | 2024/12 |

MLLM Autoregressive — Pixel Encoding (14 papers)

| Model | Full Title | Venue | Date |
| --- | --- | --- | --- |
| Emu3.5 | Native Multimodal Models are World Learners | arXiv | 2025/10 |
| OneCat | Decoder-Only Auto-Regressive Model for Unified Understanding and Generation | arXiv | 2025/09 |
| Selftok | Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning | arXiv | 2025/05 |
| TokLIP | Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation | arXiv | 2025/05 |
| Harmon | Harmonizing Visual Representations for Unified Multimodal Understanding and Generation | arXiv | 2025/03 |
| UGen | Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning | arXiv | 2025/03 |
| SynerGen-VL | Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding | arXiv | 2024/12 |
| Liquid | Language Models are Scalable and Unified Multi-modal Generators | arXiv | 2024/12 |
| Orthus | Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | arXiv | 2024/11 |
| Emu3 | Emu3: Next-Token Prediction is All You Need | arXiv | 2024/09 |
| ANOLE | An Open, Autoregressive, Native Large Multimodal Model for Interleaved Image-Text Generation | arXiv | 2024/07 |
| Chameleon | Mixed-Modal Early-Fusion Foundation Models | arXiv | 2024/05 |
| LWM | World Model on Million-Length Video And Language With Blockwise RingAttention | ICLR | 2024/02 |
| Emu | Generative Pretraining in Multimodality | ICLR | 2023/07 |

MLLM Autoregressive — Semantic Encoding (22 papers)

| Model | Full Title | Venue | Date |
| --- | --- | --- | --- |
| UniCorn | Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision | arXiv | 2026/01 |
| ORION | ORION: Decoupling and Alignment for Unified Autoregressive Understanding and Generation | ICLR | 2026 |
| Ming-UniVision | Joint Image Understanding and Generation with a Unified Continuous Tokenizer | arXiv | 2025/10 |
| Bifrost-1 | Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents | arXiv | 2025/08 |
| Qwen-Image | Qwen-Image Technical Report | arXiv | 2025/08 |
| X-Omni | Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again | arXiv | 2025/07 |
| Ovis-U1 | Ovis-U1 Technical Report | arXiv | 2025/06 |
| UniCode² | Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation | arXiv | 2025/06 |
| OmniGen2 | Exploration to Advanced Multimodal Generation | arXiv | 2025/06 |
| Tar | Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations | arXiv | 2025/06 |
| UniFork | Exploring Modality Alignment for Unified Multimodal Understanding and Generation | arXiv | 2025/06 |
| UniWorld | High-Resolution Semantic Encoders for Unified Visual Understanding and Generation | arXiv | 2025/06 |
| Pisces | An Auto-regressive Foundation Model for Image Understanding and Generation | arXiv | 2025/06 |
| DualToken | Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | arXiv | 2025/03 |
| UniTok | A Unified Tokenizer for Visual Generation and Understanding | arXiv | 2025/02 |
| QLIP | Text-Aligned Visual Tokenization Unifies AR Multimodal Understanding and Generation | arXiv | 2025/02 |
| MetaMorph | Multimodal Understanding and Generation via Instruction Tuning | arXiv | 2024/12 |
| ILLUME | Illuminating Your LLMs to See, Draw, and Self-Enhance | arXiv | 2024/12 |
| PUMA | Empowering Unified MLLM with Multi-granular Visual Generation | arXiv | 2024/10 |
| VILA-U | Unified Foundation Model Integrating Visual Understanding and Generation | ICLR | 2024/09 |
| Mini-Gemini | Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024/03 |
| Emu2 | Generative Multimodal Models are In-Context Learners | CVPR | 2023/12 |

MLLM Autoregressive — Learnable Query Encoding (11 papers)

| Model | Full Title | Venue | Date |
| --- | --- | --- | --- |
| UniPic 2.0 | Building Kontext Model with Online RL for Unified Multimodal Model | arXiv | 2025/09 |
| TBAC-UniImage | Unified Understanding and Generation by Ladder-Side Diffusion Tuning | arXiv | 2025/08 |
| UniLIP | Adapting CLIP for Unified Multimodal Understanding, Generation and Editing | arXiv | 2025/07 |
| OpenUni | A Simple Baseline for Unified Multimodal Understanding and Generation | arXiv | 2025/05 |
| BLIP3-o | A Family of Fully Open Unified Multimodal Models | arXiv | 2025/05 |
| Ming-Lite-Uni | Advancements in Unified Architecture for Natural Multimodal Interaction | arXiv | 2025/05 |
| Nexus-Gen | A Unified Model for Image Understanding, Generation, and Editing | arXiv | 2025/04 |
| MetaQueries | Transfer between Modalities with MetaQueries | arXiv | 2025/04 |
| SEED-X | Multimodal Models with Unified Multi-granularity Comprehension and Generation | arXiv | 2024/04 |
| SEED-LLaMA | Making LLaMA SEE and Draw with SEED Tokenizer | ICLR | 2023/10 |
| SEED | Planting a SEED of Vision in Large Language Model | arXiv | 2023/07 |

Hybrid Encoding — Pseudo & Joint (14 papers)

| Model | Full Title | Architecture | Venue | Date |
| --- | --- | --- | --- | --- |
| Skywork UniPic | Unified Autoregressive Modeling for Visual Understanding and Generation | Pseudo | arXiv | 2025/08 |
| MindOmni | Unleashing Reasoning Generation in Vision Language Models with RGPO | Pseudo | arXiv | 2025/05 |
| UniFluid | Unified Autoregressive Visual Generation and Understanding with Continuous Tokens | Pseudo | arXiv | 2025/03 |
| OmniMamba | Efficient and Unified Multimodal Understanding and Generation via State Space Models | Pseudo | arXiv | 2025/03 |
| Janus-Pro | Unified Multimodal Understanding and Generation with Data and Model Scaling | Pseudo | arXiv | 2025/01 |
| Janus | Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | Pseudo | arXiv | 2024/10 |
| Show-o2 | Improved Native Unified Multimodal Models | Joint | arXiv | 2025/06 |
| UniToken | Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding | Joint | CVPRW | 2025/04 |
| VARGPT-v1.1 | Improve Visual AR Large Unified Model via Iterative Instruction Tuning and RL | Joint | arXiv | 2025/04 |
| ILLUME+ | Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement | Joint | arXiv | 2025/04 |
| VARGPT | Unified Understanding and Generation in a Visual Autoregressive MLLM | Joint | arXiv | 2025/01 |
| TokenFlow | Unified Image Tokenizer for Multimodal Understanding and Generation | Joint | CVPR | 2024/12 |
| MUSE-VL | Modeling Unified VLM through Semantic Discrete Encoding | Joint | arXiv | 2024/11 |
| SemHiTok | Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation | Joint | arXiv | 2025/03 |

MLLM AR-Diffusion Architectures (13 papers)

| Model | Full Title | Venue | Date |
| --- | --- | --- | --- |
| MammothModa2 | A Unified AR-Diffusion Framework for Multimodal Understanding and Generation | arXiv | 2025/11 |
| EMMA | Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture | arXiv | 2025/12 |
| HBridge | H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation | arXiv | 2025/11 |
| TUNA | Taming Unified Visual Representations for Native Unified Multimodal Models | arXiv | 2025/12 |
| LightFusion | A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation | arXiv | 2025/10 |
| BAGEL | Emerging Properties in Unified Multimodal Pretraining | arXiv | 2025/05 |
| Mogao | An Omni Foundation Model for Interleaved Multi-Modal Generation | arXiv | 2025/05 |
| LMFusion | Adapting Pretrained Language Models for Multimodal Generation | arXiv | 2024/12 |
| MonoFormer | One Transformer for Both Diffusion and Autoregression | arXiv | 2024/09 |
| Show-o | One Single Transformer to Unify Multimodal Understanding and Generation | ICLR | 2024/08 |
| Transfusion | Predict the Next Token and Diffuse Images with One Multi-Modal Model | ICLR | 2024/08 |
| JanusFlow | Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | arXiv | 2024/11 |
| UniReal | Universal Image Generation and Editing via Learning Real-World Dynamics | arXiv | 2025 |

Any-to-Any Multimodal Models (14 papers)

| Model | Full Title | Venue | Date |
| --- | --- | --- | --- |
| LongCat-Flash-Omni | LongCat-Flash-Omni Technical Report | arXiv | 2025/11 |
| Ming-Flash-Omni | A Sparse, Unified Architecture for Multimodal Perception and Generation | arXiv | 2025/10 |
| Qwen3-Omni | Qwen3-Omni Technical Report | arXiv | 2025/09 |
| Ming-Omni | A Unified Multimodal Model for Perception and Generation | arXiv | 2025/06 |
| M2-omni | Advancing Omni-MLLM for Comprehensive Modality Support | arXiv | 2025/02 |
| OmniFlow | Any-to-Any Generation with Multi-Modal Rectified Flows | CVPR | 2024/12 |
| Spider | Any-to-Many Multimodal LLM | arXiv | 2024/11 |
| MIO | A Foundation Model on Multimodal Tokens | arXiv | 2024/09 |
| X-VILA | Cross-Modality Alignment for Large Language Model | arXiv | 2024/05 |
| AnyGPT | Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024/02 |
| Video-LaVIT | Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | ICML | 2024/02 |
| Unified-IO 2 | Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | CVPR | 2023/12 |
| NExT-GPT | Any-to-Any Multimodal LLM | ICML | 2023/09 |
| ImageBind | One Embedding Space To Bind Them All | CVPR | 2023 |

Applications & Opportunities (4 papers)

| Model | Full Title | Venue | Date |
| --- | --- | --- | --- |
| UniGame | Turning a Unified Multimodal Model Into Its Own Adversary | arXiv | 2025/11 |
| UniCTokens | Boosting Personalized Understanding and Generation via Unified Concept Tokens | arXiv | 2025/05 |
| T2I-R1 | Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT | arXiv | 2025/01 |
| Fair-UMLLM | On Fairness of Unified Multimodal Large Language Model for Image Generation | arXiv | 2025/02 |