Foundational Models & Face Synthesis

Core text-to-image (T2I) architectures across three eras, plus the specialized subfield of text-to-face generation.

Diffusion & Transformer Era (2024–2026) 36+ papers

Model | Full Title | Venue | Year
FLUX.2 | FLUX.2: Next-Generation Flow Matching for Ultra-High-Quality Image Synthesis | BFL | 2025
UltraFlux | UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios | arXiv | 2025
CoF-T2I | CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation | arXiv | 2026
Recraft V3 | Recraft V3: Scalable Generative Image Model with Industry-Grade Quality | Recraft | 2025
Ideogram 2.0 | Ideogram 2.0: Advanced Text Rendering and Compositional Image Synthesis | Ideogram | 2025
HunyuanDiT 2.0 | HunyuanDiT 2.0: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | Tencent | 2025
StoryDiffusion | StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation | CVPR | 2025
FLUX.1 | FLUX.1: State-of-the-Art Open-Source Text-to-Image Model | BFL | 2024
Stable Diffusion 3.5 | Stable Diffusion 3.5: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis | StabilityAI | 2024
AuraFlow | AuraFlow: Open-Source Flow-Based T2I Generation Model | Fal.ai | 2024
Meissonic | Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis | arXiv | 2024
OmniGen | OmniGen: Unified Image Generation | arXiv | 2024
Lumina-Next | Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT | arXiv | 2024
HiDiffusion | HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models | arXiv | 2025
CogView4 | CogView4: 16K-Resolution Image Generation with Relay Diffusion | Zhipu AI | 2025
Seedream 3.0 | Seedream 3.0: Scaling Up Diffusion Transformers | ByteDance | 2025
GenExam | A Multidisciplinary Text-to-Image Exam | arXiv | 2025
RefVNLI | Towards Scalable Evaluation of Subject-driven Text-to-image Generation | arXiv | 2025
GPT-4o Image Study | An Empirical Study of GPT-4o Image Generation Capabilities | arXiv | 2025
Imagen 3 | Imagen 3 | Google DeepMind | 2024
PixArt-α | Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis | ICLR | 2024
PixArt-Σ | Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | arXiv | 2024
PixArt-δ | Fast and Controllable Image Generation with Latent Consistency Models | arXiv | 2024
SDXL-Lightning | Progressive Adversarial Diffusion Distillation | arXiv | 2024
Kolors | Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis | Kuaishou | 2024
MARS | Mixture of Auto-Regressive Models for Fine-grained Text-to-Image Synthesis | arXiv | 2024
Kandinsky 3 | Text-to-Image Synthesis for Multifunctional Generative Framework | EMNLP | 2024
RealCompo | Dynamic Equilibrium between Realism and Compositionality Improves T2I Diffusion Models | arXiv | 2024
ECLIPSE | A Resource-Efficient Text-to-Image Prior for Image Generations | CVPR | 2024
Ranni | Taming Text-to-Image Diffusion for Accurate Instruction Following | CVPR | 2024
DiffusionGPT | LLM-Driven Text-to-Image Generation System | arXiv | 2024
Playground v2.5 | Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation | arXiv | 2024
Dimba | Transformer-Mamba Diffusion Models | arXiv | 2024
SELMA | Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data | arXiv | 2024
RealCustom | Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization | CVPR | 2024
CoMat | Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching | arXiv | 2024
TextCraftor | Your Text Encoder Can be Image Quality Controller | arXiv | 2024
AutoStudio | Crafting Consistent Subjects in Multi-turn Interactive Image Generation | arXiv | 2024
TheaterGen | Character Management with LLM for Consistent Multi-turn Image Generation | arXiv | 2024
Flow Generator Matching | Flow Generator Matching | arXiv | 2024
Lumina-T2X | Transforming Text into Any Modality via Flow-based Large DiT | arXiv | 2024
4M-21 | An Any-to-Any Vision Model for Tens of Tasks and Modalities | arXiv | 2024

Latent Diffusion & Transformer Era (2023) 30+ papers

Model | Full Title | Venue | Year
ControlNet | Adding Conditional Control to Text-to-Image Diffusion Models | ICCV | 2023
GLIGEN | Open-Set Grounded Text-to-Image Generation | CVPR | 2023
Attend-and-Excite | Attention-Based Semantic Guidance for Text-to-Image Diffusion Models | arXiv | 2023
GALIP | Generative Adversarial CLIPs for Text-to-Image Synthesis | CVPR | 2023
Muse | Text-To-Image Generation via Masked Generative Transformers | arXiv | 2023
StyleDrop | Text-to-Image Generation in Any Style | arXiv | 2023
Prompt-Free Diffusion | Taking "Text" out of Text-to-Image Diffusion Models | arXiv | 2023
Visual ChatGPT | Talking, Drawing and Editing with Visual Foundation Models | arXiv | 2023
Kandinsky | An Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion | arXiv | 2023
Pick-a-Pic | An Open Dataset of User Preferences for Text-to-Image Generation | arXiv | 2023
eDiff-I | Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers | arXiv | 2023
Blended Latent Diffusion | Blended Latent Diffusion | SIGGRAPH | 2023
The Chosen One | Consistent Characters in Text-to-Image Diffusion Models | arXiv | 2023
UFOGen | You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs | arXiv | 2023
BoxDiff | Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion | ICCV | 2023
ITI-GEN | Inclusive Text-to-Image Generation | ICCV | 2023
Mini-DALLE3 | Interactive Text to Image by Prompting Large Language Models | arXiv | 2023
T2I-CompBench | A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation | arXiv | 2023
DiffBlender | Scalable and Composable Multimodal Text-to-Image Diffusion Models | arXiv | 2023
ElasticDiffusion | Training-free Arbitrary Size Image Generation | arXiv | 2023
Multi-Concept Customization | Multi-Concept Customization of Text-to-Image Diffusion | CVPR | 2023
BLIP-Diffusion | Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing | arXiv | 2023
Universal Guidance | Universal Guidance for Diffusion Models | arXiv | 2023
AltCLIP / AltDiffusion | Altering the Language Encoder in CLIP for Extended Language Capabilities | ACL Findings | 2023
Expressive Rich Text | Expressive Text-to-Image Generation with Rich Text | ICCV | 2023
Scaling up GANs | Scaling up GANs for Text-to-Image Synthesis | CVPR | 2023
CoDi-2 | In-Context, Interleaved, and Interactive Any-to-Any Generation | arXiv | 2023
Detector Guidance | Detector Guidance for Multi-Object Text-to-Image Generation | arXiv | 2023
A-STAR | Test-time Attention Segregation and Retention for Text-to-image Synthesis | arXiv | 2023
Training-Free Structured Diffusion | Training-Free Structured Diffusion Guidance for Compositional T2I Synthesis | ICLR | 2023

GAN & Early Diffusion Era (2018–2022) 15+ papers

Model | Full Title | Venue | Year
Stable Diffusion (LDM) | High-Resolution Image Synthesis with Latent Diffusion Models | CVPR | 2022
Imagen | Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding | NeurIPS | 2022
DALL·E 2 | Hierarchical Text-Conditional Image Generation with CLIP Latents | arXiv | 2022
Parti | Scaling Autoregressive Models for Content-Rich Text-to-Image Generation | TMLR | 2022
OFA (Unified-IO) | Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | arXiv | 2022
Versatile Diffusion | Text, Images and Variations All in One Diffusion Model | arXiv | 2022
Frido | Feature Pyramid Diffusion for Complex Scene Image Synthesis | arXiv | 2022
NUWA-Infinity | Autoregressive over Autoregressive Generation for Infinite Visual Synthesis | arXiv | 2022
NÜWA | Visual Synthesis Pre-training for Neural visUal World creAtion | ECCV | 2022
DALL·E | Zero-Shot Text-to-Image Generation | arXiv | 2021
L-Verse | Bidirectional Generation Between Image and Text | arXiv | 2021
ERNIE-ViLG | Unified Generative Pre-training for Bidirectional Vision-Language Generation (10B parameters) | arXiv | 2021
M6-UFC | Unifying Multi-Modal Controls for Conditional Image Synthesis via Non-Autoregressive Generative Transformers | NeurIPS | 2021
ManiGAN | Text-Guided Image Manipulation | CVPR | 2020
TAGAN | Text-adaptive Generative Adversarial Networks: Manipulating Images with Natural Language | NeurIPS | 2018

Text-to-Face Synthesis 22+ papers

A specialized subfield dedicated to generating and manipulating human facial imagery from textual descriptions, encompassing 2D face synthesis, 3D avatar generation, and attribute-level control.
Model | Full Title | Venue | Year
PreciseControl | Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control | ECCV | 2024
CosmicMan | A Text-to-Image Foundation Model for Humans | CVPR | 2024
15M Facial Dataset | 15M Multimodal Facial Image-Text Dataset | arXiv | 2024
Portrait3D | Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior | arXiv | 2024
Fast T2-3D Face | Fast Text-to-3D-Aware Face Generation and Manipulation via Direct Cross-modal Mapping | ICML | 2024
Celeb Basis | Inserting Anybody in Diffusion Models via Celeb Basis | NeurIPS | 2023
DreamFace | Progressive Generation of Animatable 3D Faces under Text Guidance | SIGGRAPH | 2023
Collaborative Diffusion | Multi-Modal Face Generation and Editing | CVPR | 2023
High-Fidelity 3D Face | High-Fidelity 3D Face Generation from Natural Language Descriptions | CVPR | 2023
Mukh-Oboyob | Stable Diffusion and BanglaBERT enhanced Bangla Text-to-Face Synthesis | IJACSA | 2023
clip2latent | Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP | BMVC | 2022
AnyFace | Free-style Text-to-Face Synthesis and Manipulation | CVPR | 2022
StyleT2I | Toward Compositional and High-Fidelity Text-to-Image Synthesis | CVPR | 2022
CMAFGAN | A Cross-Modal Attention Fusion based GAN for Attribute Word-to-Face Synthesis | Knowledge-Based Systems | 2022
DualG-GAN | A Dual-channel Generator based GAN for Text-to-Face Synthesis | Neural Networks | 2022
ManiCLIP | Multi-Attribute Face Manipulation from Text | arXiv | 2022
TextFace | Text-to-Style Mapping based Face Generation and Manipulation | IEEE TNSE | 2022
TediGAN | Text-Guided Diverse Image Generation and Manipulation | CVPR | 2021
FG-GAN | Generative Adversarial Network for Text-to-Face Synthesis with Pretrained BERT | FG | 2021
Multi-caption T2F | Multi-caption Text-to-Face Synthesis: Dataset and Algorithm | ACMMM | 2021
Faces à la Carte | Text-to-Face Generation via Attribute Disentanglement | WACV | 2021
FTGAN | A Fully-trained Generative Adversarial Networks for Text to Face Generation | arXiv | 2019