Text-to-Image Generation

The progressive evolution of text-conditioned image synthesis — from generative adversarial networks through denoising diffusion to autoregressive next-token prediction.

Text-to-image (T2I) generation has moved through three paradigms: GAN-based approaches (2016–2021), diffusion-based models (2021–present), and the emerging autoregressive transformer paradigm. Each shift has brought marked improvements in fidelity, controllability, and compositional reasoning. Explore the six sub-domains below.

Foundational Models & Face Synthesis

Core T2I architectures across three paradigmatic eras — Diffusion/Transformer (2024–25), Latent Diffusion (2023), and GAN/Early Diffusion (2020–22) — plus the specialized subfield of text-to-face generation.

Controllable & Compositional Generation

Methods enabling fine-grained spatial, structural, or attribute-level control — layout-guided, pose-guided, grounded generation, and compositional attention mechanisms.

Editing, Personalization & Prompts

Text-guided image editing and manipulation, subject-driven personalized generation, and prompt engineering and optimization techniques.

Safety, Evaluation & Applications

Evaluation frameworks, safety and bias analysis, robustness research, and downstream applications including segmentation, restoration, and text rendering.

Cross-Modal: Video, 3D & Motion

Natural extensions of T2I into the temporal and spatial domains — text-to-video, text-to-3D, motion generation, and shape synthesis.

Arena Leaderboards & Benchmarks

Live arena rankings from human preference votes (LM Arena, Artificial Analysis), established benchmarks, quantitative metrics, training datasets, and surveys.

46 ranked models · 12 benchmarks