Text-to-Video Generation
The evolution of text-conditioned video synthesis, from early adversarial and autoregressive approaches through denoising diffusion to contemporary diffusion transformer (DiT) architectures.
Text-to-video (T2V) generation has evolved through three distinct phases: early GAN-based and autoregressive approaches (2017–2022), video diffusion models (2022–2024), and the emerging diffusion transformer (DiT) paradigm, which scales to high-resolution, long-duration synthesis. Milestone systems such as Sora (OpenAI), Wan (Alibaba), and HunyuanVideo (Tencent) exemplify the rapid convergence toward photorealistic, temporally coherent video generation from natural language. Explore the three sub-domains below.
Foundation Models & Toolboxes
Core T2V architectures, open-source platforms and development toolboxes, and commercial systems, ranging from GANs to diffusion transformers.
Controllable, Efficient & Long Video
Camera/motion/trajectory control, inference acceleration, and techniques for generating longer and higher-quality videos.
Benchmarks, Datasets & Metrics
Evaluation benchmarks (VBench, EvalCrafter), large-scale video-text datasets, and standard metrics for video generation quality.
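One widely used metric in this space is frame-averaged CLIP text-video similarity (often reported as CLIPSIM). The sketch below is a minimal, illustrative implementation, assuming the Hugging Face transformers CLIP model and a list of decoded PIL frames; the function name clip_text_video_score is hypothetical, and full benchmark suites such as VBench aggregate many evaluation dimensions beyond this single score.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint choice; any CLIP model with image and text towers works.
MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def clip_text_video_score(frames, prompt):
    """Mean cosine similarity between the prompt embedding and each frame embedding.

    `frames` is a list of PIL.Image frames sampled from the generated video.
    """
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Normalize so the dot products are cosine similarities in [-1, 1].
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()
```

In practice, such scores are averaged over a prompt set, and temporal consistency is evaluated separately (for example, as frame-to-frame feature similarity), since per-frame text alignment alone does not capture motion quality.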