Text-to-Video Generation

The evolution of text-conditioned video synthesis, from early adversarial and autoregressive approaches through denoising diffusion models to contemporary diffusion transformer (DiT) architectures.

Text-to-video (T2V) generation has progressed through three paradigms: early GAN-based and autoregressive approaches (2017–2022), video diffusion models (2022–2024), and the diffusion transformer (DiT) paradigm, which scales to high-resolution, long-duration synthesis. Milestone systems such as Sora (OpenAI), Wan (Alibaba), and HunyuanVideo (Tencent) exemplify the rapid convergence toward photorealistic, temporally coherent video generation from natural language. Explore the three sub-domains below.
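The diffusion and DiT paradigms share the same iterative denoising loop at inference time; what changes between them is mainly the backbone network. As a point of orientation, here is a minimal sketch of that loop over a spatio-temporal latent, assuming a standard DDPM-style linear schedule. The `denoiser` function, tensor shapes, and embedding dimensions are illustrative placeholders, not any particular system's implementation.

```python
# A minimal sketch of the denoising-diffusion sampling loop shared by video
# diffusion models and DiT-based systems. The denoiser is a stand-in
# (hypothetical) for a real text-conditioned video backbone.
import torch

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule (DDPM-style)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative signal-retention terms

def denoiser(x_t, t, text_emb):
    """Stand-in for a text-conditioned noise-prediction network.
    A real system would run a 3D U-Net or a diffusion transformer here."""
    return torch.zeros_like(x_t)           # placeholder epsilon-prediction

@torch.no_grad()
def sample(text_emb, frames=16, height=32, width=32, channels=4):
    # Video is one latent tensor with an explicit frame axis, so every
    # denoising step jointly updates all frames (temporal coherence).
    x = torch.randn(1, channels, frames, height, width)
    for t in reversed(range(T)):
        eps = denoiser(x, t, text_emb)
        a, ab = alphas[t], alpha_bars[t]
        # DDPM posterior mean: remove the predicted noise component.
        x = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # ancestral noise
    return x

video_latents = sample(text_emb=torch.randn(1, 77, 768))  # e.g. a CLIP-style text embedding
print(video_latents.shape)  # torch.Size([1, 4, 16, 32, 32])
```

In practice the latent is decoded to pixels by a separate video VAE, and systems differ in how they trade off this ancestral sampler for faster solvers; the loop above is only meant to fix the shared mechanism the three sub-domains build on.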