Cross-Modal: Video, 3D & Motion

Natural extensions of text-to-image (T2I) generation into the temporal and spatial domains: text-to-video, text-to-3D, motion generation, and shape synthesis.

Text-to-Video Generation (15 papers)

Text-to-video generation extends text-to-image synthesis into the temporal domain, producing coherent video sequences from textual descriptions with diffusion, autoregressive, and hybrid architectures.

| Model | Full Title | Venue | Year |
|---|---|---|---|
| Sora | Video generation models as world simulators | OpenAI Tech Report | 2024 |
| Movie Gen | Movie Gen: A Cast of Media Foundation Models | Meta | 2024 |
| CogVideoX | CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer | arXiv | 2024 |
| HunyuanVideo | HunyuanVideo: A Systematic Framework For Large Video Generation Model | Tencent | 2024 |
| Wan-Video | Wan: Open and Advanced Large-Scale Video Generative Models | Alibaba | 2025 |
| Step-Video | Step-Video-T2V: A New Paradigm for Long Video Generation | StepFun | 2025 |
| SkyReels-V2 | SkyReels-V2: Infinite-Length Film Generation with Diffusion Forcing | Kunlun | 2025 |
| Align your Latents | High-Resolution Video Synthesis with Latent Diffusion Models | CVPR | 2023 |
| LaVIE | High-Quality Video Generation with Cascaded Latent Diffusion | arXiv | 2023 |
| Emu Video | Factorizing Text-to-Video Generation by Explicit Image Conditioning | arXiv | 2023 |
| Make-A-Video | Text-to-Video Generation without Text-Video Data | arXiv | 2022 |
| Imagen Video | High Definition Video Generation with Diffusion Models | arXiv | 2022 |
| CogVideo | Large-scale Pretraining for Text-to-Video via Transformers | arXiv | 2022 |
| Video Diffusion Models | Foundational video diffusion framework | arXiv | 2022 |
| Lumina-T2X | Transforming Text into Any Modality via Flow-based Large DiT | arXiv | 2024 |

Text-to-3D, Motion & Shape Generation (12 papers)

| Model | Full Title | Venue | Year |
|---|---|---|---|
| Trellis | TRELLIS: Structured 3D Latents for Scalable and Versatile 3D Generation | Microsoft | 2025 |
| InstantMesh | InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-View LRMs | arXiv | 2024 |
| TripoSR | TripoSR: Fast 3D Object Reconstruction from a Single Image | StabilityAI | 2024 |
| Rodin Gen-1 | Rodin Gen-1: Autoregressive Generation Beats Diffusion for 3D Generation | Microsoft | 2025 |
| Meta 3D Gen | Text-to-Mesh with High-Quality Geometry and PBR Materials | Meta | 2024 |
| LATTE3D | Large-scale Amortized Text-To-Enhanced3D Synthesis | arXiv | 2024 |
| ProlificDreamer | High-Fidelity Text-to-3D with Variational Score Distillation | arXiv | 2023 |
| DreamFusion | Text-to-3D using 2D Diffusion | ICLR | 2023 |
| Magic3D | High-Resolution Text-to-3D Content Creation | arXiv | 2022 |
| Point-E | Generating 3D Point Clouds from Complex Prompts | arXiv | 2022 |
| T2M-GPT | Generating Human Motion from Textual Descriptions | arXiv | 2023 |
| Human Motion Diffusion | Human Motion Diffusion Model | arXiv | 2022 |