Benchmarks, Datasets & Metrics

Evaluation frameworks, large-scale video-text datasets, and established quantitative metrics for assessing video generation quality.

Video Generation Benchmarks (2023–2025) 16+ benchmarks

| Benchmark | Focus / Description | Venue | Year |
|---|---|---|---|
| VideoScore | VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation | arXiv | 2024 |
| TC-Bench | TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation | arXiv | 2025 |
| ConsistBench | ConsistBench: Benchmarking Consistency in Text-to-Video Models | arXiv | 2025 |
| VideoAutoArena | VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis | arXiv | 2025 |
| VBench-2.0 | VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness | arXiv | 2025 |
| Stable Cinemetrics | Structured Taxonomy and Evaluation for Professional Video Generation | arXiv | 2025 |
| OpenS2V-Nexus | Detailed Benchmark and Million-Scale Dataset for Subject-to-Video | arXiv | 2025 |
| Impossible Videos | Physically impossible video evaluation for world models | arXiv | 2025 |
| MEt3R | Measuring Multi-View Consistency in Generated Images | arXiv | 2025 |
| T2V-CompBench | Comprehensive Benchmark for Compositional Text-to-video Generation | arXiv | 2024 |
| ChronoMagic-Bench | Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video | NeurIPS | 2024 |
| T2VScore | Towards A Better Metric for Text-to-Video Generation | arXiv | 2024 |
| EvalCrafter | Benchmarking and Evaluating Large Video Generation Models | CVPR | 2024 |
| VBench | Comprehensive Benchmark Suite for Video Generative Models | arXiv | 2023 |
| FETV | Benchmark for Fine-Grained Evaluation of Open-Domain T2V | arXiv | 2023 |
| StoryBench | Multifaceted Benchmark for Continuous Story Visualization | NeurIPS | 2023 |
| PEEKABOO | Interactive Video Generation via Masked-Diffusion | CVPR | 2024 |
| PhyGenBench | World Simulator: Crafting Physical Commonsense-Based Benchmark | arXiv | 2024 |
| VideoPhy | Evaluating Physical Commonsense for Video Generation | arXiv | 2024 |
| GenAI-Bench | Evaluating Text-to-Visual Generation with Image-to-Text Generation | arXiv | 2024 |
| VidProM | Million-scale Real Prompt-Gallery Dataset for T2V Diffusion Models | arXiv | 2024 |
| VBench++ | VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | arXiv | 2024 |
| GameGen-Bench | PhyGenBench: Crafting Physical Commonsense-Based Benchmark for Video Generation | arXiv | 2025 |

Large-Scale Video-Text Datasets 28+ datasets

| Dataset | Domain | Scale | Resolution | Year |
|---|---|---|---|---|
| MiraData | Open-Domain / General | 330K clips | 4K | 2025 |
| OpenVid-1M | Open-Domain | 1M clips | HD | 2024 |
| InternVid-G | General | 7M clips | HD | 2025 |
| OpenS2V-Nexus | Open | 5.4M clips | 720P | 2025 |
| ChronoMagic-Pro | Open | 460K videos | 720P | 2024 |
| Panda-70M | Open | 70.8M clips | 720P | 2024 |
| VAST-27M | Open | 27M clips | | 2024 |
| HD-VG-130M | Open | 130M clips | 720P | 2023 |
| InternVid | Open | 234M clips | 720P | 2023 |
| Youku-mPLUG | Open | 10M clips | | 2023 |
| HD-VILA-100M | Open | 103M clips | 720P | 2022 |
| WebVid2M | Open | 2.5M clips | 360P | 2021 |
| YT-Temporal-180M | Open | 180M clips | | 2021 |
| MSR-VTT | Open | 10K clips | 240P | 2016 |
| DiDeMo | Open | 27K clips | | 2017 |
| LSMDC | Movie | 118K clips | 1080P | 2017 |
| MAD | Movie | 384K sentences | | 2022 |
| UCF-101 | Action | 13K clips | 240P | 2012 |
| ActivityNet-200 | Action | 100K clips | 720P | 2015 |
| Kinetics | Action | 306K clips | | 2017 |
| Charades | Action | 10K videos | | 2016 |
| SS-V2 | Action | 220K videos | | 2017 |
| HowTo100M | Instruct | 136M clips | 240P | 2019 |
| How2 | Instruct | 80K clips | | 2018 |
| YouCook2 | Cooking | 14K clips | | 2018 |
| Epic-Kitchens | Cooking | 40K clips | 1080P | 2018 |
| CelebV-Text | Face | 70K clips | 480P | 2023 |
| ShareGPT4Video | Open | Large-scale | | 2024 |

Quantitative Metrics for Video Generation 9 metrics

| Metric | Level | Description | Reference |
|---|---|---|---|
| FVD | Video | Fréchet Video Distance: measures distributional similarity of generated vs. real videos | Unterthiner et al., ICLR 2019 |
| KVD | Video | Kernel Video Distance: kernel-based alternative to FVD | Unterthiner et al., ICLR 2019 |
| Video IS | Video | Video Inception Score: extension of IS for temporal quality | Saito et al., IJCV 2020 |
| FCS | Video | Frame Consistency Score: temporal coherence measure | Wu et al., ICCV 2023 |
| FVMD | Video | Fréchet Video Motion Distance: evaluates motion consistency | arXiv 2024 |
| FID | Image | Fréchet Inception Distance: per-frame quality | Heusel et al., NeurIPS 2017 |
| IS | Image | Inception Score: quality and diversity | Salimans et al., NeurIPS 2016 |
| CLIP Score | Image | Text-image semantic alignment via CLIP embeddings | Radford et al., ICML 2021 |
| PSNR / SSIM | Image | Peak signal-to-noise ratio / Structural similarity | Wang et al., IEEE TIP 2004 |
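
Several of the metrics listed above reduce to short formulas once features or pixels are in hand. The sketch below is illustrative only and uses the Python standard library: `psnr` follows the usual definition 10·log10(MAX² / MSE); `frame_consistency` assumes one common formulation of a frame consistency score (mean cosine similarity over all pairs of per-frame embeddings, which the caller is assumed to have extracted already, for example with CLIP); and `frechet_distance_diagonal` is a deliberately simplified Fréchet distance that assumes diagonal covariances, whereas FID and FVD use full covariance matrices and a matrix square root over Inception or I3D features. All function names are hypothetical and not taken from any library.

```python
import math
from itertools import combinations


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-norm vector")
    return dot / norm


def frame_consistency(frame_embeddings):
    """Mean cosine similarity over all pairs of per-frame embeddings.

    Assumption: embeddings are precomputed by the caller; this function
    does not run any image encoder itself.
    """
    pairs = list(combinations(frame_embeddings, 2))
    if not pairs:
        raise ValueError("need at least two frames")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def frechet_distance_diagonal(real_features, generated_features):
    """Squared Fréchet distance between two Gaussians with DIAGONAL covariance.

    Simplification: with diagonal covariances the trace term
    Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) collapses to sum((sd_r - sd_g)^2).
    Population (not sample) variance is used to keep the sketch short.
    """
    def moments(features):
        dims = list(zip(*features))  # transpose: one tuple per feature dimension
        means = [sum(d) / len(d) for d in dims]
        stds = [
            math.sqrt(sum((x - m) ** 2 for x in d) / len(d))
            for d, m in zip(dims, means)
        ]
        return means, stds

    mu_r, sd_r = moments(real_features)
    mu_g, sd_g = moments(generated_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum((a - b) ** 2 for a, b in zip(sd_r, sd_g))
    return mean_term + cov_term
```

For reported numbers, a full-covariance implementation over the feature extractor named in the corresponding paper is needed; the diagonal version here only shows how the mean and spread terms combine.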

Reinforcement Learning for Video Generation 7+ papers

| Model | Full Title | Venue | Year |
|---|---|---|---|
| VideoReward | VideoReward: A Universal Reward Model for Video Generation | arXiv | 2025 |
| T2V-Turbo-v2 | T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design | arXiv | 2025 |
| DenseDPO | Fine-Grained Temporal Preference Optimization for Video Diffusion | arXiv | 2025 |
| LiFT | Leveraging Human Feedback for Text-to-Video Model Alignment | arXiv | 2024 |
| VideoScore | Building Automatic Metrics to Simulate Fine-grained Human Feedback | arXiv | 2024 |
| InstructVideo | Instructing Video Diffusion Models with Human Feedback | arXiv | 2023 |
| Video Diffusion Alignment | Video Diffusion Alignment via Reward Gradients | arXiv | 2024 |
| VideoAgent | VideoAgent: Self-Improving Video Generation | arXiv | 2025 |