Benchmarks, Datasets & Metrics

Evaluation frameworks, large-scale video-text datasets, and established quantitative metrics for assessing video generation quality.

Video Generation Benchmarks (2023–2026) 30 benchmarks

Benchmark	Focus / Description	Venue	Year
Demo-ICL-Bench	Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition	arXiv	2026
GenArena	GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?	arXiv	2026
Video-o3	Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning	arXiv	2026
Diffusion-DRF	Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning	arXiv	2026
RISE-Video	RISE-Video: Can Video Generators Decode Implicit World Rules?	arXiv	2026
T2VEval	T2VEval: Benchmark Dataset and Objective Evaluation Method for T2V-generated Videos	arXiv	2026
VideoVerse	VideoVerse: How Far is Your T2V Generator from a World Model?	arXiv	2025
VideoScore	VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation	arXiv	2024
TC-Bench	TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation	arXiv	2025
ConsistBench	ConsistBench: Benchmarking Consistency in Text-to-Video Models	arXiv	2025
VideoAutoArena	VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis	arXiv	2025
VBench-2.0	VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness	arXiv	2025
Stable Cinemetrics	Structured Taxonomy and Evaluation for Professional Video Generation	arXiv	2025
OpenS2V-Nexus	Detailed Benchmark and Million-Scale Dataset for Subject-to-Video	arXiv	2025
Impossible Videos	Physically impossible video evaluation for world models	arXiv	2025
MEt3R	Measuring Multi-View Consistency in Generated Images	arXiv	2025
T2V-CompBench	Comprehensive Benchmark for Compositional Text-to-video Generation	arXiv	2024
ChronoMagic-Bench	Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video	NeurIPS	2024
T2VScore	Towards A Better Metric for Text-to-Video Generation	arXiv	2024
EvalCrafter	Benchmarking and Evaluating Large Video Generation Models	CVPR	2024
VBench	Comprehensive Benchmark Suite for Video Generative Models	arXiv	2023
FETV	Benchmark for Fine-Grained Evaluation of Open-Domain T2V	arXiv	2023
StoryBench	Multifaceted Benchmark for Continuous Story Visualization	NeurIPS	2023
PEEKABOO	Interactive Video Generation via Masked-Diffusion	CVPR	2024
PhyGenBench	Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation	arXiv	2024
VideoPhy	Evaluating Physical Commonsense for Video Generation	arXiv	2024
GenAI-Bench	Evaluating Text-to-Visual Generation with Image-to-Text Generation	arXiv	2024
VidProM	Million-scale Real Prompt-Gallery Dataset for T2V Diffusion Models	arXiv	2024
VBench++	VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models	arXiv	2024
GameGen-Bench	PhyGenBench: Crafting Physical Commonsense-Based Benchmark for Video Generation	arXiv	2025

Large-Scale Video-Text Datasets 28+ datasets

Dataset	Domain	Scale	Resolution	Year
MiraData	Open	330K clips	4K	2025
OpenVid-1M	Open	1M clips	HD	2024
InternVid-G	Open	7M clips	HD	2025
OpenS2V-Nexus	Open	5.4M clips	720P	2025
ChronoMagic-Pro	Open	460K videos	720P	2024
Panda-70M	Open	70.8M clips	720P	2024
VAST-27M	Open	27M clips	—	2024
HD-VG-130M	Open	130M clips	720P	2023
InternVid	Open	234M clips	720P	2023
Youku-mPLUG	Open	10M clips	—	2023
HD-VILA-100M	Open	103M clips	720P	2022
WebVid-2M	Open	2.5M clips	360P	2021
YT-Temporal-180M	Open	180M clips	—	2021
MSR-VTT	Open	10K clips	240P	2016
DiDeMo	Open	27K clips	—	2017
LSMDC	Movie	118K clips	1080P	2017
MAD	Movie	384K sentences	—	2022
UCF-101	Action	13K clips	240P	2012
ActivityNet-200	Action	100K clips	720P	2015
Kinetics	Action	306K clips	—	2017
Charades	Action	10K videos	—	2016
Something-Something V2 (SS-V2)	Action	220K videos	—	2017
HowTo100M	Instructional	136M clips	240P	2019
How2	Instructional	80K clips	—	2018
YouCook2	Cooking	14K clips	—	2018
Epic-Kitchens	Cooking	40K clips	1080P	2018
CelebV-Text	Face	70K clips	480P	2023
ShareGPT4Video	Open	Large-scale	—	2024

Quantitative Metrics for Video Generation 9 metrics

Metric	Level	Description	Reference
FVD	Video	Fréchet Video Distance — measures distributional similarity of generated vs. real videos	Unterthiner et al., ICLR 2019
KVD	Video	Kernel Video Distance — kernel-based alternative to FVD	Unterthiner et al., ICLR 2019
Video IS	Video	Video Inception Score — extension of IS for temporal quality	Saito et al., IJCV 2020
FCS	Video	Frame Consistency Score — temporal coherence measure	Wu et al., ICCV 2023
FMVD	Video	Fréchet Motion Video Distance — evaluates motion consistency	arXiv 2024
FID	Image	Fréchet Inception Distance — per-frame quality	Heusel et al., NeurIPS 2017
IS	Image	Inception Score — quality and diversity	Salimans et al., NeurIPS 2016
CLIP Score	Image	Text-image semantic alignment via CLIP embeddings	Radford et al., ICML 2021
PSNR / SSIM	Image	Peak signal-to-noise ratio / Structural similarity	Wang et al., IEEE TIP 2004
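
As a rough illustration of how some of the simpler metrics in the table are computed, the Python sketch below implements PSNR, a frame-consistency score, and a simplified Fréchet distance using only the standard library. It is a sketch under stated assumptions, not a reference implementation. The function names are hypothetical. Frame embeddings and features are assumed to be precomputed by some encoder (for example CLIP image embeddings for frame consistency, or I3D / Inception features for FVD / FID). Frame consistency is taken here as the mean cosine similarity over all frame pairs, which is one common formulation. The Fréchet distance treats covariances as diagonal, whereas FID and FVD as usually reported use full covariance matrices.

```python
import math
from itertools import combinations
from statistics import fmean, pstdev


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(generated):
        raise ValueError("inputs must contain the same number of pixels")
    mse = fmean((r - g) ** 2 for r, g in zip(reference, generated))
    if mse == 0:
        return math.inf  # identical inputs have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / norm


def frame_consistency(frame_embeddings):
    """Mean cosine similarity over all pairs of per-frame embeddings.

    Assumption: embeddings were computed beforehand by an image encoder.
    """
    if len(frame_embeddings) < 2:
        raise ValueError("need at least two frames")
    pairs = combinations(frame_embeddings, 2)
    return fmean(cosine_similarity(a, b) for a, b in pairs)


def frechet_distance_diagonal(real_features, generated_features):
    """Squared Fréchet distance between two Gaussians with diagonal covariance.

    Simplification (not what FID/FVD tools do): each feature dimension is
    treated independently, so the trace term reduces to a sum of squared
    differences of per-dimension standard deviations.
    """
    total = 0.0
    # zip(*rows) transposes a list of feature vectors into per-dimension columns.
    for real_dim, gen_dim in zip(zip(*real_features), zip(*generated_features)):
        mean_gap = fmean(real_dim) - fmean(gen_dim)
        std_gap = pstdev(real_dim) - pstdev(gen_dim)
        total += mean_gap ** 2 + std_gap ** 2
    return total
```

With full covariances the last function would need a matrix square root of the covariance product, which is why practical FID and FVD implementations rely on a linear algebra library rather than the standard library alone.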

Reinforcement Learning for Video Generation 7+ papers

Model	Full Title	Venue	Year
VideoReward	VideoReward: A Universal Reward Model for Video Generation	arXiv	2025
T2V-Turbo-v2	T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design	arXiv	2025
DenseDPO	Fine-Grained Temporal Preference Optimization for Video Diffusion	arXiv	2025
LiFT	Leveraging Human Feedback for Text-to-Video Model Alignment	arXiv	2024
VideoScore	Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation	arXiv	2024
InstructVideo	Instructing Video Diffusion Models with Human Feedback	arXiv	2023
VADER	Video Diffusion Alignment via Reward Gradients	arXiv	2024
VideoAgent	VideoAgent: Self-Improving Video Generation	arXiv	2025