| Demo-ICL-Bench | Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition | arXiv | 2026 |
| GenArena | GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks? | arXiv | 2026 |
| Video-o3 | Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning | arXiv | 2026 |
| Diffusion-DRF | Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning | arXiv | 2026 |
| RISE-Video | RISE-Video: Can Video Generators Decode Implicit World Rules? | arXiv | 2026 |
| T2VEval | T2VEval: Benchmark Dataset and Objective Evaluation Method for T2V-generated Videos | arXiv | 2026 |
| VideoVerse | VideoVerse: How Far is Your T2V Generator from a World Model? | arXiv | 2025 |
| VideoScore | VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation | arXiv | 2025 |
| TC-Bench | TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation | arXiv | 2025 |
| ConsistBench | ConsistBench: Benchmarking Consistency in Text-to-Video Models | arXiv | 2025 |
| VideoAutoArena | VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis | arXiv | 2025 |
| VBench-2.0 | VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness | arXiv | 2025 |
| Stable Cinemetrics | Structured Taxonomy and Evaluation for Professional Video Generation | arXiv | 2025 |
| OpenS2V-Nexus | Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation | arXiv | 2025 |
| Impossible Videos | Physically impossible video evaluation for world models | arXiv | 2025 |
| MEt3R | Measuring Multi-View Consistency in Generated Images | arXiv | 2025 |
| T2V-CompBench | Comprehensive Benchmark for Compositional Text-to-video Generation | arXiv | 2024 |
| ChronoMagic-Bench | Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation | NeurIPS | 2024 |
| T2VScore | Towards A Better Metric for Text-to-Video Generation | arXiv | 2024 |
| EvalCrafter | Benchmarking and Evaluating Large Video Generation Models | CVPR | 2024 |
| VBench | Comprehensive Benchmark Suite for Video Generative Models | arXiv | 2023 |
| FETV | Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation | arXiv | 2023 |
| StoryBench | Multifaceted Benchmark for Continuous Story Visualization | NeurIPS | 2023 |
| PEEKABOO | Interactive Video Generation via Masked-Diffusion | CVPR | 2024 |
| PhyGenBench | Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation | arXiv | 2024 |
| VideoPhy | Evaluating Physical Commonsense for Video Generation | arXiv | 2024 |
| GenAI-Bench | Evaluating Text-to-Visual Generation with Image-to-Text Generation | arXiv | 2024 |
| VidProM | Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models | arXiv | 2024 |
| VBench++ | VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | arXiv | 2024 |
| GameGen-Bench | PhyGenBench: Crafting Physical Commonsense-Based Benchmark for Video Generation | arXiv | 2025 |