4D Spatial Intelligence

A comprehensive survey spanning the hierarchy from monocular depth estimation and camera pose recovery, through dense 3D/4D tracking and dynamic scene reconstruction, to human-centric motion capture and physics-based simulation, covering the full landscape of reconstructing spatial intelligence from video.

Reconstructing the 4D world (three spatial dimensions plus time) from video has emerged as a central challenge in computer vision. The field is structured along a hierarchy of spatial complexity: from low-level depth and camera pose estimation to dense 3D point tracking, and from static 3D scene reconstruction at small and large scales to full 4D dynamic scene modeling via deformable NeRFs and 4D Gaussian Splatting. Human-centric 4D capture, encompassing SMPL-based mesh recovery, egocentric motion tracking, and appearance-rich avatar modeling, enables rich understanding of people in motion. Physics-based simulation, in turn, grounds these reconstructions in physically plausible dynamics for character control, human-object interaction, and scene understanding. This pillar covers 550+ papers organized across six sub-domains, drawing on the Awesome-4D-Spatial-Intelligence survey and the latest research from arXiv.