LLM-3D Understanding & Generation

A curated collection of papers on multimodal large language models for 3D scene understanding, spatial reasoning, point cloud comprehension, embodied agents, and language-driven 3D generation.

3D Understanding via LLM (2022–2026) · 44+ papers

| Model | Institute | Publication | Year |
|---|---|---|---|
| SpatialRGPT | UCSD | NeurIPS | 2024 |
| LLaVA-3D | HKU | arXiv | 2025 |
| Seg3D | | arXiv | 2025 |
| 3D-LLM | UCLA | NeurIPS | 2023 |
| PointLLM | CUHK | ECCV | 2024 |
| 3D-LLaVA | U of Adelaide | CVPR | 2025 |
| LEO | BIGAI | ICML | 2024 |
| GPT4Scene | HKU | arXiv | 2025 |
| Robin3D | HKU | ICCV | 2025 |
| ShapeLLM | XJTU | arXiv | 2024 |
| SpatialVLM | Google DeepMind | CVPR | 2024 |
| Spatial-MLLM | THU | arXiv | 2025 |
| MM-Spatial | | arXiv | 2025 |
| Part-X-MLLM | | ICLR | 2026 |
| 3D-R1 | PKU | arXiv | 2025 |
| LEO-VL | BIGAI | arXiv | 2025 |
| Video-3D LLM | CUHK | CVPR | 2025 |
| PerLA | FBK | CVPR | 2025 |
| Chat-Scene | ZJU | NeurIPS | 2024 |
| LL3DA | Fudan | arXiv | 2023 |
| Uni3D | BAAI | ICLR | 2024 |
| MiniGPT-3D | HUST | ACM MM | 2024 |
| G2VLM | Shanghai AI Lab | arXiv | 2025 |
| Ross3D | CASIA | arXiv | 2025 |
| SplatTalk | GIT | arXiv | 2025 |
| GreenPLM | HUST | arXiv | 2024 |
| Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions | | arXiv | 2025 |
| VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction | | arXiv | 2025 |
| HMR3D: Hierarchical Multimodal Representations for 3D Scene Understanding | | arXiv | 2025 |
| UniVLG: Unifying 2D and 3D Vision-Language Understanding and Grounding | | arXiv | 2025 |
| Pts3D-LLM: 3D Point Cloud Features for Multimodal Large Language Models | | arXiv | 2025 |
| Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in LLMs | | arXiv | 2025 |
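
Architecturally, most of the models above share one recipe: a pretrained 3D encoder turns the point cloud into features, a lightweight projector maps those features into the LLM's embedding space, and the resulting "scene tokens" are prepended to the text prompt. A minimal PyTorch sketch of that interface; the module names, dimensions, and pooling scheme are illustrative assumptions, not taken from any specific paper:

```python
import torch
import torch.nn as nn

class PointCloudProjector(nn.Module):
    """Maps per-point features from a frozen 3D encoder into LLM token space.

    Illustrative only: real systems (e.g. PointLLM, 3D-LLM) use pretrained
    encoders such as Point-BERT and model-specific projector designs.
    """
    def __init__(self, point_feat_dim=384, llm_hidden_dim=4096, num_3d_tokens=32):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_3d_tokens)   # N points -> fixed token count
        self.proj = nn.Sequential(
            nn.Linear(point_feat_dim, llm_hidden_dim),
            nn.GELU(),
            nn.Linear(llm_hidden_dim, llm_hidden_dim),
        )

    def forward(self, point_feats):                 # (B, N, C) per-point features
        x = point_feats.transpose(1, 2)             # (B, C, N) for pooling over points
        x = self.pool(x).transpose(1, 2)            # (B, T, C) fixed-length "3D tokens"
        return self.proj(x)                         # (B, T, D) in LLM embedding space

# Usage: prepend the projected 3D tokens to the embedded text prompt.
point_feats = torch.randn(1, 8192, 384)             # mock encoder output for 8192 points
text_embeds = torch.randn(1, 24, 4096)              # mock embedded text tokens
scene_tokens = PointCloudProjector()(point_feats)
llm_inputs = torch.cat([scene_tokens, text_embeds], dim=1)  # (1, 32+24, 4096)
```

The papers differ mainly in what stands in for the encoder (Point-BERT in PointLLM, 2D-lifted features in 3D-LLM, video frames in GPT4Scene and Video-3D LLM) and in how many scene tokens reach the LLM.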

3D Understanding via Foundation Models (CLIP, SAM) · 28+ papers

| Model | Institute | Publication | Year |
|---|---|---|---|
| OpenScene | ETH Zurich | CVPR | 2023 |
| LERF | UC Berkeley | ICCV | 2023 |
| ConceptFusion | MIT | RSS | 2023 |
| OpenMask3D | ETH Zurich | NeurIPS | 2023 |
| CLIP2Scene | HKU | CVPR | 2023 |
| PLA | HKU | CVPR | 2023 |
| Contrastive Lift | Oxford-VGG | NeurIPS | 2023 |
| SAGA | SJTU | AAAI | 2025 |
| Lexicon3D | UIUC | NeurIPS | 2024 |
| CrossOver | Stanford | CVPR | 2025 |
| Any2Point | Shanghai AI Lab | ECCV | 2024 |
| CoDA / CoDAv2 | HKUST | NeurIPS / TPAMI | 2023–25 |
| POMA-3D | Imperial | arXiv | 2025 |
| Diff2Scene | CMU | ECCV | 2024 |
| 3D-OVS | NTU | NeurIPS | 2023 |
| Open-Vocabulary SAM3D: Understand Any 3D Scene | | arXiv | 2025 |
| Segment then Splat: Open-Vocabulary Segmentation on Gaussian Splatting | | NeurIPS | 2025 |
| Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting | | ECCV | 2024 |
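
The common thread in this group is lifting 2D foundation-model features into 3D: per-pixel CLIP-style embeddings are projected onto points (or Gaussians), and labels come from cosine similarity against free-form text prompts, with no 3D-specific training. A self-contained sketch of that last step with mock features; a real pipeline (e.g. OpenScene, LERF) would obtain `point_feats` by multi-view fusion of actual CLIP/LSeg feature maps:

```python
import numpy as np

def open_vocab_labels(point_feats, text_feats):
    """Assign each 3D point the open-vocabulary class whose text embedding
    is most cosine-similar to the point's lifted 2D feature.

    point_feats: (N, D) per-point features lifted from 2D VLM feature maps.
    text_feats:  (K, D) embeddings of K free-form class prompts.
    Returns (N,) class indices. Mock data below stands in for real
    CLIP-style encoders and multi-view feature fusion.
    """
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = p @ t.T                        # (N, K) cosine similarities
    return sims.argmax(axis=1)

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 512))     # mock lifted features for 1000 points
prompts = rng.normal(size=(3, 512))       # mock embeddings, e.g. ["chair", "table", "floor"]
labels = open_vocab_labels(points, prompts)
print(np.bincount(labels, minlength=3))   # points assigned per class
```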

3D Reasoning · 8+ papers

| Model | Institute | Publication | Year |
|---|---|---|---|
| Situation3D | UIUC | CVPR | 2024 |
| MSR3D | BIGAI | NeurIPS | 2024 |
| 3D-CLR | UCLA | CVPR | 2023 |
| Transcribe3D | TTI Chicago | CoRL | 2023 |
| RoboTracer | BUAA | arXiv | 2025 |
| RoboRefer | BUAA | arXiv | 2025 |
| SceneCOT | BIGAI | arXiv | 2025 |
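
What separates this group from plain 3D captioning is situated reasoning: a relation like "left of" depends on the agent's pose, not on world axes, which is the premise behind Situation3D and MSR3D. A small geometry helper making that concrete; the function name and sign conventions are illustrative, not any benchmark's official code:

```python
import numpy as np

def egocentric_relation(agent_pos, agent_yaw, obj_pos):
    """Classify an object as front/behind and left/right of a situated agent.

    agent_yaw is the heading angle in radians in the x-y ground plane.
    Illustrative of the situated setup in SQA3D/MSR3D-style tasks only.
    """
    fwd = np.array([np.cos(agent_yaw), np.sin(agent_yaw)])    # agent forward axis
    left = np.array([-fwd[1], fwd[0]])                        # 90 deg counter-clockwise
    d = np.asarray(obj_pos[:2]) - np.asarray(agent_pos[:2])   # offset in ground plane
    lon = "front" if d @ fwd >= 0 else "behind"
    lat = "left" if d @ left >= 0 else "right"
    return lon, lat

# The same object is "left" or "right" depending on where the agent faces.
print(egocentric_relation([0, 0], 0.0, [1, 1]))      # ('front', 'left')
print(egocentric_relation([0, 0], np.pi, [1, 1]))    # ('behind', 'right')
```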

LLM-Driven 3D Generation · 15+ papers

| Model | Institute | Publication | Year |
|---|---|---|---|
| LLaMA-Mesh | THU / NVIDIA | arXiv | 2024 |
| MeshGPT | TUM | arXiv | 2023 |
| ShapeGPT | Fudan | arXiv | 2023 |
| 3D-GPT | ANU | arXiv | 2023 |
| DreamLLM | MEGVII | arXiv | 2023 |
| ChatAvatar | Deemos Tech | ACM TOG | 2023 |
| LLMR | MIT | arXiv | 2023 |
| MeshGPT-2: Scalable Autoregressive 3D Mesh Generation | | arXiv | 2025 |
| MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers | | arXiv | 2024 |
| CAD-GPT: Synthesising CAD Construction Sequences with Spatial Reasoning-Enhanced Multimodal LLMs | | NeurIPS | 2024 |
| CG-MLLM: A Multi-modal Large Language Model for 3D Captioning and Generation | | arXiv | 2026 |
| Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models | | arXiv | 2026 |
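
The autoregressive line above (MeshGPT, LLaMA-Mesh, MeshAnything) rests on serializing a mesh into discrete tokens so a standard transformer can generate it face by face. A minimal sketch of the coordinate-quantization round-trip; the bin count and token layout are illustrative assumptions (MeshGPT actually learns a VQ codebook, and LLaMA-Mesh writes coordinates as plain text):

```python
import numpy as np

def mesh_to_tokens(vertices, faces, n_bins=128):
    """Flatten a triangle mesh into a 1-D token sequence.

    Each face contributes 9 tokens: quantized (x, y, z) for its 3 vertices.
    Illustrative of MeshGPT-style serialization, not any paper's exact format.
    """
    lo, hi = vertices.min(0), vertices.max(0)
    q = np.round((vertices - lo) / (hi - lo + 1e-8) * (n_bins - 1)).astype(int)
    return q[faces].reshape(-1), (lo, hi)          # (F*9,) tokens + scale for decoding

def tokens_to_mesh(tokens, scale, n_bins=128):
    """Invert mesh_to_tokens: rebuild quantized vertex triples per face."""
    lo, hi = scale
    tri = tokens.reshape(-1, 3, 3).astype(float)   # (F, 3 verts, xyz)
    return tri / (n_bins - 1) * (hi - lo) + lo

verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
faces = np.array([[0, 1, 2], [0, 1, 3]])
tokens, scale = mesh_to_tokens(verts, faces)
print(tokens.shape)                                 # (18,): 2 faces x 9 tokens
print(tokens_to_mesh(tokens, scale)[0])             # first face, recovered coordinates
```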

3D Embodied Agents · 15+ papers

| Model | Institute | Publication | Year |
|---|---|---|---|
| RT-2 | Google DeepMind | arXiv | 2023 |
| RT-1 | Google | arXiv | 2022 |
| VoxPoser | Stanford | arXiv | 2023 |
| SayPlan | QUT | CoRL | 2023 |
| NaviLLM | CUHK | CVPR | 2024 |
| VeBrain | Shanghai AI Lab | arXiv | 2025 |
| 3DLLM-Mem | UCLA / Google | NeurIPS | 2025 |
| UniHSI | Shanghai AI Lab | arXiv | 2023 |
| Dobb-E | NYU | arXiv | 2023 |
| LLM-Planner | Ohio State | ICCV | 2023 |
| NLMap-SayCan | Google | ICRA | 2023 |
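
Most of the agents above run a perceive-plan-act loop: scene state (an object list or scene graph) is serialized into the prompt, the LLM proposes one grounded action, the environment executes it, and the agent replans on fresh observations, as in SayPlan and LLM-Planner. A skeleton of that loop; `query_llm` and the action grammar are placeholders invented for illustration, not any paper's API:

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. a hosted or local model API)."""
    return "pick(mug)"  # canned response so the sketch runs end to end

def plan_and_act(goal, scene_objects, max_steps=5):
    """SayPlan/LLM-Planner-style loop: serialize scene state, ask the LLM
    for one grounded action at a time, and replan after each observation.
    The action format here is invented for illustration."""
    history = []
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"Visible objects: {', '.join(scene_objects)}\n"
            f"Actions so far: {history or 'none'}\n"
            "Next action (e.g. pick(x), place(x, y), done()):"
        )
        action = query_llm(prompt).strip()
        if action == "done()":
            break
        history.append(action)
        # A real agent would execute the action here and refresh
        # scene_objects from new observations; we only log it.
    return history

print(plan_and_act("put the mug on the shelf", ["mug", "shelf", "table"], max_steps=1))
```

Grounding the proposed action against what is actually visible (the NLMap-SayCan step) is what keeps such loops from hallucinating objects that are not in the scene.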

3D Benchmarks · 20+ benchmarks

| Benchmark | Institute | Publication | Year |
|---|---|---|---|
| ScanQA | RIKEN AIP | CVPR | 2022 |
| SQA3D | BIGAI | ICLR | 2023 |
| ScanRefer | TUM | ECCV | 2020 |
| EmbodiedScan | Shanghai AI Lab | arXiv | 2023 |
| SceneVerse | BIGAI | ECCV | 2024 |
| MMScan | Shanghai AI Lab | arXiv | 2024 |
| 3D-GRAND | UMich | arXiv | 2024 |
| Reason3D | UC Merced | 3DV | 2025 |
| M3DBench | Fudan | arXiv | 2023 |
| Space3D-Bench | ETH Zurich | arXiv | 2024 |
| SpaCE-10 | SJTU | arXiv | 2025 |
| Hypo3D | Imperial | ICML | 2025 |
| Beacon3D | BIGAI | CVPR | 2025 |
| SPAR | Fudan | arXiv | 2025 |
| Anywhere3D | BIGAI | NeurIPS | 2025 |
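
Most of the QA-style benchmarks above (ScanQA, SQA3D, MSR3D) score predictions by exact match against reference answers, often alongside captioning metrics such as CIDEr. A tiny sketch of normalized exact match, the common denominator; the normalization rules here are illustrative, not any benchmark's official protocol:

```python
import string

def exact_match(prediction: str, answers: list[str]) -> bool:
    """Case-, whitespace-, and punctuation-insensitive exact match against
    any reference answer. Illustrative only; each benchmark defines its
    own official normalization and metric suite."""
    def norm(s: str) -> str:
        s = s.lower().strip()
        return s.translate(str.maketrans("", "", string.punctuation))
    return norm(prediction) in {norm(a) for a in answers}

preds = [("Brown chair.", ["brown chair", "armchair"]),
         ("two", ["2", "two"])]
acc = sum(exact_match(p, a) for p, a in preds) / len(preds)
print(f"EM accuracy: {acc:.2f}")   # 1.00
```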