Current Ideas in Spatial Understanding
What is good, what is not?
Computer Vision
LLMs
Papers
Robotics
A collated set of current ideas.
This post was updated with new papers on Friday, 18 July 2025, and Monday, 21 July 2025.
I went through a number of papers on enhancing spatial understanding in VLMs to get a sense of what’s going on. For each paper, I’ve simply extracted the main idea that was explored; I haven’t evaluated the worthiness or effectiveness of the mentioned ideas.
From what I’ve seen, I’ve noticed the following issues with VLMs:
- The encoders themselves don’t understand space
- VLMs are language-biased
- VLMs place attention in the wrong places
- The manner of training is important (i.e., the dataset provided and the way predictions are made)
3D Understanding and Reconstruction
- 3D geometric encoding for monocular images: Uses a 3D geometric encoder to produce 3D tokens from monocular images (RGB-only images; no depth maps). VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
- Evaluating 3D awareness in image models: Explores whether image models can “see” in 3D. Finds self-supervised models are good at single-view awareness but fail at multiview awareness. Probing the 3D awareness of Visual Foundation Models
- CLIP-based 3D-text embedding mapping: Uses CLIP to create an embedding that maps text to 3D point clouds. Unified Representation Space for 3D Visual Grounding
- Dual-encoder system for 2D semantics and 3D structure: Enhances spatial understanding with a 2D encoder (for visual semantics) and a 3D encoder (for 3D structure), eliminating the need for 3D data like depth maps, and “intelligently” selecting key frames (a minimal fusion sketch follows this list). Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
- Intermediate perception token generation: Trains a VLM to output intermediate perception tokens (e.g., depth tokens) to facilitate its reasoning process. Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
- Enhanced CLIP with intermediate layers and video support: A stronger CLIP variant that works on video, uses intermediate layers, scales well, aligns for VL and spatial tasks, and includes a dataset. Perception Encoder: The best visual embeddings are not at the output of the network
- View-aware spatial reasoning for vision-language models: Enables vision-language models to answer spatial questions about 3D scenes by generating and using imagined views from different perspectives, paired with a world model for view synthesis. Mental Exploration: Augmenting Vision-Language Models with View Synthesis for Spatial Reasoning
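To make the dual-encoder idea concrete, here is a minimal PyTorch-style sketch of how 2D semantic tokens and 3D structural tokens might be fused before being handed to the language model. The module names, tensor shapes, and the simple concatenate-and-project fusion are my own assumptions for illustration, not the Spatial-MLLM implementation.

```python
import torch
import torch.nn as nn


class DualEncoderFusion(nn.Module):
    """Fuses per-patch tokens from a 2D semantic encoder and a 3D structural
    encoder, then projects them into the language model's token space."""

    def __init__(self, encoder_2d: nn.Module, encoder_3d: nn.Module,
                 dim_2d: int, dim_3d: int, dim_llm: int):
        super().__init__()
        self.encoder_2d = encoder_2d  # e.g. a CLIP-style image encoder
        self.encoder_3d = encoder_3d  # e.g. a geometry-aware video encoder
        self.proj = nn.Linear(dim_2d + dim_3d, dim_llm)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, H, W) -- RGB only, no depth input.
        sem = self.encoder_2d(frames)   # (batch, num_tokens, dim_2d)
        geo = self.encoder_3d(frames)   # (batch, num_tokens, dim_3d)
        fused = torch.cat([sem, geo], dim=-1)
        return self.proj(fused)         # visual tokens handed to the LLM
```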
Spatial Reasoning Enhancements
- Incorporating horizontal lines and row-wise scanning: Adds horizontal lines to input images and prompts the model to scan row by row to improve spatial performance (see the sketch after this list). Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs
- Implicit Perception Loss for policy optimization: Extends GRPO by introducing “Implicit Perception Loss” to encourage the model to rely more on visual input. Perception-Aware Policy Optimization for Multimodal Reasoning
- Latent visual token-based reasoning: Enables VLMs to “imagine” scenes using latent visual tokens (without outputting pixels). Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
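As a rough illustration of the horizontal-line idea, the snippet below overlays evenly spaced guide lines on an image with PIL and pairs it with a row-by-row scanning prompt. The number of lines, their colour, and the prompt wording are placeholders I picked, not the paper’s exact setup.

```python
from PIL import Image, ImageDraw

ROW_SCAN_PROMPT = (
    "The image is divided into horizontal rows by red lines. "
    "Scan the rows one by one from top to bottom, note the objects in each "
    "row, and then answer: {question}"
)


def add_horizontal_guides(image_path: str, num_rows: int = 6) -> Image.Image:
    """Overlay evenly spaced horizontal lines so the image is split into
    explicit rows for the model to scan."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    width, height = img.size
    for i in range(1, num_rows):
        y = int(i * height / num_rows)
        draw.line([(0, y), (width, y)], fill=(255, 0, 0), width=3)
    return img
```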
Datasets and Benchmarks for Spatial Understanding
- Spatial detail-rich QnA dataset finetuning: Finetunes existing VLMs on a custom image QnA dataset containing object details (presence, depth, 3D positions) mixed into their training data. SpatialVLM: Endowing VLMs with Spatial Reasoning Capabilities
- Spatial relationship dataset for robotics: A dataset to help VLMs learn spatial relationships better. RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
- Simulated dynamic spatial dataset: A dataset produced from simulations to help VLMs improve understanding of space and movement. SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models
- Multi-modal spatial perception benchmark: Tests how well VLMs understand objects in space and their relation to other images. MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence
- Multi-image spatial intelligence benchmark: Evaluates VLM spatial understanding using multiple images of the same scene. MMSI-Bench: A benchmark for multi-image spatial intelligence
- Spatial intelligence evaluation framework: A benchmark to test VLM spatial intelligence. SITE: Towards spatial intelligence through evaluation
- Multi-view understanding evaluation: A benchmark to assess VLM performance across varied angles of the same scene, finding that VLMs struggle with this. Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
- Hierarchical spatial weakness evaluation: A framework to identify VLM weaknesses, finding VLMs struggle with depth perception, view switching (first/third-person), and physical reasoning (e.g., occlusion). SPHERE: Unveiling Spatial Blindspots in VLMs through Hierarchical Evaluation
- Depth and object annotation dataset for 3D finetuning: Creates a custom dataset to finetune a VLM on depth images and object annotations. MM-Spatial: Exploring 3D Spatial Understanding in MLLMs
- Depth map-annotated dataset finetuning: Finetunes a VLM on a custom dataset of annotated RGB images + depth maps. SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
- RGBD QnA dataset finetuning: Finetunes a VLM on a custom dataset of annotated RGB images + depth maps + QnA (a toy QnA-templating sketch follows this list). SpatialBot: Precise Spatial Understanding with Vision Language Models
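Several of the entries above build QnA pairs from depth-annotated images. The toy snippet below shows how one such sample could be templated from a metric depth map and two bounding boxes; the median-depth heuristic, field names, and question wording are illustrative assumptions, not taken from any of the cited datasets.

```python
import numpy as np


def make_depth_qa_sample(depth_map: np.ndarray,
                         box_a: tuple, box_b: tuple,
                         name_a: str, name_b: str) -> dict:
    """Build one 'which object is closer?' QnA pair from a metric depth map
    and two (x0, y0, x1, y1) bounding boxes."""
    def median_depth(box):
        x0, y0, x1, y1 = box
        return float(np.median(depth_map[y0:y1, x0:x1]))

    d_a, d_b = median_depth(box_a), median_depth(box_b)
    closer = name_a if d_a < d_b else name_b
    return {
        "question": f"Which is closer to the camera, the {name_a} or the {name_b}?",
        "answer": (f"The {closer} is closer to the camera "
                   f"({min(d_a, d_b):.2f} m vs {max(d_a, d_b):.2f} m)."),
    }
```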
Attention Mechanisms and Visual Feature Utilization
- Attention logit temperature adjustment: Guides attention to relevant image parts by rescaling the temperature of the attention logits based on how confident the model is (a toy sketch follows this list). Why is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
- Improving visual feature utilization: Highlights that VLM vision encoders outperform the VLM itself, as VLMs fail to effectively use vision encoder outputs, relying instead on language biases. Hidden in plain sight: VLMs overlook their visual representations
- Teacher-student training with refiner modules: Refines CLIP’s spatial awareness using a teacher/student model and a refiner module to remove semantic contamination (noise from irrelevant context). Refining CLIP’s spatial awareness: A Visual-centric Perspective
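Here is a toy sketch of confidence-dependent temperature scaling of attention logits: a temperature below 1 sharpens the attention distribution when the model is confident, a temperature above 1 smooths it otherwise. The thresholds, temperature values, and the way confidence would be measured are placeholders, not the paper’s actual mechanism.

```python
import torch
import torch.nn.functional as F


def confidence_scaled_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                                confidence: float,
                                t_sharp: float = 0.5, t_smooth: float = 2.0,
                                threshold: float = 0.6) -> torch.Tensor:
    """Attention with a temperature chosen from the model's decoding
    confidence. q, k, v: (batch, heads, seq_len, head_dim)."""
    temperature = t_sharp if confidence >= threshold else t_smooth
    scale = q.size(-1) ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale / temperature
    weights = F.softmax(logits, dim=-1)  # sharper when temperature < 1
    return weights @ v
```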
Multimodal and Temporal Spatial Reasoning
- Temporal event prediction and segmentation: Teaches a model to infer missing parts in video events via cause and effect, and to split videos into non-overlapping events with detailed timestamps. Tempura: Temporal event masked prediction and understanding for reasoning in action
- Multiframe depth and correspondence learning: Improves VLMs’ understanding of spatial relationships between frames by teaching depth perception, visual correspondence, and dynamic perception. Multi-SpatialMLLM: Multiframe Spatial Understanding with MLLMs
- Human motion and video finetuning: Finetunes a VLM on videos of human actions + skeleton sequences or 3D human movement models, enabling it to output text describing motions. MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Novel Prompting and Interaction Methods
- Iterative visual prompting (“hot and cold”): Elicits actionable knowledge from VLMs by having them iteratively select from the best visual suggestions (e.g., arrows/markers on images) instead of outputting actions directly (a schematic loop follows this list). PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
- VLM cropping (scanning-like behavior): Crops the image into regions and queries the VLM on each crop, reminiscent of how CNNs scan an image. VLM Cropping
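The iterative visual prompting idea can be sketched as a refine-and-resample loop. The `annotate` helper (draws numbered markers on the image) and `ask_vlm` helper (returns the index of the marker the VLM picks) are assumed callables you would supply; this is a schematic of the loop, not the PIVOT authors’ code.

```python
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # normalised (x, y) image coordinates


def pivot_loop(image, task: str,
               annotate: Callable[[object, List[Point]], object],
               ask_vlm: Callable[[object, str], int],
               num_candidates: int = 8, num_rounds: int = 3,
               radius: float = 0.25) -> Point:
    """Sample candidate points, render them as numbered markers, ask the VLM
    to pick the best marker, then resample in a shrinking neighbourhood
    around its choice."""
    candidates = [(random.random(), random.random()) for _ in range(num_candidates)]
    best = candidates[0]
    for _ in range(num_rounds):
        annotated = annotate(image, candidates)
        idx = ask_vlm(annotated, f"Which numbered marker best accomplishes: {task}?")
        best = candidates[idx]
        radius *= 0.5  # later rounds refine the earlier choice
        candidates = [(best[0] + random.uniform(-radius, radius),
                       best[1] + random.uniform(-radius, radius))
                      for _ in range(num_candidates)]
    return best
```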
Cognitive and Mental Models of Space
- Cognitive map exploration: Explores “cognitive maps” as a way for VLMs to visualize and encode space, studying how they see, remember, and recall spaces (a toy grid-map illustration follows below). Thinking in Space: How MLLMs See, Remember, and Recall Spaces
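To illustrate one form a “cognitive map” might take, here is a toy top-down grid that bins object names by their positions. The grid size, extent, and input format are assumptions made for illustration; the paper studies how MLLMs form such representations internally, and this is not its code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_cognitive_map(objects: List[Tuple[str, float, float]],
                        grid_size: int = 10,
                        extent_m: float = 10.0) -> Dict[Tuple[int, int], List[str]]:
    """Bin object names into a coarse top-down grid by their (x, z)
    positions in metres."""
    cell = extent_m / grid_size
    cells: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for name, x, z in objects:
        cells[(int(x // cell), int(z // cell))].append(name)
    return dict(cells)


# Example: three objects in a 10 m x 10 m room.
cmap = build_cognitive_map([("sofa", 1.2, 3.4), ("tv", 1.5, 3.0), ("desk", 8.0, 9.1)])
```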
Conclusion
If you have any comments, questions, suggestions, feedback, criticisms, or corrections, please post them in the comments section below!