SPACENUM: Revisiting Spatial Numerical Understanding in VLMs
Abstract
Vision-language models struggle to genuinely understand spatial numerical concepts, relying on shallow visual cues rather than developing robust coordinate-aware representations.
Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. In this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations, and we use them to systematically study whether current VLMs truly understand numerical values in spatial settings. Across dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.
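To make the two mapping directions concrete, here is a minimal toy sketch of what mapping between numbers and spatial state could look like for a dynamic transition. It is not the benchmark's actual task format, which the abstract does not specify. The 2D pose representation and the function names `apply_transition` and `recover_transition` are assumptions introduced purely for illustration.

```python
import math

# Hypothetical toy setup: a pose is (x, y, heading_in_degrees) on a 2D plane.
# None of this comes from the paper; it only illustrates the two directions.

def apply_transition(pose, turn_deg, forward):
    """Numbers -> space: turn by turn_deg, then move `forward` units along the new heading."""
    x, y, heading = pose
    heading = (heading + turn_deg) % 360.0
    rad = math.radians(heading)
    return (x + forward * math.cos(rad), y + forward * math.sin(rad), heading)

def recover_transition(before, after):
    """Space -> numbers: recover the turn angle and distance linking two poses.

    Assumes the move was a turn followed by a non-negative forward step.
    """
    turn = (after[2] - before[2]) % 360.0
    if turn > 180.0:
        turn -= 360.0  # report the signed angle in (-180, 180]
    forward = math.hypot(after[0] - before[0], after[1] - before[1])
    return turn, forward

start = (0.0, 0.0, 0.0)
end = apply_transition(start, turn_deg=90.0, forward=2.0)
turn, forward = recover_transition(start, end)
print(end, turn, forward)
```

In this toy world the two functions are exact inverses of each other. A model that grounds numbers in spatial meaning would need a comparably consistent mapping, but inferred from visual observations rather than from explicit coordinates.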
Community
🔢 Embodied AI cannot avoid numbers.
From angles to distances and coordinates, numbers are everywhere in perception and action.
But are current VLMs ready for that?
In SᴘᴀᴄᴇNᴜᴍ, we revisit this spatial numerical understanding capability!