¹ Galbot
² Peking University
³ The University of Hong Kong
⁴ Institute of Automation, Chinese Academy of Sciences
⁵ Beijing Academy of Artificial Intelligence
⁶ Xiamen University Malaysia
† Corresponding author
StereoVLA demonstrates strong robustness to different camera viewpoints, maintaining reliable performance across diverse spatial configurations.
StereoVLA leverages stereo geometric cues to achieve precise spatial perception, enabling manipulation of thin and small objects.
Stereo cameras closely mimic human binocular vision, providing rich spatial cues critical for precise robotic manipulation. Despite this advantage, stereo vision remains underexplored in vision-language-action models (VLAs). In this work, we present StereoVLA, a VLA model that leverages the rich geometric cues offered by stereo vision. We propose a novel Geometric-Semantic Feature Extraction module that uses vision foundation models to extract and fuse two kinds of features: 1) geometric features derived from subtle stereo-view differences for spatial perception; and 2) semantically rich features from the monocular view for instruction following. We further propose an auxiliary Interaction-Region Depth Estimation task that enhances spatial perception and accelerates model convergence. Extensive experiments show that our approach outperforms baselines by a large margin across diverse tasks under the stereo setting and demonstrates strong robustness to camera pose variations.
We provide real-world zero-shot execution videos of StereoVLA that showcase its precise perception, manipulation, and robustness.
In StereoVLA, a stereo image pair is encoded by the Geometric-Semantic Feature Extraction module into visual tokens that combine geometric precision with semantic richness. Together with language tokens, these are processed by a large language model backbone (InternLM-1.8B). An action expert then predicts delta end-effector poses, while an auxiliary depth estimation task further strengthens geometry learning during training. Within the Geometric-Semantic Feature Extraction module, geometric features are extracted with FoundationStereo (bypassing its disparity-prediction components for efficiency) and semantically rich features with SigLIP and DINOv2; the two are then fused into a unified visual representation by an MLP projector.
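To make the fusion step concrete, below is a minimal PyTorch sketch of how per-patch geometric and semantic features could be concatenated and projected into visual tokens. The module and variable names (`GeometricSemanticFusion`, `geo_feat`, `sem_feat`) and all feature dimensions are illustrative assumptions, not the released implementation; the encoders themselves (FoundationStereo's feature backbone, SigLIP, DINOv2) are assumed to run upstream and produce the input tensors.

```python
import torch
import torch.nn as nn

class GeometricSemanticFusion(nn.Module):
    """Hypothetical sketch of the Geometric-Semantic Feature Extraction fusion step.

    Inputs are assumed to come from frozen vision foundation models:
    geometric features from FoundationStereo's backbone (its
    disparity-prediction head bypassed) and semantic features from
    SigLIP + DINOv2 on the monocular view. Dimensions are placeholders.
    """

    def __init__(self, geo_dim=256, sem_dim=1152 + 768, token_dim=2048):
        super().__init__()
        # MLP projector: fuse per-patch geometric + semantic features
        # into tokens sized for the LLM backbone.
        self.projector = nn.Sequential(
            nn.Linear(geo_dim + sem_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, geo_feat, sem_feat):
        # geo_feat: (B, N, geo_dim)  stereo-matching features over both views
        # sem_feat: (B, N, sem_dim)  concatenated SigLIP + DINOv2 features
        fused = torch.cat([geo_feat, sem_feat], dim=-1)
        return self.projector(fused)  # (B, N, token_dim) visual tokens
```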
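Similarly, a hedged sketch of how the auxiliary Interaction-Region Depth Estimation objective might be combined with the action loss during training; the specific loss forms, the interaction-region mask semantics, and the weighting factor `lam` are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def training_loss(pred_actions, gt_actions,
                  pred_depth, gt_depth, region_mask, lam=0.5):
    """Assumed joint objective: action prediction plus auxiliary depth.

    region_mask is a binary tensor (same shape as the depth maps) that is
    1 inside the interaction region, so depth supervision is restricted
    to the area where the gripper interacts with the object.
    """
    # Action expert supervises delta end-effector poses (MSE is an assumption).
    action_loss = F.mse_loss(pred_actions, gt_actions)

    # Depth error computed densely, then averaged over the interaction region only.
    depth_err = F.l1_loss(pred_depth, gt_depth, reduction="none")
    depth_loss = (depth_err * region_mask).sum() / region_mask.sum().clamp(min=1)

    return action_loss + lam * depth_loss
```

Restricting depth supervision to the interaction region, as sketched here, would focus the geometric signal on the area that matters for manipulation and is one plausible reading of why the auxiliary task accelerates convergence.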
If you have any questions, please feel free to contact Shengliang Deng at sldeng@cs.hku.hk, Mi Yan at dorisyan@pku.edu.cn, or Yixin Zheng at zhengyixin2025@ia.ac.cn.