StereoVLA: Enhancing Vision-Language-Action Models with Stereo Vision


Shengliang Deng1,3*    Mi Yan1,2*    Yixin Zheng1,4,5*    Jiayi Su1,6    WenHao Zhang1,2    Xiaoguang Zhao4   
Heming Cui3    Zhizheng Zhang1,5†    He Wang1,2,5†

1Galbot    2Peking University    3The University of Hong Kong    4Institute of Automation, Chinese Academy of Sciences    5Beijing Academy of Artificial Intelligence    6Xiamen University Malaysia   

*Equal contribution    †Corresponding authors

Highlights


💪 Robustness to Camera Pose Variations

StereoVLA demonstrates strong robustness to different camera viewpoints, maintaining reliable performance across diverse spatial configurations.

[Videos, 3x speed: model-input views and executions across the camera pose randomization range]
🚀 Precise Manipulation of Thin and Small Objects

StereoVLA leverages stereo geometric cues to achieve precise spatial perception, enabling manipulation of thin and small objects.

[Video, 3x speed: thin and small objects manipulation]



Abstract


Stereo cameras closely mimic human binocular vision, providing rich spatial cues critical for precise robotic manipulation. Despite this advantage, the adoption of stereo vision in vision-language-action (VLA) models remains underexplored. In this work, we present StereoVLA, a VLA model that leverages rich geometric cues from stereo vision. We propose a novel Geometric-Semantic Feature Extraction module that utilizes vision foundation models to extract and fuse two key features: 1) geometric features, derived from subtle stereo-view differences, for spatial perception; and 2) semantic-rich features, from the monocular view, for instruction following. Additionally, we propose an auxiliary Interaction-Region Depth Estimation task to further enhance spatial perception and accelerate model convergence. Extensive experiments show that our approach outperforms baselines by a large margin on diverse tasks under the stereo setting and demonstrates strong robustness to camera pose variations.


Zero-Shot Evaluation


We provide real-world zero-shot execution videos of StereoVLA to demonstrate its precise perception, manipulation, and robustness.

1. StereoVLA is able to perform precise pick-and-place tasks.
[Videos, 3x speed]


2. StereoVLA is able to manipulate thin and small objects that require precise spatial perception.
[Videos, 3x speed: thin and small objects manipulation]

3. StereoVLA is insensitive to camera pose variations.
[Videos, 3x speed: model-input views and corresponding executions under varied camera poses]

4. StereoVLA is capable of pick-and-place skills involving diverse objects under different backgrounds.
[Videos, 3x speed]

5. StereoVLA is robust to dynamic distractions and lighting variations.
[Videos, 3x speed]

Model


In StereoVLA, a stereo image pair is encoded by the Geometric-Semantic Feature Extraction module to generate visual tokens with geometric precision and semantic richness. Together with language tokens, they are processed by a large language model backbone (InternLM-1.8B). An action expert predicts delta end-effector poses, while an auxiliary depth estimation task further enhances geometry learning during training. The Geometric-Semantic Feature Extraction module extracts geometric features with FoundationStereo (bypassing disparity prediction components for efficiency) and semantic-rich features with SigLIP and DINOv2, then fuses them into a unified visual representation with an MLP projector.
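To make the fusion step concrete, below is a minimal PyTorch sketch under stated assumptions: `geo_feat` stands for FoundationStereo trunk features taken before the disparity head, `sem_feat` for concatenated SigLIP and DINOv2 features of the left view, and the feature widths, token layout, and two-layer MLP projector are illustrative guesses, not the released implementation.

```python
import torch
import torch.nn as nn

class GeometricSemanticFusion(nn.Module):
    """Illustrative sketch of the Geometric-Semantic Feature Extraction fusion.

    Assumed (not from the paper): geo_dim, sem_dim, llm_dim, and the
    two-layer MLP projector; the actual design may differ.
    """

    def __init__(self, geo_dim=256, sem_dim=1152 + 768, llm_dim=2048):
        super().__init__()
        # MLP projector mapping the concatenated features to the
        # LLM token width (InternLM-1.8B uses a 2048-d hidden state).
        self.projector = nn.Sequential(
            nn.Linear(geo_dim + sem_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, geo_feat, sem_feat):
        # geo_feat: (B, N, geo_dim) stereo correlation features from the
        #   FoundationStereo trunk, with its disparity head bypassed.
        # sem_feat: (B, N, sem_dim) concatenated SigLIP + DINOv2 features
        #   of the left (monocular) view, resampled to the same N tokens.
        fused = torch.cat([geo_feat, sem_feat], dim=-1)
        return self.projector(fused)  # (B, N, llm_dim) visual tokens
```

Per-token concatenation followed by an MLP keeps the two feature streams spatially aligned; cross-attention fusion would be a plausible alternative if the geometric and semantic token grids differed in resolution.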

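The auxiliary Interaction-Region Depth Estimation task can be pictured the same way: a small depth head reads the fused visual tokens and is supervised only where the interaction happens. The `depth_head`, the boolean `region_mask`, and the L1 objective below are assumptions for illustration; the paper's exact target and loss may differ.

```python
import torch
import torch.nn.functional as F

def interaction_region_depth_loss(visual_tokens, gt_depth, region_mask, depth_head):
    """Hypothetical auxiliary loss for Interaction-Region Depth Estimation.

    visual_tokens: (B, N, D) fused tokens from the module above.
    gt_depth:      (B, N) per-token ground-truth depth.
    region_mask:   (B, N) bool, True for tokens in the interaction region.
    depth_head:    e.g. a Linear(D, 1) head, trained jointly with the policy.
    """
    depth_pred = depth_head(visual_tokens).squeeze(-1)  # (B, N)
    # Supervise only interaction-region tokens so geometric learning
    # concentrates where the gripper actually operates.
    return F.l1_loss(depth_pred[region_mask], gt_depth[region_mask])
```

During training, the total objective would then look like `loss = action_loss + lambda_depth * interaction_region_depth_loss(...)` (weight `lambda_depth` is hypothetical), with the auxiliary term dropped at inference time.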

Contact


If you have any questions, please feel free to contact Shengliang Deng at sldeng@cs.hku.hk, Mi Yan at dorisyan@pku.edu.cn, or Yixin Zheng at zhengyixin2025@ia.ac.cn.