BridgeVLA
Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models.
Overview
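BridgeVLA pre-trains a VLM backbone to take 2D images as input and produce 2D heatmaps as output, then fine-tunes it for 3D manipulation by projecting point clouds into multi-view images, so that inputs and outputs stay in the same 2D image space. This alignment enables efficient 3D manipulation learning from little training data.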
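The projection and heatmap idea can be illustrated with a minimal sketch. This is not the BridgeVLA implementation. It assumes orthographic views along the three workspace axes, a cubic workspace of [-1, 1] per axis, and a simple arg-max over per-view heatmaps to recover a 3D position. The names VIEWS, project, peak, and lift_to_3d are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Image = List[List[Optional[float]]]

# Hypothetical view table (an assumption, not taken from BridgeVLA):
# (u_axis, v_axis, depth_axis) as indices into (x, y, z).
VIEWS: Dict[str, Tuple[int, int, int]] = {
    "top": (0, 1, 2),    # image axes are x and y, depth is z
    "front": (0, 2, 1),  # image axes are x and z, depth is y
    "right": (1, 2, 0),  # image axes are y and z, depth is x
}


def to_pixel(coord: float, size: int, lo: float, hi: float) -> int:
    """Map a workspace coordinate in [lo, hi] to a pixel index in [0, size - 1]."""
    return min(int((coord - lo) * size / (hi - lo)), size - 1)


def to_coord(pixel: int, size: int, lo: float, hi: float) -> float:
    """Map a pixel index back to the workspace coordinate of its center."""
    return lo + (pixel + 0.5) * (hi - lo) / size


def project(points: Sequence[Point], view: str, size: int = 8,
            lo: float = -1.0, hi: float = 1.0) -> Image:
    """Rasterize a point cloud into a size x size orthographic depth image.

    Empty pixels stay None. If several points hit the same pixel, the one
    with the largest depth-axis coordinate is kept.
    """
    u_ax, v_ax, d_ax = VIEWS[view]
    image: Image = [[None] * size for _ in range(size)]
    for p in points:
        if not all(lo <= c <= hi for c in p):
            continue  # skip points outside the workspace cube
        u = to_pixel(p[u_ax], size, lo, hi)
        v = to_pixel(p[v_ax], size, lo, hi)
        current = image[v][u]
        if current is None or p[d_ax] > current:
            image[v][u] = p[d_ax]
    return image


def peak(heatmap: List[List[float]]) -> Tuple[int, int]:
    """Return the (u, v) pixel holding the largest heatmap value."""
    v, u = max(
        ((v, u) for v in range(len(heatmap)) for u in range(len(heatmap[v]))),
        key=lambda vu: heatmap[vu[0]][vu[1]],
    )
    return u, v


def lift_to_3d(top: List[List[float]], front: List[List[float]],
               lo: float = -1.0, hi: float = 1.0) -> Point:
    """Combine the top-view and front-view heatmap peaks into one 3D point.

    Assumed scheme: the top view supplies x and y, the front view supplies z.
    """
    size = len(top)
    u_top, v_top = peak(top)    # top view: u = x, v = y
    _, v_front = peak(front)    # front view: u = x, v = z
    return (
        to_coord(u_top, size, lo, hi),
        to_coord(v_top, size, lo, hi),
        to_coord(v_front, size, lo, hi),
    )
```

In this sketch the model's input (projected views) and output (heatmaps) share the same pixel grid, so a heatmap peak maps directly back to a workspace coordinate. A real pipeline would typically use calibrated cameras or a learned renderer instead of this fixed axis-aligned rasterization.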
Benchmarks
- RLBench: 88.2% average success rate (up from 81.4% for the prior state-of-the-art baseline)
- COLOSSEUM: 64.0% average success rate
- 10+ tasks: 95.4% success rate with only 3 trajectories per task
Official Links
- bridgevla.github.io — Project site
- OpenReview — NeurIPS 2025 paper
Citation
NeurIPS 2025. See the project site for BibTeX.