BridgeVLA

Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models.

Overview

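BridgeVLA pre-trains a vision-language model (VLM) backbone to take 2D images as input and predict 2D heatmaps as output. During fine-tuning, 3D point clouds are projected into multi-view 2D images, so the inputs and outputs stay in the same 2D image space used in pre-training. This alignment enables efficient 3D manipulation learning with minimal data.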
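To make the projection idea concrete, here is a minimal sketch of how a point cloud might be rendered into orthographic multi-view depth images, and how a heatmap peak could be mapped back to a workspace coordinate. It is an illustration only: the view names, the unit-cube workspace, the 8-pixel resolution, and every function name below are assumptions and are not taken from the BridgeVLA code. Plain Python lists are used to keep the example dependency-free; an actual pipeline would likely use array libraries and a proper renderer.

```python
# Illustrative sketch only: the views, workspace bounds, and all names here
# are assumptions, not the BridgeVLA implementation.

# Each view keeps two axes as image axes (column, row) and drops the third,
# which is stored as the pixel's depth value.
VIEWS = {
    "top": (0, 1, 2),    # columns follow x, rows follow y, depth along z
    "front": (0, 2, 1),  # columns follow x, rows follow z, depth along y
    "right": (1, 2, 0),  # columns follow y, rows follow z, depth along x
}


def to_pixel(value, lo, hi, resolution):
    """Map a coordinate in [lo, hi] to a pixel index in [0, resolution - 1]."""
    index = int((value - lo) / (hi - lo) * resolution)
    return min(max(index, 0), resolution - 1)


def pixel_to_coord(index, lo, hi, resolution):
    """Map a pixel index back to the coordinate at the center of that pixel."""
    return lo + (index + 0.5) / resolution * (hi - lo)


def project_orthographic(points, view, resolution=8, lo=0.0, hi=1.0):
    """Render one orthographic depth image (a list of rows) from 3D points.

    Empty pixels hold None. Occupied pixels hold the largest coordinate seen
    along the dropped axis, i.e. the point closest to a camera looking down
    that axis from its positive end.
    """
    u_axis, v_axis, depth_axis = VIEWS[view]
    image = [[None] * resolution for _ in range(resolution)]
    for point in points:
        if not all(lo <= c <= hi for c in point):
            continue  # ignore points outside the workspace cube
        col = to_pixel(point[u_axis], lo, hi, resolution)
        row = to_pixel(point[v_axis], lo, hi, resolution)
        depth = point[depth_axis]
        if image[row][col] is None or depth > image[row][col]:
            image[row][col] = depth
    return image


def heatmap_peak(heatmap):
    """Return the (row, col) position of the largest value in a 2D heatmap."""
    cells = ((r, c) for r in range(len(heatmap)) for c in range(len(heatmap[0])))
    return max(cells, key=lambda rc: heatmap[rc[0]][rc[1]])


if __name__ == "__main__":
    cloud = [(0.1, 0.2, 0.9), (0.1, 0.2, 0.3), (0.7, 0.5, 0.4)]
    for name in VIEWS:
        image = project_orthographic(cloud, name)
        occupied = sum(1 for row in image for cell in row if cell is not None)
        print(name, "occupied pixels:", occupied)
```

In this sketch, a predicted heatmap on the "top" view would give x and y through `heatmap_peak` and `pixel_to_coord`, while the "front" or "right" view would supply z. How the actual model combines views is not specified here.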

Benchmarks

RLBench: 88.2% success rate (up from 81.4%)
COLOSSEUM: 64.0% success rate
10+ tasks: 95.4% success rate with only 3 trajectories per task

Official Links

bridgevla.github.io — Project site
OpenReview — NeurIPS 2025 paper

Citation

NeurIPS 2025. See the project site for BibTeX.