BridgeVLA

Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models.

Overview

BridgeVLA pre-trains a VLM backbone to take 2D images as input and produce 2D heatmaps as output. It then fine-tunes the model with point clouds projected into multi-view images, so that inputs and outputs share the same 2D image space. This input-output alignment enables 3D manipulation learning from very little data.
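As a rough illustration of the projection step, the sketch below renders a colored point cloud into three orthographic views by dropping one coordinate axis per view. The function names, the choice of three axis-aligned views, the workspace bounds, and the image resolution are all assumptions made for illustration; they are not taken from the BridgeVLA code.

```python
import numpy as np

def project_orthographic(points, colors, axis, resolution=224, bounds=(-1.0, 1.0)):
    """Render one orthographic view by dropping coordinate `axis`.

    Hypothetical helper: assumes the camera looks along -axis, so points
    with a larger coordinate on `axis` are nearer and occlude the others.
    """
    lo, hi = bounds
    keep = [a for a in range(3) if a != axis]
    # Map the two remaining coordinates to integer pixel indices.
    uv = (points[:, keep] - lo) / (hi - lo) * (resolution - 1)
    uv = np.clip(np.round(uv).astype(int), 0, resolution - 1)
    flat = uv[:, 1] * resolution + uv[:, 0]
    # Sort nearest-first, then keep the first point that lands on each pixel.
    order = np.argsort(-points[:, axis], kind="stable")
    _, first = np.unique(flat[order], return_index=True)
    winners = order[first]
    image = np.zeros((resolution, resolution, 3), dtype=np.float32)
    image[uv[winners, 1], uv[winners, 0]] = colors[winners]
    return image

def render_views(points, colors, resolution=224):
    """Stack the three axis-aligned views into a (3, H, W, 3) array."""
    return np.stack([
        project_orthographic(points, colors, axis, resolution)
        for axis in range(3)
    ])

# Two points on the same line of sight along z: the nearer (red) one is kept.
points = np.array([[0.0, 0.0, 0.5], [0.0, 0.0, -0.5]])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], dtype=np.float32)
views = render_views(points, colors, resolution=5)
```

Each rendered view can then be fed to an image-based backbone, and a 2D heatmap predicted on a view refers to the same pixel grid as the input it was computed from.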

Benchmarks

  • RLBench: 88.2% average success rate (up from 81.4%)
  • COLOSSEUM: 64.0% average success rate
  • 10+ tasks: 95.4% success rate with only 3 trajectories per task

Official Links

Citation

NeurIPS 2025. See the project site for BibTeX.