InternVLA-M1
A spatially guided vision-language-action framework for generalist robot policies, from Shanghai AI Lab.
Overview
InternVLA-M1 uses a two-stage pipeline: (1) spatial grounding pre-training on 2.3M samples to learn "where to act," and (2) spatially guided action post-training to learn "how to act." The framework is modular and extensible, with dual supervision across both stages.
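To make the two-stage idea concrete, here is a minimal PyTorch sketch, not the repository's actual implementation: `SpatialVLA`, its `ground_head`/`action_head` split, and the `dual_loss` weighting are all hypothetical names invented for illustration. What it demonstrates is the recipe described above: a stage-1 grounding output ("where") conditions the stage-2 action prediction ("how"), and both heads receive supervision.

```python
# Hypothetical sketch of InternVLA-M1's two-stage recipe; every name and
# signature here is an illustrative assumption, not the repository's real API.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class Observation:
    image: torch.Tensor   # (3, H, W) camera frame
    instruction: str      # natural-language task (ignored by this toy encoder)


class SpatialVLA(nn.Module):
    """Toy stand-in for a spatially guided vision-language-action model."""

    def __init__(self, embed_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.encoder = nn.Linear(3 * 224 * 224, embed_dim)       # placeholder VLM
        self.ground_head = nn.Linear(embed_dim, 2)               # "where": 2D point
        self.action_head = nn.Linear(embed_dim + 2, action_dim)  # "how": action

    def forward(self, obs: Observation):
        feat = self.encoder(obs.image.flatten())
        point = self.ground_head(feat)                        # stage 1: where to act
        action = self.action_head(torch.cat([feat, point]))   # stage 2: how to act
        return point, action


def dual_loss(point, action, point_gt, action_gt, w_ground: float = 0.5):
    """Dual supervision: grounding loss plus action loss, with an assumed weight."""
    ground_loss = nn.functional.mse_loss(point, point_gt)
    action_loss = nn.functional.mse_loss(action, action_gt)
    return w_ground * ground_loss + action_loss


# Example forward pass on a dummy observation.
obs = Observation(image=torch.randn(3, 224, 224), instruction="pick up the mug")
point, action = SpatialVLA()(obs)
```

Feeding the predicted point into the action head is the sketch's way of expressing "spatially guided": the action is explicitly conditioned on where the model decided to act.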
Benchmarks
- SimplerEnv: 71.7% on WidowX; Google Robot 76.0% (visual matching) and 80.7% (variant aggregation)
- LIBERO: 95.9% success rate
- Synthetic co-training adds +14.6% on SimplerEnv and +20.6% on unseen objects
Official Links
- internrobotics.github.io/internvla-m1 — Project site
- github.com/InternRobotics/InternVLA-M1 — Code (MIT)
- Hugging Face: InternRobotics — Models & datasets
Citation
See the project site for BibTeX and paper references.