VLA & VLM
Vision-Language-Action and Vision-Language Models — language-conditioned robot control.
What Are VLA and VLM?
VLM (Vision-Language Model) — Multimodal models that understand both images and text. Used for captioning, visual question answering (VQA), and grounding.
VLA (Vision-Language-Action) — VLMs extended to output robot actions. They take camera images and a language instruction and output control commands (e.g., joint positions or gripper state), enabling "pick up the red block"-style control.
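A minimal sketch of that interface, assuming a hypothetical `VLAPolicy` class and placeholder `camera`/`robot` objects (none of these names come from a real library); a real VLA replaces the zero-action stub with a pretrained model:

```python
import numpy as np

class VLAPolicy:
    """Stand-in for a pretrained vision-language-action model (hypothetical)."""

    def predict_action(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # A real VLA runs the image and instruction through a VLM backbone and
        # decodes a discretized or continuous action; this stub returns zeros.
        return np.zeros(7)  # e.g., 6-DoF end-effector delta + gripper command

def control_loop(policy: VLAPolicy, camera, robot, instruction: str, steps: int = 100):
    """Closed-loop execution: re-query the policy each step with a fresh image."""
    for _ in range(steps):
        image = camera.read()                         # HxWx3 RGB observation
        action = policy.predict_action(image, instruction)
        robot.apply_action(action)                    # send joint/gripper command
```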
Key Models
- OpenVLA — 7B open-source VLA trained on 970K robot demonstrations (see the loading sketch after this list)
- RT-2 / RT-X — Google's VLA family
- Octo — Transformer-based generalist policy with a diffusion action head and language conditioning
- RoboFlamingo — OpenFlamingo-based VLM adapted for robot manipulation
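The sketch below shows how OpenVLA can be queried through Hugging Face transformers. The model ID, prompt template, and `unnorm_key` value follow the pattern published in the OpenVLA repository; verify them against the current release, and treat the image path and instruction as placeholders.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b"  # per the OpenVLA release; check for updates

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("frame.png")  # current camera frame (placeholder path)
instruction = "pick up the red block"
prompt = f"In: What action should the robot take to {instruction}?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# `unnorm_key` selects the dataset statistics used to un-normalize the
# predicted action (BridgeData V2 key shown here); result is a 7-D vector.
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
print(action)
```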
Related Resources
- Open-Source VLA & VLM Models — Full catalog with links
- Datasets — Language-labeled manipulation data