Robot Learning Benchmarks

Standardized evaluation for robot manipulation: RLBench, LIBERO, CALVIN, and more. Compare models by success rate, task completion, and related evaluation metrics.

Task suites for reproducible simulation-first evaluation.

Benchmarks focused on embodied deployment and robustness.

Benchmarks that stress instruction grounding and task composition.

Evaluation

Filter benchmark suites by environment and evaluation focus.

RLBench: 100+ manipulation tasks in PyRep. Widely used for VLA evaluation. BridgeVLA 88.2%, InternVLA 95%+ on subsets.

LIBERO: lifelong learning benchmark. 130 tasks, spatial/object/goal suites. Built on robosuite. 95.9% SOTA (InternVLA).

CALVIN (Composing Actions from Language and Vision): long-horizon, language-conditioned manipulation. RoboFlamingo is a strong baseline.

Real-world manipulation. 700+ tasks. WidowX, various embodiments. Success rate, multi-task evaluation.

Large-scale real-robot benchmark. Diverse tasks, environments. BridgeVLA 64%.
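The benchmarks above report success rate, often over many tasks. As a rough sketch of how such a figure might be aggregated, the snippet below computes a per-task success rate, a macro average (mean of per-task rates), and a micro average (pooled over all episodes). The function name, the episode format, and the task names are hypothetical and not taken from any benchmark's official tooling; each suite defines its own protocol, so published numbers may be computed differently.

```python
from collections import defaultdict


def success_rates(episodes):
    """Aggregate binary episode outcomes into success rates.

    episodes: iterable of (task_name, succeeded) pairs (hypothetical format).
    Returns (per_task, macro, micro).
    """
    counts = defaultdict(lambda: [0, 0])  # task -> [successes, trials]
    for task, succeeded in episodes:
        counts[task][0] += int(bool(succeeded))
        counts[task][1] += 1

    if not counts:
        raise ValueError("no episodes to evaluate")

    # Success rate of each task on its own.
    per_task = {task: s / n for task, (s, n) in counts.items()}
    # Macro average: every task counts equally, regardless of episode count.
    macro = sum(per_task.values()) / len(per_task)
    # Micro average: every episode counts equally.
    total_successes = sum(s for s, _ in counts.values())
    total_trials = sum(n for _, n in counts.values())
    micro = total_successes / total_trials
    return per_task, macro, micro


# Illustrative outcomes only; not real benchmark results.
episodes = [
    ("open_drawer", True), ("open_drawer", False),
    ("pick_cup", True), ("pick_cup", True),
    ("pick_cup", True), ("pick_cup", False),
]
per_task, macro, micro = success_rates(episodes)
print(per_task, macro, micro)
```

The macro and micro averages differ whenever tasks have unequal episode counts, so it helps to check which one a reported number uses before comparing models across suites.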

Linked Assets

Benchmarks are grouped for apples-to-apples performance checks.

Evaluate both controlled and deployment-oriented settings.

Each benchmark path links to compatible model families.

Support for data capture and evaluation operations when needed.

We provide data collection and real-world evaluation support.
