Robot Learning Benchmarks
Standardized evaluation for robot manipulation — RLBench, LIBERO, CALVIN, and more. Success rates, task completion, evaluation metrics.
Benchmarks for Manipulation
RLBench (Simulation)
100+ manipulation tasks built on PyRep (CoppeliaSim). Widely used for VLA evaluation; reported results include BridgeVLA at 88.2% and InternVLA at 95%+ on task subsets.
LIBERO (Simulation)
Lifelong-learning benchmark of 130 tasks organized into spatial, object, and goal suites, built on robosuite. Reported state of the art: 95.9% (InternVLA).
CALVIN (Simulation)
Composing Actions from Language and Vision: a long-horizon, language-conditioned manipulation benchmark. RoboFlamingo is a strong baseline.
Google Robot Benchmark (Real Robot)
Real-world manipulation spanning 700+ tasks on WidowX and other embodiments; evaluated by success rate, including multi-task settings.
COLOSSEUM (Real Robot)
Large-scale real-robot benchmark with diverse tasks and environments. Reported: BridgeVLA at 64%.
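The headline number all of these benchmarks report is episode success rate. A minimal sketch of the common rollout-and-score loop, using a hypothetical `env`/`policy` interface (these names and the `step` signature are placeholders, not any benchmark's real API):

```python
def evaluate(env, policy, num_episodes=50, max_steps=200):
    """Roll out `policy` in `env`; return the fraction of successful episodes."""
    successes = 0
    for _ in range(num_episodes):
        obs = env.reset()
        success = False
        for _ in range(max_steps):
            action = policy(obs)                   # e.g. a VLA conditioned on image + instruction
            obs, done, success = env.step(action)  # hypothetical step signature
            if done:
                break
        successes += int(success)
    return successes / num_episodes
```

Real harnesses differ mainly in what counts as `success` (a simulator predicate in RLBench or LIBERO, human judgment on real robots) and in how episodes are sampled across tasks.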
Suggested Models & Datasets
Comparable Metrics
Benchmarks are grouped for apples-to-apples performance checks.
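One concrete way to keep comparisons apples-to-apples is to macro-average per-task success rates within each suite before comparing across suites. A small illustrative helper (the demo numbers are made up, not real benchmark results):

```python
from statistics import mean

def aggregate(results):
    """Average per-task success rates within each suite, plus an overall mean.

    `results` maps suite name -> {task name: success rate in [0, 1]}.
    """
    per_suite = {suite: mean(tasks.values()) for suite, tasks in results.items()}
    overall = mean(per_suite.values())  # macro-average: each suite weighted equally
    return per_suite, overall

# Illustrative numbers only:
demo = {
    "spatial": {"task_a": 0.9, "task_b": 0.7},
    "object": {"task_c": 0.8, "task_d": 1.0},
}
per_suite, overall = aggregate(demo)  # per_suite ≈ {"spatial": 0.8, "object": 0.9}; overall ≈ 0.85
```

Macro-averaging keeps a 10-task suite from drowning out a 100-task suite; micro-averaging over all episodes is the other common choice, and papers do not always say which they use.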
Real vs Sim Coverage
Evaluate both controlled and deployment-oriented settings.
Model Mapping
Each benchmark path links to compatible model families.
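Using only the benchmark-model pairings mentioned in the cards above, such a mapping might be sketched as a simple lookup (illustrative, not an authoritative compatibility matrix):

```python
# Pairings taken only from the benchmark cards above; illustrative only.
COMPATIBLE_MODELS = {
    "RLBench": ["BridgeVLA", "InternVLA"],
    "LIBERO": ["InternVLA"],
    "CALVIN": ["RoboFlamingo"],
    "COLOSSEUM": ["BridgeVLA"],
}

def models_for(benchmark):
    """Return model families reported on `benchmark`, or [] if unknown."""
    return COMPATIBLE_MODELS.get(benchmark, [])
```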
Execution Support
Tooling support for data capture and evaluation runs when needed.