Our world model is an action-conditioned video prediction model: given the initial frame and a sequence of actions, it predicts future frames purely in pixel space, without any physics engine. Here you can interact with it directly. Select one of the four tasks below, click Connect, and use your keyboard to control the robot end-effectors. Note: the cloud server may be laggy or unavailable due to resource limits. For a smooth, low-latency experience, consider running the server locally and using Connect Locally.
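To make the interface concrete, the interaction loop can be sketched as follows. This is a minimal, illustrative sketch: the class and function names are hypothetical, and the placeholder `step` just perturbs pixels, standing in for the learned network that predicts the next frame in pixel space.

```python
import numpy as np

class ActionConditionedWorldModel:
    """Illustrative stand-in for an action-conditioned video predictor.

    The real model is a learned network; this placeholder applies a
    trivial per-action brightness shift so the interface is runnable.
    """

    def step(self, frame: np.ndarray, action: np.ndarray) -> np.ndarray:
        # A learned model would predict the next frame in pixel space.
        return np.clip(frame + action.mean(), 0.0, 1.0)

def rollout(model, initial_frame, actions):
    """Autoregressively predict frames from an initial frame and actions."""
    frames = [initial_frame]
    for a in actions:
        frames.append(model.step(frames[-1], a))
    return frames

# Example: 8 small actions applied to an initial 64x64 RGB frame.
model = ActionConditionedWorldModel()
frame0 = np.zeros((64, 64, 3))
actions = [np.full(2, 0.01) for _ in range(8)]
video = rollout(model, frame0, actions)
print(len(video))  # 9 frames: the initial frame plus one per action
```

The key point of the sketch is the autoregressive structure: each predicted frame is fed back as input for the next step, so no physics engine or real camera is involved after the initial frame.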
Our Interactive World Simulator supports diverse robot manipulation tasks, ranging from rigid-body contact (T pushing) to deformable object manipulation (rope routing) and object pile manipulation (sweeping objects into a bin). Each video below shows a live interaction driven by a teleoperation device. Note that the videos on the monitor and on the left side come from open-loop live prediction. They are NOT from real cameras: the frames are generated entirely by our interactive world simulator in response to the action sequences.
Collecting real-world robot demonstrations is expensive and time-consuming. Our world simulator enables data collection using only teleoperation devices. But the question is: can generated demonstrations actually substitute for real-world demonstrations when training imitation learning policies? Our answer is YES!
We train imitation learning policies using 100% generated data and 0% real data. The following videos show autonomous imitation policy rollouts in the real world. They demonstrate that policies trained purely on generated data transfer to the real world and remain robust under continuous perturbations.
To quantitatively study the generated data quality, we benchmark four imitation learning policies, including DP, ACT, π0, and π0.5, across five manipulation settings: T pushing in MuJoCo simulation, T pushing in the real world, pile sweeping, mug grasping, and rope routing. For all experiments, we fix the total dataset size at 100 demonstration episodes and vary the proportion of generated versus real-world data in the training set, sweeping the full spectrum from 0% to 100% world-model data. Performance remains comparable across the entire mixture spectrum, indicating that our world-model data matches real-world data in quality for imitation policy training.
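The mixture sweep above can be described with a small helper. This is an illustrative sketch, not our data pipeline; the function name, episode identifiers, and pool sizes are hypothetical.

```python
import random

def mix_dataset(real_episodes, generated_episodes,
                generated_fraction, total=100, seed=0):
    """Build a fixed-size training set with a given fraction of
    world-model-generated episodes; the rest are real episodes."""
    rng = random.Random(seed)
    n_gen = round(total * generated_fraction)
    n_real = total - n_gen
    gen = rng.sample(generated_episodes, n_gen)
    real = rng.sample(real_episodes, n_real)
    return gen + real

# Hypothetical episode pools.
real = [f"real_{i}" for i in range(200)]
gen = [f"gen_{i}" for i in range(200)]

# Sweep the mixture from 0% to 100% generated data at fixed size 100.
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    ds = mix_dataset(real, gen, frac)
    print(frac, len(ds))  # total size stays 100 at every mixture ratio
```

Holding the total episode count fixed is what isolates data *quality* from data *quantity*: any performance difference across the sweep is attributable to the source of the demonstrations, not the amount.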
Our world model can serve as a scalable, reproducible, and faithful policy evaluator. The following qualitative videos show that a policy that succeeds in our world simulator also succeeds on the real robot, while a policy that fails in the world model also fails in the real world. Across both good and bad policies, behavior in the world model closely mirrors behavior in the real world, demonstrating a strong correlation between the two.
Real World
World Model
Real World
World Model
Real World
World Model
Real World
World Model
Real World
World Model
Real World
World Model
Real World
World Model
Real World
World Model
To quantitatively verify this, we evaluate four policies (DP, ACT, π0, π0.5) across four tasks, measuring task scores both inside the world model and on the real robot, under identical conditions for all policies and checkpoints. To keep comparisons fair, we use the same initial configurations in the real world and in the world model. In the following figure, each point corresponds to a policy checkpoint evaluated under identical settings. We observe strong positive correlations, indicating that evaluation inside our world simulator faithfully reflects relative policy performance in the real world.
Long-horizon stability is one of the hardest problems for video-prediction world models because prediction errors accumulate over time. Our world model runs open-loop for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. The videos below show live interactions driven by a teleoperation device. Note that the videos on the monitor and on the left side come from open-loop live prediction; they are NOT from real cameras.
This work was partially supported by the Toyota Research Institute, the DARPA TIAMAT program (HR0011-24-9-0430), NSF Award #2409661, Samsung Research America, and an Amazon Research Award (Fall 2024). This article solely reflects the opinions and conclusions of its authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.
We thank Wenlong Huang and Xiaochen Hu for valuable suggestions on the project release. We also thank Hongkai Dai, Basile Van Hoorick, Keyi Shen, Zach Witzel, Jaisel Singh, Binghao Huang, Kaifeng Zhang, Hanxiao Jiang, and other RoboPIL members for helpful discussions during the project.