Interactive World Simulator
for Robot Policy Training and Evaluation

1Columbia University  2Toyota Research Institute  3Amazon  4University of Illinois Urbana-Champaign

Interactive World Simulator supports stable long-horizon interactions of more than 10 minutes at 15 FPS, unlocking two critical robot learning applications: scalable data generation for policy training and faithful policy evaluation.

Interactive Demo

Our world model is an action-conditioned video prediction model: given an initial frame and a sequence of actions, it predicts future frames purely in pixel space, without any physics engine. Here you can interact with it directly. Select one of the four tasks below, click Connect, and use your keyboard to control the robot end-effectors. Note: the cloud server may be laggy or unavailable due to resource limits; for a smooth, low-latency experience, consider running the server locally and using Connect Locally.
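The prediction loop above can be sketched in a few lines. This is a minimal illustration of the autoregressive interface, not the released API: the class name, method name, and placeholder "dynamics" are all assumptions.

```python
# Minimal sketch of the interface described above: an action-conditioned
# video prediction model maps (current frame, action) -> next frame purely
# in pixel space, with no physics engine in the loop. The class, method,
# and identity "dynamics" below are illustrative assumptions, not the
# released API.
class WorldModel:
    def predict_next_frame(self, frame, action):
        # Placeholder: a real model would run a learned network here.
        return frame

def rollout(model, init_frame, actions):
    """Autoregressively roll out frames from an initial frame and an action sequence."""
    frames = [init_frame]
    for action in actions:
        frames.append(model.predict_next_frame(frames[-1], action))
    return frames

# One predicted frame per action, plus the initial frame.
frames = rollout(WorldModel(), init_frame="frame0", actions=[[0.0] * 7] * 5)
```

Because each prediction is fed back as the next input, errors can compound over long rollouts, which is exactly the stability challenge discussed in the 10-Minute Interaction section.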

[Interactive demo widget. Keyboard controls: Left Arm — Q/W/E/A/S/D (A/D: X-axis, W/S: Y-axis); Right Arm / Gripper — U/I/O/J/K/L (I/J/K/L: XY).]

TL;DR

  • Interactive World Simulator: World models that support stable long-horizon interactions for more than 10 minutes at 15 FPS.
  • Scalable Data Generation for Policy Training: Generate data with quality comparable to real-world data for policy training.
  • Faithful Rollout for Policy Evaluation: Our interactive world simulator serves as a faithful evaluation engine for policies, demonstrating strong correlation between world-model and real-world performance.
  • Walkthrough Video

    Interactive World Model

    Our Interactive World Simulator supports diverse robot manipulation tasks, ranging from rigid-body contact (T pushing) to deformable object manipulation (rope routing) and object pile manipulation (sweeping objects into a bin). Each video below shows a live interaction driven by a teleoperation device. Note that the videos on the monitor and on the left side are open-loop live predictions, NOT real camera feeds: every frame is generated entirely by our interactive world simulator in response to the action sequence.

    T Pushing (Rigid Object)    

    Our interactive world simulator understands rigid interactions between the T and the grippers, learned purely from interaction data without any physics priors.

    Rope Routing (Deformable Object)    

    Deformable-rigid object interaction is challenging to model. Our model accurately captures both cases: when the rope is inserted into the clip, and when the rope swings outside without making real contact. It does not hallucinate or bias toward either case; instead, it faithfully captures the complex dynamics of both scenarios.

    Mug Grasping (Fine-Grained Dynamics for Rigid Object)    

    Our model can understand very fine-grained dynamics of rigid object interaction, such as the mug sliding out of the gripper and the mug handle being pushed and rotated.

    Pile Sweeping (Object Pile)    

    Our model can capture the complex dynamics of object pile manipulation, such as the object pile being swept into the bin.

    Scalable Data Generation for Policy Training

    Collecting real-world robot demonstrations is expensive and time-consuming. Our world simulator enables data collection using only teleoperation devices. But the question is: can generated demonstrations actually substitute for real-world demonstrations when training imitation learning policies? Our answer is YES!

    We train imitation learning policies using 100% generated data and 0% real data. The following videos show autonomous policy rollouts in the real world: policies trained purely on simulation data remain robust under continuous perturbations.

    T Pushing

    Rope Routing

    Pile Sweeping

    Mug Grasping



    To quantitatively study the quality of the generated data, we benchmark four imitation learning policies (DP, ACT, π0, and π0.5) across five manipulation tasks: T pushing (in MuJoCo simulation and in the real world), pile sweeping, mug grasping, and rope routing. For all experiments, we fix the total dataset size at 100 demonstration episodes and vary the proportion of generated versus real-world data in the training set, sweeping the full spectrum from 0% to 100% world-model data. Performance remains comparable across the entire mixture spectrum, indicating that our world-model data matches real-world data in quality for imitation policy training.
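The mixture sweep above can be sketched as follows. This is an illustrative reconstruction of the experimental setup, not the actual training pipeline; episode contents and the helper name are placeholders.

```python
import random

# Illustrative sketch of the data-mixture sweep described above: fix the
# training set at 100 episodes and vary the fraction drawn from
# world-model-generated data from 0% to 100%. Episodes here are dummy
# (source, index) tuples, not real demonstrations.
def make_mixture(real_episodes, generated_episodes, gen_fraction, total=100, seed=0):
    rng = random.Random(seed)
    n_gen = round(total * gen_fraction)
    mix = rng.sample(generated_episodes, n_gen) + rng.sample(real_episodes, total - n_gen)
    rng.shuffle(mix)
    return mix

real = [("real", i) for i in range(100)]
gen = [("gen", i) for i in range(100)]

# Sweep the full spectrum; each mixture would train one policy per method.
mixtures = {frac: make_mixture(real, gen, frac) for frac in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

Holding the total episode count fixed isolates data quality from data quantity: any performance gap across mixtures is attributable to the source of the data, not its amount.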



    Faithful Policy Evaluation

    Our world model can serve as a scalable, reproducible, and faithful policy evaluator. The qualitative videos below show that a good policy that succeeds in our world simulator also succeeds on the real robot, while a bad policy that fails in the world model also fails in the real world. Good and bad policies alike behave similarly in the world model and in the real world, demonstrating a strong correlation between the two.


    T Pushing (Bad Policy)

    Real World

    World Model

    T Pushing (Good Policy)

    Real World

    World Model

    Rope Routing (Bad Policy)

    Real World

    World Model

    Rope Routing (Good Policy)

    Real World

    World Model

    Pile Sweeping (Bad Policy)

    Real World

    World Model

    Pile Sweeping (Good Policy)

    Real World

    World Model

    Mug Grasping (Bad Policy)

    Real World

    World Model

    Mug Grasping (Good Policy)

    Real World

    World Model

    To quantitatively verify this, we evaluate four policies (DP, ACT, π0, π0.5) across four tasks, measuring task scores both inside the world model and on the real robot. Comparisons are fair: every policy and checkpoint is evaluated under identical conditions, with the same initial configurations in the real world and in the world model. In the following figure, each point corresponds to one policy checkpoint. We observe strong positive correlations, indicating that evaluation within our world simulator faithfully reflects relative policy performance in the real world.

    10-Minute Interaction

    Long-horizon stability is one of the hardest problems for video-prediction world models due to accumulated errors. Our world model runs open-loop for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. The videos below show live interaction from a teleoperation device. Note that videos on the monitor and the left side are from open-loop live prediction. They are NOT from real cameras.
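To make the scale of this claim concrete, a quick back-of-envelope check (simple arithmetic, no assumptions beyond the numbers quoted above):

```python
# Back-of-envelope check of the open-loop budget quoted above: 10 minutes
# at 15 FPS means roughly 9,000 autoregressive prediction steps, each of
# which must complete within ~66.7 ms to sustain real-time interaction.
fps = 15
minutes = 10

total_frames = fps * 60 * minutes     # frames the model must stay stable for
per_frame_budget_ms = 1000 / fps      # wall-clock budget per predicted frame
```

Each of those ~9,000 frames is conditioned on the model's own previous output, so any small prediction error has thousands of opportunities to compound, which is why long-horizon open-loop stability is the hard part.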

    T Pushing

    Rope Routing

    Mug Grasping

    Pile Sweeping

    Acknowledgements

    This work was partially supported by the Toyota Research Institute, the DARPA TIAMAT program (HR0011-24-9-0430), NSF Award #2409661, Samsung Research America, and an Amazon Research Award (Fall 2024). This article solely reflects the opinions and conclusions of its authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.

    We thank Wenlong Huang and Xiaochen Hu for valuable suggestions on the project release. We also thank Hongkai Dai, Basile Van Hoorick, Keyi Shen, Zach Witzel, Jaisel Singh, Binghao Huang, Kaifeng Zhang, Hanxiao Jiang, and other RoboPIL members for helpful discussions during the project.