Our world model is an action-conditioned video prediction model: given the initial frame and a sequence of actions, it predicts future frames purely in pixel space, without any physics engine. Here you can interact with it directly. Select one of the four tasks below, click Connect, and use your keyboard to control the robot end-effectors. Note: the cloud server may be laggy or unavailable due to resource limits. For a smooth, low-latency experience, consider running the server locally and using Connect Locally.
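To make the interface concrete, the interaction loop can be sketched as follows. This is a minimal, illustrative sketch: the class and function names are hypothetical, and the placeholder `step` just perturbs pixels, standing in for the learned network that predicts the next frame in pixel space.

```python
import numpy as np

class ActionConditionedWorldModel:
    """Illustrative stand-in for an action-conditioned video predictor.

    The real model is a learned network; this placeholder applies a
    trivial per-action brightness shift so the interface is runnable.
    """

    def step(self, frame: np.ndarray, action: np.ndarray) -> np.ndarray:
        # A learned model would predict the next frame in pixel space.
        return np.clip(frame + action.mean(), 0.0, 1.0)

def rollout(model, initial_frame, actions):
    """Autoregressively predict frames from an initial frame and actions."""
    frames = [initial_frame]
    for a in actions:
        frames.append(model.step(frames[-1], a))
    return frames

# Example: 8 small actions applied to an initial 64x64 RGB frame.
model = ActionConditionedWorldModel()
frame0 = np.zeros((64, 64, 3))
actions = [np.full(2, 0.01) for _ in range(8)]
video = rollout(model, frame0, actions)
print(len(video))  # 9 frames: the initial frame plus one per action
```

The key point of the sketch is the autoregressive structure: each predicted frame is fed back as input for the next step, so no physics engine or real camera is involved after the initial frame.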
Our Interactive World Simulator supports diverse robot manipulation tasks, ranging from rigid-body contact (T pushing) to deformable object manipulation (rope routing) and object pile manipulation (sweeping objects into a bin). Each video below shows a live interaction driven by a teleoperation device. Note that the videos on the monitor and on the left side come from open-loop live prediction. They are NOT from real cameras: the frames are generated entirely by our interactive world simulator in response to the action sequences.
Collecting real-world robot demonstrations is expensive and time-consuming. Our world simulator enables data collection using only teleoperation devices. But the question is: can generated demonstrations actually substitute for real-world demonstrations when training imitation learning policies? Our answer is YES!
We train imitation learning policies using 100% generated data and 0% real data. The following videos show autonomous imitation policy rollouts in the real world. They demonstrate that policies trained purely on generated data transfer to the real world and remain robust under continuous perturbations.
To quantitatively study the generated data quality, we benchmark four imitation learning policies, including DP, ACT, π0, and π0.5, across five manipulation settings: T pushing in MuJoCo simulation, T pushing in the real world, pile sweeping, mug grasping, and rope routing. For all experiments, we fix the total dataset size at 100 demonstration episodes and vary the proportion of generated versus real-world data in the training set, sweeping the full spectrum from 0% to 100% world-model data. Performance remains comparable across the entire mixture spectrum, indicating that our world-model data matches real-world data in quality for imitation policy training.
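The mixture sweep above can be described with a small helper. This is an illustrative sketch, not our data pipeline; the function name, episode identifiers, and pool sizes are hypothetical.

```python
import random

def mix_dataset(real_episodes, generated_episodes,
                generated_fraction, total=100, seed=0):
    """Build a fixed-size training set with a given fraction of
    world-model-generated episodes; the rest are real episodes."""
    rng = random.Random(seed)
    n_gen = round(total * generated_fraction)
    n_real = total - n_gen
    gen = rng.sample(generated_episodes, n_gen)
    real = rng.sample(real_episodes, n_real)
    return gen + real

# Hypothetical episode pools.
real = [f"real_{i}" for i in range(200)]
gen = [f"gen_{i}" for i in range(200)]

# Sweep the mixture from 0% to 100% generated data at fixed size 100.
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    ds = mix_dataset(real, gen, frac)
    print(frac, len(ds))  # total size stays 100 at every mixture ratio
```

Holding the total episode count fixed is what isolates data *quality* from data *quantity*: any performance difference across the sweep is attributable to the source of the demonstrations, not the amount.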
Our world model can serve as a scalable, reproducible, and faithful policy evaluator. The following qualitative videos show that a policy that succeeds in our world simulator also succeeds on the real robot, while a policy that fails in the world model also fails in the real world. Across both good and bad policies, behavior in the world model closely mirrors behavior in the real world, demonstrating a strong correlation between the two.
Real World
World Model
Real World
World Model
Real World
World Model
Real World
World Model
Real World
World Model
Real World
World Model
Real World
World Model
Real World
World Model
To quantitatively verify this, we evaluate four policies (DP, ACT, π0, π0.5) across four tasks, measuring task scores both inside the world model and on the real robot, under identical conditions for all policies and checkpoints. To keep comparisons fair, we use the same initial configurations in the real world and in the world model. In the following figure, each point corresponds to a policy checkpoint evaluated under identical settings. We observe strong positive correlations, indicating that evaluation inside our world simulator faithfully reflects relative policy performance in the real world.
Long-horizon stability is one of the hardest problems for video-prediction world models because prediction errors accumulate over time. Our world model runs open-loop for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. The videos below show live interactions driven by a teleoperation device. Note that the videos on the monitor and on the left side come from open-loop live prediction; they are NOT from real cameras.
This work was partially supported by the Toyota Research Institute, the DARPA TIAMAT program (HR0011-24-9-0430), NSF Award #2409661, Samsung Research America, and an Amazon Research Award (Fall 2024). This article solely reflects the opinions and conclusions of its authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.
We thank Wenlong Huang and Xiaochen Hu for valuable suggestions on the project release. We also thank Hongkai Dai, Basile Van Hoorick, Keyi Shen, Zach Witzel, Jaisel Singh, Binghao Huang, Kaifeng Zhang, Hanxiao Jiang, and other RoboPIL members for helpful discussions during the project.