Video panels: BC-Z, UCSD Pick and Place, UCSD Kitchen, Tokyo, Dobb-E, Bridge, Austin Sirius, Push-T



Learning Robotic Video Dynamics with Heterogeneous Masked Autoregression

MIT, UIUC, FAIR Meta

TL;DR: HMA is a real-time robotic video simulator for high-fidelity and controllable interactions, leveraging general masked autoregressive dynamics models and heterogeneous training.

HMA can simulate real-world interactive behaviors in real time in response to user actions.

Demo: We recommend downloading the code and playing with the demos.



HMA can simulate high-fidelity physical interactions, such as pushing blocks, as well as their shadows.
Trained on only 200 trajectories, HMA can roll out over 100 frames for synthetic data generation and policy evaluation.

Figure: Success cases are displayed on the left, while failure cases are shown on the right. Videos demonstrate outcomes for comparison.



Abstract

We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics to generate high-quality data and support evaluation when scaling robot learning. Building interactive video world models and policies for robotics is difficult due to the challenge of handling diverse settings while remaining computationally efficient enough to run in real time. HMA uses heterogeneous pre-training from observations and action sequences across different robotic embodiments, domains, and tasks. Masked autoregression is used to generate quantized or soft tokens for video predictions.

HMA achieves better visual fidelity and controllability than the previous state-of-the-art robotic video generation models while running 15x faster in the real world. After post-training, the model can be used as a video simulator driven by low-level action inputs for evaluating policies and generating synthetic data.
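To make the "video simulator for policy evaluation" idea concrete, the following is a minimal sketch of a closed-loop rollout in which a policy picks a low-level action, a dynamics model predicts the next frame, and success is judged on the final frame. Everything here is a hypothetical stand-in: `toy_world_model`, `toy_policy`, and `evaluate_policy` are illustrative names, frames are short lists of floats rather than images, and none of this reflects the actual HMA code or API.

```python
def toy_world_model(frame, action):
    """Hypothetical stand-in for a learned video model:
    predicts the next frame from the current frame and a scalar action."""
    return [pixel + action for pixel in frame]


def toy_policy(frame, goal):
    """Hypothetical policy: move the first pixel toward the goal,
    with the action clipped to the range [-1, 1]."""
    error = goal - frame[0]
    return max(-1.0, min(1.0, error))


def evaluate_policy(initial_frame, goal, horizon=100, tolerance=1e-6):
    """Roll the policy out inside the simulator for `horizon` steps.

    Returns (success, frames), where frames includes the initial frame.
    """
    frame = list(initial_frame)
    frames = [frame]
    for _ in range(horizon):
        action = toy_policy(frame, goal)        # policy reads the predicted frame
        frame = toy_world_model(frame, action)  # simulator predicts the next frame
        frames.append(frame)
    success = abs(frame[0] - goal) <= tolerance
    return success, frames


if __name__ == "__main__":
    ok, rollout = evaluate_policy([0.0, 0.0], goal=5.0, horizon=100)
    print("success:", ok, "frames generated:", len(rollout))
```

The same loop can double as a synthetic data generator: the recorded `frames`, paired with the actions that produced them, form a trajectory that never touched a real robot.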

HMA Framework

HMA pretrains video dynamics models on heterogeneous data spanning over 40 datasets and 3 million trajectories from real-robot teleoperation, human videos, and simulation. The pretrained model can be post-trained for applications such as video simulation, policy evaluation, and synthetic data generation.

HMA Architecture

The HMA network architecture incorporates token concatenation and modulation for action-conditioned masked autoregressive video and action generation.
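The two conditioning mechanisms named above can be illustrated with a small sketch. It assumes that "token concatenation" means appending action tokens to the video token sequence, and that "modulation" means a per-channel scale and shift derived from the action (similar in spirit to feature-wise modulation used elsewhere). Both readings are assumptions, and `concat_condition`, `action_to_scale_shift`, and `modulate` are hypothetical helpers, not functions from the HMA codebase.

```python
def concat_condition(video_tokens, action_tokens):
    """Token concatenation: action tokens join the sequence
    that the network processes alongside the video tokens."""
    return video_tokens + action_tokens


def action_to_scale_shift(action, dim):
    """Hypothetical projection from an action vector to per-channel
    (scale, shift) values. A real model would learn this mapping."""
    total = sum(action)
    scale = [0.1 * total] * dim
    shift = [0.01 * total] * dim
    return scale, shift


def modulate(token, scale, shift):
    """Feature-wise modulation: each channel is scaled and shifted
    by values computed from the action."""
    return [x * (1.0 + s) + b for x, s, b in zip(token, scale, shift)]


if __name__ == "__main__":
    video_tokens = [[1.0, 2.0], [3.0, 4.0]]  # two tokens, two channels each
    action = [0.5, 0.5]

    # Option 1: condition by extending the sequence with an action token.
    sequence = concat_condition(video_tokens, [action])
    print(len(sequence))

    # Option 2: condition by modulating every video token's features.
    scale, shift = action_to_scale_shift(action, dim=2)
    print([modulate(tok, scale, shift) for tok in video_tokens])
```

Concatenation lets the model attend to the action like any other token, while modulation injects the action into every token's features directly. The two can be combined.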

HMA Dynamics

The HMA formulation uses masked autoregression to efficiently model both action-conditioned video prediction and policy action prediction.
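As a rough illustration of masked autoregressive generation, the sketch below starts from a fully masked frame of discrete tokens and fills it in over a few steps, committing the most confident predictions first. The cosine masking schedule, the confidence-based ordering, and the `toy_predictor` network stand-in are all assumptions chosen for illustration. The sketch covers only the quantized-token case and is not the HMA implementation.

```python
import math
import random

MASK = -1  # sentinel for a token position that has not been generated yet


def cosine_mask_ratio(step, num_steps):
    """Fraction of tokens that should remain masked after `step` of `num_steps`."""
    return math.cos(math.pi / 2 * step / num_steps)


def toy_predictor(tokens, action, vocab_size, rng):
    """Hypothetical stand-in for the network: returns one
    (token, confidence) pair per position, conditioned on the action."""
    preds = []
    for i in range(len(tokens)):
        token = (i + action) % vocab_size  # deterministic toy prediction
        preds.append((token, rng.random()))  # random toy confidence
    return preds


def masked_autoregressive_decode(num_tokens, action, vocab_size=16,
                                 num_steps=4, seed=0):
    """Generate all tokens of one frame in `num_steps` parallel unmasking steps."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, num_steps + 1):
        preds = toy_predictor(tokens, action, vocab_size, rng)
        # Number of positions that must still be masked after this step.
        keep_masked = int(math.floor(cosine_mask_ratio(step, num_steps) * num_tokens))
        masked_positions = [i for i, t in enumerate(tokens) if t == MASK]
        num_to_fill = len(masked_positions) - keep_masked
        # Commit the most confident predictions among the masked positions.
        masked_positions.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked_positions[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(masked_autoregressive_decode(num_tokens=8, action=3))
```

Because several tokens are committed per step, a frame of `num_tokens` tokens needs only `num_steps` network calls instead of one call per token. Fewer network calls per frame is one reason a decoder of this kind can run fast.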

Scaling Experiments

HMA shows objective scaling behaviors across multiple axes during training including datasets, model sizes, and number of trajectories.