Learning Robotic Video Dynamics with Heterogeneous Masked Autoregression

Abstract

We propose Heterogeneous Masked Autoregression (HMA) for modeling action-video dynamics, with the goal of generating high-quality data and evaluations for scaling robot learning. Building interactive video world models and policies for robotics is difficult because they must handle diverse settings while remaining computationally efficient enough to run in real time. HMA uses heterogeneous pre-training from observations and action sequences across different robotic embodiments, domains, and tasks. It uses masked autoregression to generate quantized or soft tokens for video prediction.

HMA achieves better visual fidelity and controllability than the previous state-of-the-art robotic video generation models while running 15x faster in the real world. After post-training, the model can be used as a video simulator driven by low-level action inputs, both for evaluating policies and for generating synthetic data.
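To make the idea of masked generation over quantized tokens concrete, here is a minimal sketch of a generic confidence-based iterative unmasking loop. It is not HMA's actual sampler: the cosine schedule, the `MASK` sentinel, the `toy_predictor` stand-in for a learned model, and all parameter names are assumptions introduced purely for illustration.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for a token that has not been generated yet


def toy_predictor(tokens, action, vocab_size, rng):
    """Stand-in for a learned model: returns a (token, confidence) pair per position.

    A real model would condition on past frames and the action; this one draws
    random guesses so that the decoding loop is runnable end to end.
    """
    preds = []
    for _ in tokens:
        tok = (rng.randrange(vocab_size) + action) % vocab_size
        conf = rng.random()
        preds.append((tok, conf))
    return preds


def masked_decode(num_tokens, action, vocab_size=16, steps=4, seed=0):
    """Fill a fully masked token sequence over a fixed number of steps."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_predictor(tokens, action, vocab_size, rng)
        # Cosine schedule (an assumption): how many tokens stay masked after this step.
        keep_masked = int(num_tokens * math.cos(math.pi / 2 * step / steps))
        if step == steps:
            keep_masked = 0  # the final step must reveal everything that is left
        num_to_reveal = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions among the still-masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:num_to_reveal]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(masked_decode(num_tokens=16, action=3))
```

Because several tokens are committed per step, the number of model calls is set by `steps` rather than by the number of tokens, which is the usual reason masked decoding is attractive when latency matters.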

TL;DR: HMA is a real-time robotic video simulator for high-fidelity, controllable interactions, built on general masked autoregressive dynamics models and heterogeneous training.

HMA can simulate real-world interactive behaviors in real time in response to user actions.

HMA can simulate high-fidelity physical interactions, such as pushing blocks, including the resulting shadows.

Trained on only 200 trajectories, HMA can roll out over 100 frames for synthetic data generation and policy evaluation.
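Using a learned video model as a simulator for policy evaluation amounts to a closed loop: the policy reads the latest predicted frame, emits an action, and the model predicts the next frame from recent context. The sketch below illustrates only that loop structure. `toy_world_model`, `toy_policy`, the `Frame` and `Action` types, and the `context_len` parameter are hypothetical placeholders and are not part of HMA.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # stand-in for an image or a grid of video tokens
Action = float       # stand-in for a low-level action vector


def toy_world_model(context: Sequence[Frame], action: Action) -> Frame:
    """Hypothetical dynamics: shift every value of the latest frame by the action."""
    return [v + action for v in context[-1]]


def toy_policy(frame: Frame) -> Action:
    """Hypothetical policy: nudge the first value toward a target of 1.0."""
    return 0.1 * (1.0 - frame[0])


def rollout(
    model: Callable[[Sequence[Frame], Action], Frame],
    policy: Callable[[Frame], Action],
    first_frame: Frame,
    horizon: int = 100,
    context_len: int = 4,
) -> List[Frame]:
    """Alternate policy and model calls, feeding predictions back in as context."""
    frames = [list(first_frame)]
    for _ in range(horizon):
        action = policy(frames[-1])       # the policy sees only predicted frames
        context = frames[-context_len:]   # sliding window of recent frames
        frames.append(model(context, action))
    return frames


if __name__ == "__main__":
    frames = rollout(toy_world_model, toy_policy, [0.0, 0.5], horizon=100)
    # A task-specific success check would go here; the toy task just tracks a target.
    print(len(frames), abs(frames[-1][0] - 1.0) < 1e-3)
```

In a real setup the frames collected by such a loop could also be stored alongside the actions as synthetic training trajectories.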

HMA Framework

HMA Architecture

HMA Dynamics

Scaling Experiments