Pre-Training Robo-Centric World Models For Efficient Visual Control

Robo-Centric World Model


Abstract

Humans can accurately anticipate their own movements and thus behave as expected across a wide range of manipulation tasks. Inspired by this, we propose that integrating prior knowledge of robot dynamics into world models can effectively improve the sample efficiency of model-based reinforcement learning (MBRL) in visual robot control tasks. In this paper, we introduce the Robo-Centric World Model (RCWM), which explicitly decouples the robot dynamics from the environment and enables pre-training to learn generalized and robust robot dynamics as prior knowledge that accelerates learning new tasks. Specifically, we construct separate dynamics models for the robot and the environment and learn their interactions through a cross-attention mechanism. With the mask-guided reconfiguration mechanism, only a few prior robot segmentation masks are needed to guide RCWM to disentangle robot and environment features and learn their respective dynamics. Our approach enables inference of robot dynamics independently of the environment, allowing accurate prediction of robot movement across various unseen tasks without distraction from environmental variations. Our results on Meta-World demonstrate that RCWM efficiently learns robot dynamics, improving sample efficiency on downstream tasks and enhancing policy robustness against environmental disturbances compared to the vanilla world model in DreamerV3.

Method

We introduce the Robo-Centric World Model (RCWM), which decouples the dynamics of the robot and the environment and learns their interaction via an interaction model based on a cross-attention mechanism. With RCWM, we can extract robot dynamics through pre-training on upstream tasks with the assistance of prior robot masks, and use this prior knowledge about the robot to improve sample efficiency on downstream tasks. RCWM offers several advantages: (1) Accurate prediction of robot dynamics. We find that explicitly modeling robot dynamics individually yields more accurate predictions than learning global dynamics with a single model. (2) Robustness against environmental disturbances. Due to implicit feature disentanglement, the robot branch is hardly affected by environmental disturbances, providing a robust robot representation for the policy. (3) Utilization of prior masks. The architecture of RCWM naturally accommodates robot segmentation masks, which enhance the prediction accuracy of robot dynamics.
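As a rough illustration of the interaction model, the sketch below shows a single cross-attention step in which the environment branch queries the robot branch. All names, shapes, token counts, and weight initializations here are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(env_query, robot_tokens, Wq, Wk, Wv):
    # Hypothetical interaction step: queries come from the environment latent,
    # keys/values from the robot latent, so the environment branch can condition
    # its prediction on the robot's predicted movement.
    q = env_query @ Wq            # (1, d)
    k = robot_tokens @ Wk         # (n, d)
    v = robot_tokens @ Wv         # (n, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v    # robot-conditioned context for the env dynamics

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
env_latent = rng.standard_normal((1, d))      # assumed single env state token
robot_latent = rng.standard_normal((4, d))    # assumed 4 robot feature tokens
ctx = cross_attention(env_latent, robot_latent, Wq, Wk, Wv)
```

In an actual world model this context would feed the environment branch's recurrent update; the sketch only shows the attention direction (environment attends to robot).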

Results

Sample Efficiency Improvement

Compared to the vanilla world models in DreamerV3 and TD-MPC2, the pre-trained RCWM can exploit its extracted prior knowledge of robot dynamics to further improve sample efficiency on downstream tasks.


Robustness Against Environmental Disturbances

When faced with disturbances caused by changes in the environment, RCWM is more robust than the vanilla world model and can still provide accurate robot state representations for the policy.


Robot Dynamics Learning

The pre-trained RCWM consistently delivers accurate predictions of robot movements in response to actions when applied to unseen downstream tasks. Even when environmental observations are replaced with random noise, the robot branch remains almost unaffected and continues to predict accurately.
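The decoupling property above can be sketched structurally: if the robot branch's update consumes only its own latent state and the action, then corrupting the environment observation cannot change the robot rollout. The recurrent update below is a hypothetical stand-in, not the model's real dynamics:

```python
import numpy as np

rng = np.random.default_rng(1)

def robot_step(h_robot, action, W_h, W_a):
    # Hypothetical recurrent update: the robot branch conditions only on its
    # own latent state and the action, never on environment features.
    return np.tanh(h_robot @ W_h + action @ W_a)

d, a_dim = 8, 4                               # assumed latent / action sizes
W_h = rng.standard_normal((d, d)) * 0.1
W_a = rng.standard_normal((a_dim, d)) * 0.1
actions = rng.standard_normal((5, 1, a_dim))  # a short action sequence

h = np.zeros((1, d))
for a in actions:
    h = robot_step(h, a, W_h, W_a)

# Because no environment input enters robot_step, replacing the environment
# observation with noise leaves this rollout bit-for-bit identical.
```

This is only a structural argument; in RCWM the independence additionally relies on the learned feature disentanglement keeping environment content out of the robot latent.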

Robot-Object Interaction Learning

The interaction model in RCWM effectively uses the cross-attention mechanism to capture how the predicted movement of the robot will affect the state of the environment. The environment branch accurately generates predictions that are consistent with the robot movement predicted by the robot branch.