PhysWorld: From Real Videos to World Models of Deformable Objects via Physics-Aware Demonstration Synthesis

1 Harbin Institute of Technology, Harbin, China
2 Huawei Noah's Ark Lab

Overview of PhysWorld. The framework first constructs a physics-consistent digital twin from videos, then uses it to generate diverse 4D demonstrations, and finally trains a GNN-based world model for real-time future state prediction.

Abstract

Interactive world models that simulate object dynamics are crucial for robotics, VR, and AR. However, learning physics-consistent dynamics models from limited real-world video data remains a significant challenge, especially for deformable objects with spatially-varying physical properties. To overcome this data scarcity, we propose PhysWorld, a novel framework that uses a simulator to synthesize physically plausible and diverse demonstrations for learning efficient world models. Specifically, we first construct a physics-consistent digital twin within a material point method (MPM) simulator via constitutive model selection and global-to-local optimization of physical properties. We then apply part-aware perturbations to the physical properties and generate various motion patterns for the digital twin, synthesizing extensive and diverse demonstrations. Finally, we use these demonstrations to train a lightweight GNN-based world model that embeds the physical properties; the real video can then be used to further refine them. PhysWorld achieves accurate and fast future predictions for various deformable objects and also generalizes well to novel interactions. Experiments show that PhysWorld matches the performance of the recent state-of-the-art method PhysTwin while running inference 47 times faster. The code and pre-trained models will be publicly available.
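To make the learned world model concrete, below is a minimal sketch (not the authors' released code) of the kind of GNN-based dynamics step the abstract describes: particles carry spatially-varying physical properties as node features, messages are passed over a neighborhood graph, and predicted accelerations are integrated to roll out future states. The feature dimensions, property choices, and integrator are illustrative assumptions.

import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class GNNDynamics(nn.Module):
    """One message-passing step that predicts per-particle acceleration."""
    def __init__(self, node_dim=7, hidden=64):
        super().__init__()
        # Node features: velocity (3) + hypothetical per-particle physical
        # properties, e.g. stiffness, damping, density, mass (4).
        self.edge_mlp = mlp(2 * node_dim + 3, hidden, hidden)  # +3 for relative position
        self.node_mlp = mlp(node_dim + hidden, hidden, 3)      # outputs 3D acceleration

    def forward(self, pos, feat, edges):
        src, dst = edges                       # edge index arrays, shape (E,) each
        rel = pos[src] - pos[dst]              # relative particle positions
        msg = self.edge_mlp(torch.cat([feat[src], feat[dst], rel], dim=-1))
        agg = torch.zeros(pos.size(0), msg.size(-1)).index_add_(0, dst, msg)
        return self.node_mlp(torch.cat([feat, agg], dim=-1))

# Rollout: integrate predicted accelerations with semi-implicit Euler.
model = GNNDynamics()
pos, vel = torch.randn(100, 3), torch.zeros(100, 3)
props = torch.rand(100, 4)                     # spatially-varying properties (assumed)
edges = torch.randint(0, 100, (2, 400))        # random neighborhood graph for the demo
dt = 1e-2
for _ in range(5):
    acc = model(pos, torch.cat([vel, props], dim=-1), edges)
    vel = vel + dt * acc
    pos = pos + dt * vel

Because the property vector enters as an ordinary node feature, refining it against the real video (as the abstract mentions) only requires making that tensor learnable.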

Action-Conditioned Future Prediction

Visual results of action-conditioned future prediction.

Comparison with PhysTwin

More visual results of action-conditioned future prediction. Compared with PhysTwin, our method's predicted positions align more closely with the ground truth.


Quantitative results on action-conditioned future prediction and inference speed (FPS). Best metric values are bolded, and second-best ones are underlined. The metrics demonstrate the accuracy and efficiency of PhysWorld in both motion prediction and realistic image generation.

Unseen Interaction

Generalization to unseen interactions. As representative examples, we consider two unseen interaction scenarios: lifting a pushed rope and rotating a lifted sloth. The results show that PhysWorld generates physically plausible predictions, while PhysTwin suffers from artifacts such as fracture-like rope distortions and unnatural foot folding.

Model-Based Planning


Examples of model-based planning. With MPPI control, the rope and the zebra doll are driven from their initial configurations to the target ones.
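For reference, the sketch below shows the standard MPPI update used in this style of model-based planning: sample perturbed action sequences, roll each out through the dynamics model, and average the perturbations weighted by exponentiated negative cost. The cost function, horizon, and the toy single-particle "model" are illustrative assumptions, not the paper's exact planner.

import numpy as np

def mppi(rollout_cost, state, horizon=10, samples=256, act_dim=3,
         sigma=0.1, lam=1.0, mean=None):
    """One MPPI step: return a noise-weighted update of the action sequence."""
    mean = np.zeros((horizon, act_dim)) if mean is None else mean
    noise = sigma * np.random.randn(samples, horizon, act_dim)
    costs = np.array([rollout_cost(state, mean + noise[k]) for k in range(samples)])
    w = np.exp(-(costs - costs.min()) / lam)   # exponentiated-cost weights
    w /= w.sum()
    return mean + np.tensordot(w, noise, axes=1)

# Toy example: drive a point particle to a target position.
target = np.array([1.0, 0.5, 0.0])

def rollout_cost(state, actions, dt=0.1):
    pos = state.copy()
    for a in actions:                          # stand-in for the GNN world-model step
        pos = pos + dt * a
    return np.sum((pos - target) ** 2)

plan = mppi(rollout_cost, np.zeros(3))
print("first planned action:", plan[0])

In practice the first action of the plan is executed, the state is re-observed, and the loop repeats; the fast GNN world model is what makes sampling hundreds of rollouts per control step feasible.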

BibTeX

@misc{yang2025physworldrealvideosworld,
      title={PhysWorld: From Real Videos to World Models of Deformable Objects via Physics-Aware Demonstration Synthesis}, 
      author={Yu Yang and Zhilu Zhang and Xiang Zhang and Yihan Zeng and Hui Li and Wangmeng Zuo},
      year={2025},
      eprint={2510.21447},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.21447}, 
}