V-Dreamer: Automating Robotic Simulation and Trajectory Synthesis via Video Generation Priors

Songjia He1,*, Zixuan Chen1,*, Hongyu Ding1, Dian Shao2, Jieqi Shi1, Chenxu Li1, Jing Huo1, Yang Gao1
1Nanjing University       2Northwestern Polytechnical University
*These authors contributed equally to this work.

Abstract

Training generalist robots demands large-scale, diverse manipulation data, yet real-world collection is prohibitively expensive, and existing simulators are often constrained by fixed asset libraries and manual heuristics. To bridge this gap, we present V-Dreamer, a fully automated framework that generates open-vocabulary, simulation-ready manipulation environments and executable expert trajectories directly from natural language instructions. V-Dreamer employs a novel generative pipeline that constructs physically grounded 3D scenes using large language models and 3D generative models, validated by geometric constraints to ensure stable, collision-free layouts. Crucially, for behavior synthesis, we leverage video generation models as rich motion priors. These visual predictions are then mapped into executable robot trajectories via a robust Sim-to-Gen visual-kinematic alignment module built on CoTracker3 and VGGT. This pipeline supports high visual diversity and physical fidelity without manual intervention. To evaluate the generated data, we train imitation learning policies on synthesized trajectories encompassing diverse object and environment variations. Extensive evaluations on tabletop manipulation tasks with the Piper robotic arm demonstrate that our policies generalize robustly to unseen objects in simulation and achieve effective sim-to-real transfer, successfully manipulating novel real-world objects.
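To make the geometric validation step concrete, below is a minimal sketch assuming an axis-aligned bounding-box (AABB) abstraction of each generated asset; the function names and the stability proxy are illustrative simplifications of ours, not V-Dreamer's actual implementation.

# Hypothetical sketch of collision-free layout validation using
# axis-aligned bounding boxes (AABBs). The real pipeline may use
# richer geometric constraints; all names here are illustrative.
from dataclasses import dataclass

@dataclass
class AABB:
    lo: tuple[float, float, float]  # min corner (x, y, z)
    hi: tuple[float, float, float]  # max corner (x, y, z)

def overlaps(a: AABB, b: AABB) -> bool:
    # Two AABBs intersect iff their extents overlap on every axis.
    return all(a.lo[i] < b.hi[i] and b.lo[i] < a.hi[i] for i in range(3))

def rests_on(obj: AABB, support: AABB, tol: float = 1e-3) -> bool:
    # Stability proxy: the object's bottom face touches the support's
    # top face, and its footprint stays inside the support region.
    touching = abs(obj.lo[2] - support.hi[2]) < tol
    inside = (support.lo[0] <= obj.lo[0] and obj.hi[0] <= support.hi[0]
              and support.lo[1] <= obj.lo[1] and obj.hi[1] <= support.hi[1])
    return touching and inside

def layout_is_valid(objects: list[AABB], table: AABB) -> bool:
    # Accept a layout only if every object is supported by the table
    # and no pair of objects collides.
    if not all(rests_on(o, table) for o in objects):
        return False
    return not any(overlaps(a, b)
                   for i, a in enumerate(objects)
                   for b in objects[i + 1:])

A generated layout would be resampled or repaired whenever layout_is_valid returns False, so that only stable, collision-free scenes reach the simulator.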

Method Pipeline

Figure: Overview of the V-Dreamer pipeline.
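To illustrate the visual-kinematic alignment idea behind Sim-to-Gen, the sketch below lifts 2D point tracks (such as those produced by CoTracker3) into 3D waypoints using per-frame depth and camera intrinsics (such as those recovered by VGGT). The array shapes and the interface are our assumptions, not the released API.

# Minimal sketch of the lifting step in visual-kinematic alignment:
# back-project each tracked 2D point into 3D camera-frame coordinates
# to obtain a Cartesian waypoint sequence. Shapes are assumptions.
import numpy as np

def backproject(uv: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift 2D pixel tracks to a 3D camera-frame trajectory.

    uv:    (T, 2) pixel coordinates of one tracked point over T frames
    depth: (T, H, W) per-frame metric depth maps
    K:     (3, 3) pinhole camera intrinsics
    returns (T, 3) camera-frame 3D points, one per frame
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = uv[:, 0], uv[:, 1]
    # Sample depth at each tracked pixel (row index v, column index u).
    z = depth[np.arange(len(uv)), v.astype(int), u.astype(int)]
    # Standard pinhole back-projection.
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.stack([x, y, z], axis=-1)

In a full pipeline, these camera-frame waypoints would then be transformed into the robot base frame and passed to an inverse-kinematics solver to recover executable joint-space trajectories for the Piper arm.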

BibTeX

@article{vdreamer,
  title  = {V-Dreamer: Automating Robotic Simulation and Trajectory Synthesis via Video Generation Priors},
  author = {He, Songjia and Chen, Zixuan and Ding, Hongyu and Shao, Dian and Shi, Jieqi and Li, Chenxu and Huo, Jing and Gao, Yang},
}