Recent Vision-Language-Action (VLA) models generalize across embodiments but struggle to quickly align with a new robot's action space when high-quality demonstrations are scarce, especially for bipedal humanoids. We present TrajBooster, a cross-embodiment framework that leverages abundant wheeled-humanoid data to boost bipedal VLA. Our key idea is to use end-effector trajectories as a morphology-agnostic interface. TrajBooster (i) extracts 6D dual-arm end-effector trajectories from real-world wheeled humanoids, (ii) retargets them in simulation to the Unitree G1 with a whole-body controller, trained via heuristic-enhanced harmonized online DAgger, that lifts low-dimensional trajectory references into feasible high-dimensional whole-body actions, and (iii) forms heterogeneous triplets that couple source vision/language with target humanoid-compatible actions to post-pre-train a VLA, followed by only 10 minutes of teleoperation data collection in the target humanoid domain. Deployed on the Unitree G1, our policy performs beyond-tabletop household tasks, enabling squatting, cross-height manipulation, and coordinated whole-body motion with markedly improved robustness and generalization. Results show that TrajBooster allows existing wheeled-humanoid data to efficiently strengthen bipedal humanoid VLA performance, reducing reliance on costly same-embodiment data while enhancing action-space understanding and zero-shot skill transfer.
The VLA after post-pre-training (PPT) demonstrates accelerated adaptation to the humanoid action space during the post-training phase.
Task: Picking a Mickey Mouse toy placed on the humanoid's right side, a configuration unseen in teleoperation data (which only used front placements).
Left: With post-pre-training, the VLA adapts to grasp from below.
Right: Without post-pre-training, the VLA mimics teleoperated motions, approaching from above.
Zero-Shot Execution of "Pass Water" Task.
Left: First-person view examples of the Pass Water task from the Agibot-World dataset.
Right: The post-pre-trained VLA successfully executes this task despite receiving no fine-tuning on real-world teleoperation data for this specific task.
The TrajBooster pipeline comprises three stages:
Real Trajectory Extraction: the process begins with extracting end-effector trajectories from source robots. Rather than directly mapping full-body motions to the target humanoid, TrajBooster uses the 6D poses of the dual-arm end-effectors as the goal, enabling a retargeting model in the Isaac Gym simulator to achieve whole-body motion retargeting by tracking this goal.
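For illustration, a minimal sketch of this extraction step, assuming the source robot logs joint angles and exposes a forward-kinematics routine; the `fk` and `joint_log` interfaces and the position-plus-axis-angle 6D convention are our assumptions, not the paper's released tooling:

```python
# Minimal sketch of 6D end-effector trajectory extraction (illustrative).
# `fk` and the joint-log layout are hypothetical; the theta ~ pi singularity
# of the axis-angle conversion is ignored for brevity.
import numpy as np

def rotmat_to_axis_angle(R: np.ndarray) -> np.ndarray:
    """Convert a 3x3 rotation matrix to an axis-angle 3-vector."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(angle, 0.0):
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(angle))
    return angle * axis

def pose_to_6d(T: np.ndarray) -> np.ndarray:
    """Flatten a 4x4 homogeneous transform to [x, y, z, rx, ry, rz]."""
    return np.concatenate([T[:3, 3], rotmat_to_axis_angle(T[:3, :3])])

def extract_dual_arm_trajectory(joint_log: np.ndarray, fk) -> np.ndarray:
    """Map a (T, n_joints) joint log to a (T, 2, 6) dual-arm EE trajectory.

    `fk(q)` is assumed to return the left and right wrist transforms
    (two 4x4 matrices) for one joint configuration `q`.
    """
    return np.stack([np.stack([pose_to_6d(Tl), pose_to_6d(Tr)])
                     for Tl, Tr in (fk(q) for q in joint_log)])
```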
Retargeting in Simulation: the retargeting model is trained with our heuristic-enhanced harmonized online DAgger algorithm so that the target humanoid, Unitree G1, tracks these reference trajectories using whole-body control. Through this process, the humanoid learns to coordinate its joints so that its end-effectors follow the retargeted goals, effectively mapping low-dimensional reference signals into feasible high-dimensional actions. This stage generates a large volume of action data compatible with the morphology of the real-world target humanoid.
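The heuristic-enhanced harmonized variant is specific to our method and is not reproduced here, but the generic online-DAgger skeleton it builds on can be sketched as follows: roll out a mixture of expert and learner actions, aggregate expert labels on the visited states, and regress the policy onto the aggregate. `env`, `expert`, and `policy` are hypothetical placeholders; the observation would include the retargeted 6D end-effector references.

```python
# Simplified online-DAgger training loop (a sketch, not the paper's
# heuristic-enhanced harmonized variant). `env`, `expert`, and `policy`
# are hypothetical placeholders.
import torch
import torch.nn.functional as F

def online_dagger(env, expert, policy, optimizer, n_iters=100, beta0=0.9):
    dataset = []  # aggregated (observation, expert action) pairs
    for i in range(n_iters):
        beta = beta0 ** i  # probability of executing the expert's action
        obs, done = env.reset(), False
        while not done:
            with torch.no_grad():
                expert_act = expert(obs)       # privileged/heuristic teacher
                learner_act = policy(obs)
            dataset.append((obs, expert_act))  # always label with the expert
            act = expert_act if torch.rand(()).item() < beta else learner_act
            obs, done = env.step(act)
        for obs_i, target in dataset:          # supervised regression pass
            loss = F.mse_loss(policy(obs_i), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```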
Fine-tuning for the real humanoid: using this newly generated data, TrajBooster constructs heterogeneous triplets of the form <source vision, source language, target action>, which link perceptual inputs with humanoid-compatible behaviors. The resulting synthetic dataset is then used to post-pre-train an existing VLA model. With just 10 minutes of additional real-world teleoperation data <target vision, target language, target action>, the post-pre-trained VLA is fine-tuned and deployed on the Unitree G1 across a wide spectrum of whole-body manipulation tasks.
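A schematic of how such triplets could be assembled; the field names and episode layout are illustrative assumptions, not the released data format:

```python
# Sketch of assembling heterogeneous triplets for post-pre-training.
# Field names and the episode layout are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class Triplet:
    source_vision: np.ndarray   # RGB frames from the wheeled-humanoid episode
    source_language: str        # task instruction from the source dataset
    target_action: np.ndarray   # retargeted G1 whole-body action sequence

def build_triplets(source_episodes, retargeted_actions):
    """Pair each source episode's vision/language with the action
    sequence produced by the simulation retargeting stage."""
    return [Triplet(ep["frames"], ep["instruction"], acts)
            for ep, acts in zip(source_episodes, retargeted_actions)]
```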
The failures may stem from insufficient depth-perception capability. Even with the addition of wrist cameras, the model still cannot accurately fold clothes. We believe another critical reason the wrist cameras are ineffective is that the wrist-camera views in our retargeted data do not match the camera-to-object distances and camera parameters of our actual robot setup.
This represents the two-stage training approach mentioned earlier, built upon the base model: the first stage performs post-pre-training with retargeted data, and the second stage fine-tunes on our teleoperated data for 3,000 steps. Compared to Scenario 4, the robot crouches much lower and spreads both arms wide.
This shows a single-stage fine-tuning approach on the base model, mixing retargeted data with real teleoperated data. Training was conducted for 1 epoch, totaling 60,000 steps. The left hand attempts to assist, but the crouch is insufficient. Both hands exhibit upward turning motions, showing folding-like behavior. However, due to incorrect depth estimation, the task ultimately fails.
Building upon Scenario 2, this version includes an additional 3,000 steps of training with real teleoperated data. Here, the right hand attempts to grasp the clothes.
This demonstrates the effect of direct fine-tuning without retargeted data, trained for 10,000 steps (since we previously found that 3,000 steps yielded a 0% success rate). The model shows virtually no tendency toward clothes folding, with both hands remaining closed. The arm trajectory in the first half of the video resembles reaching directly for objects, similar to our teleoperated grasping motions for the Mickey Mouse toy. However, since the model does not see a Mickey Mouse toy at the target location, it does not attempt a grasp (likely a consequence of freezing the backbone while fine-tuning GR00T N1.5).
Website modified from TWIST.